FAQ: Using Large Language Models to Extract Experimental Data from Scientific Papers for Materials Science

By NewsRamp Editorial Team • January 9, 2026

TL;DR

NIMS researchers developed LLM tools to accelerate materials database construction, giving scientists a competitive edge in discovering new functional materials faster than traditional methods.

The Starrydata project uses LLMs to extract structured data from scientific papers, automating the conversion of complex information into organized databases for materials property analysis.

By digitizing and sharing experimental data globally, this research accelerates materials development for sustainable technologies, potentially improving energy efficiency and environmental solutions worldwide.

Researchers are using AI like ChatGPT to mine millions of scientific papers, transforming untapped experimental data into searchable databases that reveal hidden patterns in materials science.


The research focuses on using large language models (LLMs) to accelerate the construction of materials property databases by automatically extracting experimental data from scientific papers, specifically for the Starrydata project.

Millions of published papers contain valuable experimental data, collected by past researchers, that remains untapped. Building large-scale datasets from this data lets researchers draw inspiration from data overviews and predict material properties using machine learning.

By specifying a data structure and giving instructions to an LLM, the tools can accurately and comprehensively extract information about figures, tables, and samples from the text of paper PDFs across various fields.
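As a rough illustration of "specifying a data structure and giving instructions," the sketch below builds a prompt from a declared record layout and then filters a model's JSON reply against that layout. The `SampleRecord` fields, the prompt wording, and the helper names are assumptions made for this example; the article does not describe the actual Starrydata schema or prompts.

```python
import json
from dataclasses import dataclass, fields


# Hypothetical record layout; the real Starrydata fields are not given in the article.
@dataclass
class SampleRecord:
    composition: str = ""
    synthesis_method: str = ""
    measured_properties: str = ""


def build_prompt(paper_text: str) -> str:
    """Combine the target data structure and the instructions into one prompt."""
    field_names = [f.name for f in fields(SampleRecord)]
    return (
        "Extract one JSON object per sample described in the text below.\n"
        f"Use exactly these keys: {', '.join(field_names)}.\n"
        "Use an empty string when the text does not state a value.\n\n"
        f"Text:\n{paper_text}"
    )


def parse_response(raw_json: str) -> list[SampleRecord]:
    """Parse a JSON array reply, keeping only the declared keys."""
    allowed = {f.name for f in fields(SampleRecord)}
    records = []
    for item in json.loads(raw_json):
        # Drop any keys the model added that are not part of the schema.
        clean = {k: str(v) for k, v in item.items() if k in allowed}
        records.append(SampleRecord(**clean))
    return records
```

Declaring the structure once and reusing it for both the prompt and the parsing step keeps the two in sync, which is one plausible way to get consistent output across papers from different fields.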

A team led by Dr. Yukari Katsura, a Senior Researcher at the National Institute for Materials Science (NIMS) in Japan, developed the tools, and the work was published in the journal Science and Technology of Advanced Materials: Methods.

The first tool is Starrydata Auto-Suggestion for Sample Information, which reads paper text and suggests candidate entries for data fields. The second is Starrydata Auto-Summary GPT, which deconstructs entire open-access paper PDFs and automatically summarizes them.

Many publishers prohibit the use of artificial intelligence on paper PDFs, so the system is currently being developed to target open-access papers specifically.

When a user pastes text from a paper's abstract or experimental methods section into the Starrydata2 web system, the text is sent to OpenAI's GPT via API, and candidate entries in English are automatically displayed below each input field.
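The suggestion flow described above might look something like the following sketch. It is not the Starrydata2 implementation: the field names, the prompt, and the `call_llm` parameter are hypothetical, and the model call is replaced by an offline stand-in (`fake_llm`) so the example runs without an API key.

```python
import json
from typing import Callable, Dict, List

# Hypothetical input fields; the actual Starrydata2 form fields are not listed in the article.
FORM_FIELDS = ["sample_name", "composition", "synthesis_method"]


def suggest_candidates(
    pasted_text: str, call_llm: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Ask the model for candidates and map them onto each form field."""
    prompt = (
        "For each field, list candidate entries in English found in the text. "
        f"Reply as a JSON object with keys {FORM_FIELDS}, "
        "each holding a list of strings.\n\n"
        f"{pasted_text}"
    )
    reply = json.loads(call_llm(prompt))
    # Every field gets a list, even if the model omitted it.
    return {name: [str(c) for c in reply.get(name, [])] for name in FORM_FIELDS}


def fake_llm(prompt: str) -> str:
    """Stand-in for the real API call so the sketch runs offline."""
    return json.dumps(
        {"composition": ["Bi2Te3"], "synthesis_method": ["spark plasma sintering"]}
    )


if __name__ == "__main__":
    text = "Bi2Te3 samples were prepared by spark plasma sintering."
    for field, candidates in suggest_candidates(text, fake_llm).items():
        print(f"{field}: {candidates}")
```

Passing the model call in as a function keeps the field-mapping logic testable on its own; in a real deployment `call_llm` would wrap whichever API client the system uses.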

This approach automates the conversion of complex information sources such as scientific papers into structured data, a task that was previously done manually. As a result, data collection speeds up and unprecedented volumes of experimental data can be amassed.

Starrydata is a materials property database built from data collected from scientific papers. It was launched in 2015 by Prof. Katsura, and data collection was initially performed manually using the Starrydata2 web system.

Curated from NewMediaWire
