Elements: Infrastructure for Accurate Data Extraction from Research Papers Using Large Language Models

NSF

open

About This Grant

Technologies that are essential for modern life, from computers to jet engines, rely on advanced materials. Easily accessible data on materials properties is critical to allowing developers of technologies to determine what materials they need and to helping guide researchers in the development of new materials. However, materials data is typically shared through millions of scientific papers, making it difficult to develop a database of relevant properties. The recent advent of large language models (LLMs), like ChatGPT, now makes it possible to automate the complex task of reading thousands of papers and extracting key data. This project refines the use of LLMs to extract data and related knowledge from scientific papers, including from text, tables, and plots. The project then develops an easy-to-use web interface to allow people to apply these methods and quickly access large amounts of automatically curated materials data. If successful, it will allow entrepreneurs, engineers, and scientists to quickly extract specialized curated databases from the vast scientific literature and help accelerate technological developments across the many industries that use advanced materials. The platform allows users to extract structured materials data from text, tables, and plots, and to extract complex Processing-Structure-Property-Performance (PSPP) relationships. Designed for accessibility, the interface will require no expertise in LLMs or coding, making it a powerful tool for researchers across disciplines. The service will use state-of-the-art LLM text and image capabilities and integrate current LLM workflows, including prompt engineering, chain-of-thought reasoning, and retrieval-augmented generation, supported by a backend and frontend stack built on FastAPI, React.js, and Material-UI.
The Intellectual Merit of the project includes both the development of practical methods for accurate materials data extraction with LLMs and the development of online resources to deliver those methods to non-experts. The project develops methods to overcome limitations in existing data extraction techniques by leveraging LLMs' advanced capabilities in zero-shot learning, chain-of-thought reasoning, and multimodal data analysis. Key broader impacts include (i) advancing data-centric science by enabling rapid database creation, (ii) improving education by incorporating project outcomes into summer schools, conferences, and courses, (iii) increasing training of undergraduates in hands-on research that builds skills in coding, machine learning, and project management, and (iv) increasing training for graduate students in both materials science and advanced data science, preparing them for leadership in these critical fields. This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Graduate Education within the Directorate for STEM Education and the Division of Materials Research within the Directorate for Mathematical and Physical Sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Focus Areas

machine learning, engineering, education

Eligibility

university, nonprofit, small business

How to Apply

Funding Range

Up to $599K

Deadline

2028-08-31

Complexity
Medium
