III: Medium: SMARTCAT: Developing Smart Data Catalogs for Data Science and AI
NSF
About This Grant
The world has become data driven. Organizations, such as companies, domain sciences, and government agencies, increasingly have numerous datasets scattered across many locations. When starting a data science or AI project, users often must first find specific datasets, then analyze them to extract insights. However, finding the needed datasets among a “sea of datasets” is often very difficult, so organizations increasingly use data catalogs for this purpose. A data catalog stores the names, descriptions, and other characteristics of datasets, as well as the relationships among them. Users can then query the catalog to find desired datasets. As such, data catalogs have become a critical enabler for data science and AI projects. Yet the state of the art in catalog development has remained quite limited, leading to performance that falls short of users’ needs. In particular, not enough attention is devoted to the “pain points” of catalog users, and there is very little interaction among the research, vendor, user, and open-source tool communities. This has negatively impacted users, especially in domain sciences, where anecdotal evidence points to intensive manual work to construct catalogs. This project seeks to address these limitations. First, it will develop innovative and practical solutions for several pain points of catalog users, thereby accelerating research on these critical topics. Second, it will combine these solutions to build SmartCat, a data catalog system, and release SmartCat as open source for educational and research purposes and to serve domain science users. These technologies will facilitate the widespread deployment of data catalogs, resulting in better data science and AI for society, especially for domain sciences. Findings from the project will be incorporated into a new course, a new textbook, and a proposed computing-oriented (CS+X) major for environmental sciences, which will help improve STEM education. Finally, the project will build on the above activities to promote bridges among the stakeholder communities that work on data catalogs, thereby increasing partnership between academia, industry, and others.
Toward the above goals, this project considers a long-term agenda that spans the stakeholder communities (research, vendor, user, open-source tool) and applies Generative AI, especially large language models (LLMs), to solve catalog problems. The project will develop solutions to (1) enrich datasets with many types of metadata, such as table and column name expansions, textual descriptions, and tags, (2) discover relationships among the datasets, such as unionable, joinable, and lineage relationships, and (3) help users find desired datasets via browsing, keyword search, and natural language querying. These solutions will significantly advance the state of the art in data catalogs, data discovery, and data integration, as well as the application of Generative AI to data management. The project introduces new core problems that are serious “pain points” for catalog users but have received little attention from the research community. It also considers problems that have been extensively studied, points out the limitations of existing approaches, and proposes novel solutions. Like much current work, it considers LLMs, but does so in the context of a real-world data management problem, namely data catalogs. This uncovers serious novel challenges that studying LLMs in isolation does not. Finally, the project studies how to combine LLMs with a variety of other techniques, such as human-centric data management, traditional machine learning (ML), and big data scaling techniques, to build practical data catalog solutions.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
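To make the idea of a "joinable" relationship between datasets more concrete, here is a minimal Python sketch. It is not the project's method, which the abstract does not describe. It assumes a simple, commonly used heuristic: two columns are join candidates when most of the distinct values in one also appear in the other (value containment). The names `containment` and `joinable_columns`, the 0.8 threshold, and the toy tables are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory table: column name -> list of cell values.
Table = Dict[str, List[object]]


def containment(a: List[object], b: List[object]) -> float:
    """Fraction of the distinct values in `a` that also occur in `b`."""
    distinct_a, distinct_b = set(a), set(b)
    if not distinct_a:
        return 0.0
    return len(distinct_a & distinct_b) / len(distinct_a)


def joinable_columns(
    left: Table, right: Table, threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return (left column, right column, score) pairs whose containment
    score meets the (assumed) threshold, highest score first."""
    pairs = []
    for left_name, left_values in left.items():
        for right_name, right_values in right.items():
            score = containment(left_values, right_values)
            if score >= threshold:
                pairs.append((left_name, right_name, score))
    return sorted(pairs, key=lambda pair: -pair[2])


# Toy example: a table of sites and a table of sensor readings.
sites = {"site_id": ["S1", "S2", "S3"], "name": ["Creek", "Marsh", "Ridge"]}
readings = {"site": ["S1", "S2", "S3", "S3", "S4"], "value": [0.5, 0.7, 0.1, 0.2, 0.9]}

candidates = joinable_columns(sites, readings)
print(candidates)
```

A real catalog would need to scale this beyond pairwise comparison of full columns, for example with sketches or indexes, and would likely combine value overlap with other signals such as column names and types.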
Up to $1M
2029-06-30