CAREER: Embracing Uncertainty in High-Performance Computing Resource Scheduling: An Integrated Algorithmic and Machine Learning-based Approach
NSF
About This Grant
Resource scheduling is a critical component of high-performance computing (HPC) systems. Despite extensive literature on scheduling, new challenges continue to arise from advances in hardware and software and from evolving models, metrics, and performance demands. Today’s HPC systems operate at an unprecedented scale, which presents significant challenges for resource management, particularly under the uncertainty introduced by emerging application characteristics and system-level complexities. Existing schedulers lack robust mechanisms for handling uncertainty, which limits their ability to achieve optimal performance. This project takes on the grand challenge of scheduling HPC resources under uncertainty by introducing an integrated approach that combines algorithm design and machine learning (ML). The approach leverages the rigor of algorithmic analysis to provide performance guarantees while using ML’s predictive capabilities to manage uncertainty. The anticipated outcome is a substantial enhancement to current HPC schedulers, enabling more efficient execution of a diverse range of scientific applications in fields such as neuroscience, medical research, climate modeling, and artificial intelligence.
Additionally, the project includes a series of synergistic activities, including outreach programs, curriculum development, and student recruitment, aimed at engaging students from K-12 through graduate levels. These efforts focus particularly on underrepresented and underserved communities, offering research opportunities that foster success in STEM and CS education.
Technically, this project aims to design, implement, and evaluate scheduling algorithms that integrate ML prediction models to enhance efficiency. The focus will be on three primary sources of uncertainty: (1) inherent runtime variability of emerging applications; (2) resource contention in job co-scheduling; and (3) structural variations within dynamic workflows. These correspond to uncertainty in the temporal, spatial, and structural dimensions, respectively, all of which demand solutions because of their growing prevalence in modern HPC environments. Algorithmically, approximation and semi-online algorithms will be developed to provide performance guarantees relative to theoretical lower bounds for metrics such as job completion time and resource utilization. On the ML front, various models, including those based on regression and reinforcement learning, will be trained to deliver accurate predictions of job runtime, performance degradation, and structural variability.
A key ambition of this project is to establish an incubation framework that enables the effective integration of heuristic-based algorithms and data-driven ML models. This integration aims to achieve a level of performance that neither paradigm could accomplish independently. The framework will offer a novel perspective on resource management and potentially set the stage for future HPC advancements.
This project is jointly funded by Software and Hardware Foundations and the Established Program to Stimulate Competitive Research (EPSCoR). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Up to $324K
2030-09-30