
CAREER: Embracing Uncertainty in High-Performance Computing Resource Scheduling: An Integrated Algorithmic and Machine Learning-based Approach

NSF

open

About This Grant

Resource scheduling is a critical component of high-performance computing (HPC) systems. Despite extensive literature on scheduling, new challenges continue to arise from advances in hardware and software and from evolving models, metrics, and performance demands. Today’s HPC systems operate on an unprecedented scale, presenting significant challenges for resource management, particularly when facing uncertainty introduced by emerging application characteristics and system-level complexities. Existing schedulers lack robust mechanisms to handle uncertainty effectively, limiting their ability to achieve optimal performance.

This project takes on the grand challenge of scheduling HPC resources under uncertainty by introducing an integrated approach that combines algorithmic analysis and machine learning (ML). The approach leverages the rigor of algorithmic analysis to provide performance guarantees while using ML’s predictive capabilities to manage uncertainty effectively. The anticipated outcome is a substantial enhancement to current HPC schedulers, enabling more efficient execution of a diverse range of scientific applications in areas such as neuroscience, medical research, climate modeling, and artificial intelligence.

Additionally, the project includes a series of synergistic activities, including outreach programs, curriculum development, and student recruitment, aimed at engaging students from K-12 through graduate levels. These efforts focus particularly on underrepresented and underserved communities, offering research opportunities that foster success in STEM and CS education.

Technically, this project aims to design, implement, and evaluate scheduling algorithms that integrate ML prediction models to enhance efficiency. The focus will be on addressing three primary sources of uncertainty: (1) inherent runtime variability of emerging applications; (2) resource contention in job co-scheduling; and (3) structural variations within dynamic workflows. These sources represent uncertainty across the temporal, spatial, and structural dimensions respectively, all of which demand solutions due to their growing prevalence in modern HPC environments.

Algorithmically, approximation and semi-online algorithms will be developed to provide performance guarantees relative to theoretical lower bounds for metrics such as job completion time and resource utilization. On the ML front, various models, including those based on regression and reinforcement learning, will be trained to deliver accurate predictions for job runtime, performance degradation, and structural variability. A key ambition of this project is to establish an incubation framework that enables the effective integration of heuristic-based algorithms and data-driven ML models. This integration aims to achieve a level of performance that neither paradigm could accomplish independently. The framework will offer a novel perspective on resource management and potentially set the stage for future HPC advancements.

This project is jointly funded by Software and Hardware Foundations and the Established Program to Stimulate Competitive Research (EPSCoR). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Focus Areas

machine learning, climate, education

Eligibility

university, nonprofit, small business

How to Apply

Funding Range

Up to $324K

Deadline

2030-09-30

Complexity
Medium