
CC* Integration-Large: Scaling Scientific Workloads on Distributed Commodity GPUs and Storage through Campus-level RDMA Networking

NSF

open

About This Grant

Scientific workloads have outgrown the capabilities of today's campus networks, driven by two key trends: the increasing adoption of machine learning (ML) in scientific research and the growing need to access and process large-scale datasets. Remote Direct Memory Access (RDMA) has emerged as a key network technology for providing high-bandwidth, low-latency communication for distributed ML and fast data storage. This project explores the design and implementation of an RDMA-based campus network that enables the shared use of distributed, heterogeneous Graphics Processing Units (GPUs) to accelerate scientific applications and provides fast access to research data storage.

The project entails four research thrusts. First, high-bandwidth, low-latency RDMA network infrastructure will be established to connect campus GPUs using standard data center-class network hardware. Second, new workload scheduling systems and algorithms will be developed to make efficient use of the RDMA network. Third, storage disaggregation over RDMA will be enabled, allowing compute servers to access remote NVMe-class storage with minimal performance overhead. Finally, varied science applications, such as large language models (LLMs), domain-specific natural language processing (NLP), medical image processing, and cryo-electron microscopy (CryoEM), will be evaluated on top of the RDMA network, the workload scheduler, and the disaggregated storage.

This project is a first step toward improving the efficiency and utilization of Duke University's compute infrastructure by connecting centralized and individual GPU servers via a high-speed RDMA network. It is expected to reduce the time, effort, and financial burden that researchers typically face when scaling scientific workloads in a number of fields, including but not limited to ML and LLMs, biomedical imaging, and molecular dynamics.

The project will contribute to workforce development by training graduate and undergraduate students in high-performance computing, network optimization, and large-scale data management. Software artifacts, papers, and tutorials developed as part of this project will be released on the project website, https://sites.duke.edu/dream/. This website will be regularly maintained for five years after the completion of the project. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Focus Areas

machine learning

Eligibility

university, nonprofit, small business

How to Apply

Funding Range

Up to $750K

Deadline

2027-06-30

Complexity
Medium
