CIF:Small: Towards practical gradient coding
NSF
About This Grant
Machine learning systems have made revolutionary advances in several areas, including (but not limited to) automated speech and image recognition, scientific discovery, human health, and national security. These advances have been made possible in large part by the training of high-capacity models that are able to capture and infer complex relationships within exorbitant amounts of data, such as images, video, and speech. Such training is quite resource-intensive and failure-prone and typically requires the deployment of large groups of computers that operate collaboratively to achieve the overall objectives. For instance, by conservative estimates, the training of current state-of-the-art models for language understanding consumes enough energy to power over one thousand average US households for a year. Moreover, a rule of thumb within distributed computing states: "failures are the norm, rather than the exception". This project will investigate resource-efficient and fault-tolerant schemes for distributed model training within machine learning. Specifically, the training time depends on the reliability and speed of the computers and on the speed of communication between them; this project will examine techniques for simultaneously increasing both. If successful, this will result in significant energy and monetary savings across the board in scenarios where machine learning is routinely deployed. The ability to work with large-scale computing clusters is an essential skill for the workforce, and this project will help train undergraduate and graduate students in such techniques. The team of researchers will volunteer for mathematics tutoring activities as part of the CyMath initiative at Iowa State; CyMath offers free and open-to-all, after-school math tutoring for elementary and middle school students in Ames area schools.
Overall, the goals of this project will lead to the acceleration of machine-learning-driven advances in a variety of fields, e.g., science and human health, and contribute to the US economy and society.

Distributed machine learning model training typically involves minimizing a loss function that depends on the training dataset with respect to a parameter vector. The number of parameters in many problems of real-life interest can range from hundreds of millions to billions. Such training is the driver of key technologies such as deep neural networks and large language models. In these systems, the workers are required to compute gradients on the data points assigned to them. Depending on the underlying architecture, the workers either communicate to-and-from a central server or exchange messages amongst themselves to perform gradient descent over several iterations. In addition to worker-node computation, communication is well recognized as a significant part of the overall training time. Worker nodes (especially within cloud platforms) are also prone to straggling (slowdowns and/or failures). The foundational goal of this project is to investigate failure/slowdown-resilient and communication-efficient schemes for distributed model training across different classes of learning architectures. This will be achieved by introducing redundancy in the assignment of data points to the worker nodes and using coding-theoretic ideas for the recovery of the gradient, either exactly or approximately. The research team will examine the design of numerically stable and communication-efficient schemes that leverage the work performed by slow (as opposed to failed) workers. They will also study classes of schemes that vary depending on the amount of system knowledge available across the workers and the assumed communication between them.
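The redundancy-plus-decoding idea described above can be illustrated with a minimal sketch. This is not the scheme proposed by this project; it assumes the classic cyclic gradient code of Tandon et al. with n = 3 workers tolerating s = 1 straggler, applied to a toy least-squares gradient. All variable names and the toy problem are illustrative.

```python
import numpy as np

# Toy least-squares problem: gradient of ||Xw - y||^2, split over 3 partitions.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
w = np.zeros(5)

# Per-partition gradients g_i (10 samples each); shape (3, 5).
parts = np.split(np.arange(30), 3)
g = np.stack([2 * X[p].T @ (X[p] @ w - y[p]) for p in parts])
full_grad = g.sum(axis=0)

# Encoding matrix B: row i gives the linear combination of partition
# gradients sent by worker i. Any 2 of the 3 coded rows span the
# all-ones combination, so one straggler can be tolerated.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])

coded = B @ g  # each worker i transmits coded[i]

# Suppose worker 1 straggles: decode from the survivors.
survivors = [0, 2]
# Find decoding vector a with a^T B[survivors] = all-ones row.
a, *_ = np.linalg.lstsq(B[survivors].T, np.ones(3), rcond=None)
recovered = a @ coded[survivors]

assert np.allclose(recovered, full_grad)  # exact gradient despite the straggler
```

The decoding vector depends only on which workers responded, not on the data, so the central server can precompute one vector per straggler pattern; approximate-recovery variants relax the all-ones constraint to reduce the redundancy per worker.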
A key goal will be to provide rigorous guarantees of the quality of the gradient that is recovered by the system. The successful completion of the goals of this project will result in significant reductions in training times of distributed models and corresponding resource savings. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Up to $582K
2028-06-30