CAREER: Reforming Profiling Techniques to Guide Systemic Performance Tuning for GPU-Accelerated Deep Learning Workloads

NSF

open

About This Grant

Graphics Processing Units (GPUs) are the go-to choice for deep learning because of their computational power and massive parallelism. However, maximizing GPU performance for model development and inference remains notoriously challenging as models grow increasingly complex, spanning multiple abstraction layers: the upstream Python layer, the midstream C/C++ layer, and the downstream GPU kernel layer. While this layered design meets diverse application needs, it also embeds inefficiencies that are difficult to detect because of intricate cross-layer interactions. The project addresses these inefficiencies through a comprehensive, cross-layer performance analysis of deep learning models. Its novelty lies in advancing state-of-the-art profiling techniques to enable systemic performance tuning across all layers. Its broader significance lies in deepening the understanding of systemic performance issues in deep learning, thereby strengthening foundations in code analysis and advancing progress in fields increasingly reliant on deep learning, such as image processing. With interest from industry leaders like Meta, the project shows strong potential for translating academic insights into practical applications. Additionally, the project contributes to educational and outreach goals by integrating its findings into computer science curricula and K-12 programs to cultivate a workforce skilled in performance analysis and optimization.

Three innovative analysis techniques structure the project.
(1) Unified binary code analysis: consolidates all layers of deep learning models into a shared binary abstraction, enabling the identification of cross-layer inefficiencies in code segments and data objects.
(2) Incremental analysis: incrementally narrows the scope of monitored performance metrics to pinpoint the root causes of the inefficient code segments identified in the unified binary analysis.
(3) Data object analysis: examines the inefficient data objects identified in the unified binary analysis to diagnose their root causes.

Together, these techniques form a comprehensive approach to performance tuning, addressing inefficiencies from a systemic perspective and maximizing GPU capabilities in deep learning. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Focus Areas

computer science, education

Eligibility

university, nonprofit, small business

How to Apply

Funding Range

Up to $342K

Deadline

2030-06-30

Complexity

Medium