Job Description
What you'll be doing:
- Performance Profiling & Optimization: Use profiling tools (e.g., Nsight, PyTorch Profiler) to identify bottlenecks in data loading, gradient computation, and communication. Implement optimizations such as kernel fusion, sharding, and tiling to reduce training step time.
- Distributed Training: Optimize distributed training pipelines using frameworks such as PyTorch Distributed.
- Kernel Development: Design and maintain high-performance GPU kernels in Triton or CUDA for state-of-the-art ML workloads.
- Data Pipeline Engineering: Optimize data loading pipelines so that they are robust and maximize training throughput.
Qualifications:
- Education: Bachelor's, Master's, or PhD degree in Computer Science, Computer Engineering, or a related technical discipline.
- Software Engineering: Strong proficiency in Python.
- ML Frameworks: Extensive hands...