Full-Time Opportunity: This is a permanent, full-time position with a competitive package and real career growth potential.
Job Description
What you will be doing:
- Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
- Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
- Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
- Guide our customers and support teams on HPC concepts and best practices for running applications on multi-node clusters
- Write documentation and conduct trainings and webinars for NCCL
- Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support
What we need to see:
- B.S./M.S. degree in CS/CE, or equivalent experience, with 5+ years of relevant experience
- Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Excellent C/C++ programming skill...