Job Description
NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.
What you’ll be doing:
+ Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
+ Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
+ Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
+ Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
+ Participate i...