Job Description
NVIDIA DGX Cloud is scaling GPU infrastructure across internal, partner, and cloud environments. We are looking for Principal Software Engineers to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.
This role is for senior technical leaders who can define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.
What you’ll be doing:
+ Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
+ Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
+ Establish patterns for Kubernetes-based GPU cluster operations acro...