💼 Full-Time Position

Senior System Architect, Infrastructure Reliability

🏢 NVIDIA
📍 Santa Clara, CA, United States
📅 Posted: June 06, 2026

Full-Time Opportunity: This is a permanent, full-time position with a competitive package and real career growth potential.

Job Description

NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can waste massive amounts of compute. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.

What you'll be doing:
+ Architect Failure Attribution Frameworks: Build a scalable flight recorder for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
+ Build Automated Diagnostics: Correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions with system-level events such as OOM kills or NUMA-related hangs.
+ Distr...