Job Description
About Us
We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.
Role Overview
We are seeking an Infrastructure Tooling & Observability Engineer to play a key engineering role within our global Infrastructure Operations organisation. Working closely with our SRE teams, you will translate high-level reliability objectives into scalable, production-ready systems that directly improve the resilience, efficiency, and performance of our infrastructure.
This role goes beyond traditional monitoring. You will help design and build the internal control plane that enables operations at scale across a rapidly growing GPU fleet. Your work will focus on transforming complex, high-volume telemetry—spanning logs, metrics,...