Job Description
We are seeking a Site Reliability / Resilience Engineer to support a large-scale enterprise technology environment. This role focuses on improving the reliability, availability, and resilience of critical services across complex, distributed systems.
You will work across cloud, infrastructure, and application ecosystems, helping to ensure that services are observable, recoverable, and aligned with both engineering best practices and regulatory resilience requirements.
Key Responsibilities
- Support reliability and resilience across cloud platforms (AWS, Azure, GCP)
- Work across infrastructure, networks, data centres, and application platforms
- Analyse and map service dependencies and critical service chains
- Contribute to the design and implementation of resilience and recovery strategies (RTO/RPO, failover patterns)
- Support vulnerability identification and risk reduction activities
- Enhance observability, monit...