Job Description
What You’ll Do
- Maintain a comprehensive view of the system architecture, with a deep understanding of integration patterns and dependencies across the technology stack
- Design and implement robust monitoring frameworks, intelligent alerting systems and streamlined incident response procedures
- Conduct systematic security reviews, coordinate penetration testing initiatives and perform thorough threat analysis to assess vulnerabilities
- Define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and implement automated monitoring to track service reliability
- Run comprehensive daily operational health assessments and proactively monitor systems using advanced observability tools
- Build system resilience through chaos engineering practices, disaster recovery planning and continuous performance optimization
- Provide expert-level support and rapid incident resolution to maintain production system ...