Full-Time Opportunity: This is a permanent, full-time position with a competitive package and real career growth potential.
Job Description
Responsibilities
- Navigate, troubleshoot, and recover dynamic infrastructure and long-running processes in real time using command-line tools.
- Master and manage highly containerized environments, including orchestrating Dockerized sandboxes and CI/CD workflows.
- Build, maintain, and optimize systems for AI model training and high-throughput compute environments.
- Respond swiftly to system errors, executing dynamic mid-operation replanning and recovery.
- Collaborate with engineering and AI teams to ensure seamless integration, reliability, and performance.
- Document system architectures, incident responses, and recovery protocols with meticulous clarity.
Requirements
- Demonstrate expert proficiency in terminal environments for system builds, server administration, and infrastructure management.
- Possess advanced problem-solving skills for multi-step troubleshooting, f...