- Company Name: Attis
- Job Title: Lead DevOps Engineer – HPC / ML Infrastructure / Platform Engineer
- Job Description:
**Role Summary**
Design, build, and operate a fully automated high‑performance computing platform that supports large‑scale scientific computing and machine‑learning workflows. Own the entire infrastructure stack—from on‑prem HPC clusters to cloud environments—using Infrastructure‑as‑Code, CI/CD, and SRE principles to deliver reliability, scalability, and performance for data‑intensive research.
**Expectations**
- Full ownership of the computational environment, from architecture to day‑to‑day operations.
- Active collaboration with data scientists, ML engineers, and research teams to translate scientific and model‑training requirements into platform features.
- Demonstrated ability to design and deploy large‑scale, distributed systems that manage multi‑terabyte datasets.
**Key Responsibilities**
- Architect and implement HPC clusters and cloud‑based systems for high‑speed compute and storage.
- Develop and maintain IaC pipelines (Terraform, Ansible, Pulumi, etc.) for provisioning, configuration, and continuous delivery.
- Build robust, secure CI/CD pipelines for infrastructure code and application deployments.
- Monitor, troubleshoot, and optimize performance across the stack (kernel, networking, container runtimes, cluster schedulers).
- Apply SRE best practices: incident response, capacity planning, reliability metrics, and post‑mortem analysis.
- Mentor and guide cross‑functional teams on infrastructure usage and best practices.
- Continuously evaluate new and evolving technologies (Kubernetes, Docker, GPU accelerators, serverless, etc.) to improve platform capabilities.
**Required Skills**
- Hands‑on experience architecting and operating large‑scale HPC or distributed computing environments.
- Strong proficiency in Kubernetes, Docker, and container orchestration.
- Expertise with IaC tools (Terraform, Ansible, or equivalent).
- Deep knowledge of Linux/UNIX systems and scripting/systems programming (Python, Go, Rust, C++).
- Proven track record managing multi‑terabyte datasets and high‑performance data pipelines.
- Experience with cloud platforms (AWS, GCP, Azure) and on‑prem cluster management.
- Familiarity with CI/CD tooling (Jenkins, GitLab CI, Argo CD, etc.) and monitoring/observability (Prometheus, Grafana, Loki).
- Strong communication and collaboration skills with scientific/ML teams.
**Required Education & Certifications**
- Bachelor’s degree (or equivalent experience) in Computer Science, Engineering, or related field.
- Preferred (not required) certifications: AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, Certified Kubernetes Administrator (CKA), or equivalent IaC/DevOps credentials.