- Company Name
- Ventula Consulting
- Job Title
- Senior Machine Learning Engineer
- Job Description
Role Summary:
Design and operate end‑to‑end MLOps solutions for high‑throughput inference, including GPU‑enabled Kubernetes clusters built from scratch with kubeadm and Helm. Drive automation pipelines (ArgoCD, GitHub Actions, Terraform), monitor ML workloads (Prometheus, Grafana), and ensure production reliability and scalability. Collaborate with ML and product teams to align infrastructure with evolving project needs, mentor colleagues, and continuously improve ML lifecycle management.
Expectations:
- Lead the design, deployment, and operation of scalable GPU Kubernetes environments.
- Deliver robust CI/CD pipelines and infrastructure as code (IaC) to speed up model iteration.
- Maintain 24/7 system reliability, including participation in the on‑call rotation and incident response.
- Mentor engineers on best practices in MLOps, cloud ops, and Kubernetes architecture.
Key Responsibilities:
- Deploy and maintain GPU‑enabled Kubernetes clusters from scratch using kubeadm, Helm, and diagnostics tooling.
- Build automation workflows for data ingestion, model training, and inference deployment in Python or Go.
- Implement CI/CD pipelines with ArgoCD, GitHub Actions, and Terraform for reproducible deployments.
- Configure monitoring and observability stacks (Prometheus, Grafana, cloud‑native tooling) to track performance, resource usage, and model metrics.
- Scale infrastructure to support high‑volume inference and model training across multiple teams.
- Conduct capacity planning, cost optimization, and performance tuning for ML workloads on AWS.
- Facilitate incident reviews, root‑cause analysis, and post‑mortem documentation.
- Provide technical guidance and hands‑on coaching to junior MLOps and ML engineers.
Required Skills:
- Deep expertise in Kubernetes cluster architecture, GPU scheduling, and Helm chart development.
- Experience building clusters from scratch using kubeadm and managing them at the infrastructure level.
- Proficiency in Python or Go for developing ML automation scripts and pipelines.
- Strong background in Docker, container orchestration, and cloud‑native deployment practices.
- Hands‑on experience with AWS services (EKS, S3, EC2, SageMaker) for ML workloads.
- Skilled with CI/CD tools (ArgoCD, GitHub Actions) and IaC (Terraform).
- Familiarity with monitoring/observability solutions (Prometheus, Grafana, cloud monitoring).
- Proven track record of deploying and operating ML models in production, including experiment tracking, model versioning, and monitoring.
- Excellent troubleshooting, incident response, and system optimization skills.
Required Education & Certifications:
- Bachelor’s degree or higher in Computer Science, Engineering, or related field, or equivalent practical experience.
- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) preferred.
- AWS Certified Solutions Architect – Associate or similar cloud certification highly desirable.