- Company Name
- Greenhouse Software
- Job Title
- Senior ML Ops Engineer
- Job Description
-
Job Title: Senior ML Ops Engineer
Role Summary:
Design, build, and maintain end‑to‑end infrastructure that operationalizes machine learning and large‑language‑model (LLM) workloads. Lead CI/CD pipeline development, enforce observability, ensure data quality, and collaborate with data scientists, ML engineers, and cross‑functional stakeholders to deliver production‑ready ML services.
Expectations:
- Own the full model lifecycle from prototype to production.
- Keep abreast of ML industry trends and embed best practices in toolsets.
- Champion reliability, compliance, and ethical AI standards.
- Serve on an on‑call rotation to sustain system uptime.
Key Responsibilities:
- Operationalize ML/LLM workloads and translate stakeholder feedback into production systems.
- Design, implement, and continuously improve CI/CD pipelines for rapid model iteration.
- Deploy and manage Kubernetes clusters, ensuring scalability and security.
- Implement observability for ML services: monitoring, logging, alerting, and tracing.
- Version and automate evaluation sets, and enforce continuous quality checks across the model lifecycle.
- Maintain infrastructure for data quality monitoring and performance analytics in production.
- Collaborate with ML, data science, and engineering teams to realize impactful solutions.
- Participate in on‑call rotation and capacity planning.
Required Skills:
- 5+ years of experience in MLOps, DevOps, or a related software engineering role.
- Deep expertise in AWS, Kubernetes, Terraform, Argo CD, and GitOps.
- Proven experience building CI/CD pipelines for ML models.
- Strong knowledge of ML frameworks and tooling (PyTorch, Transformers, vLLM, MLflow).
- Hands‑on experience with data quality, performance monitoring, and vector databases (e.g., OpenSearch).
- Familiarity with AI compliance and governance frameworks (ISO/IEC 42001, NIST AI RMF).
- Excellent problem‑solving, communication, and teamwork skills.
- Bonus: experience with Bedrock/SageMaker, Metaflow, Databricks, Vertex AI, SkyPilot, Kubeflow, Loki, Grafana.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Electrical Engineering, Data Science, or related field (or equivalent experience).
- Relevant certifications strongly preferred: AWS Certified Machine Learning – Specialty, Certified Kubernetes Administrator (CKA), HashiCorp Certified: Terraform Associate, or similar.