- Company Name
- Pragmatike
- Job Title
- Principal AI/ML Engineer
- Job Description
-
Job Title: Principal AI/ML Engineer
Role Summary: Lead the design, implementation, and scaling of end‑to‑end ML Ops pipelines, overseeing training, fine‑tuning, evaluation, deployment, and monitoring of AI models across cloud and on‑prem GPU infrastructures. Drive architecture strategy, tooling, and best practices, collaborating with researchers, backend, and product teams to ensure production‑grade AI systems.
Expectations:
- Design and own large‑scale ML infrastructure architecture.
- Own continuous improvement of compute utilization, observability, and automation.
- Mentor and influence cross‑team engineering practices.
- Operate in a fast‑paced, high‑ownership environment.
Key Responsibilities:
- Architect, build, and scale ML Ops pipelines (training, fine‑tuning, rollout, monitoring).
- Design deployment, versioning, reproducibility, and orchestration across cloud/on‑prem GPU clusters.
- Optimize distributed compute (Kubernetes, autoscaling, caching, GPU allocation, checkpointing).
- Implement observability (drift, performance, throughput, reliability, cost).
- Automate dataset curation, labeling, feature pipelines, evaluation, and CI/CD for models.
- Productionize models with researchers and accelerate training/inference pipelines.
- Establish and evangelize ML Ops best practices, internal standards, and tooling.
- Mentor engineers and shape architectural direction across the AI platform.
Required Skills:
- Deep experience designing/operating production ML systems at Staff/Principal level.
- Expertise in ML Ops, distributed systems, and cloud (AWS, GCP, or Azure).
- Proficiency in Python; familiarity with TypeScript or Go for platform integration.
- Knowledge of ML frameworks: PyTorch, Transformers, vLLM, Llama‑factory, Megatron‑LM, CUDA/GPU acceleration.
- Containerization and orchestration: Docker, Kubernetes, Helm, autoscaling.
- Strong understanding of ML lifecycle workflows (train, fine‑tune, evaluate, inference, model registry).
- Experience leading technical strategy, cross‑functional collaboration, and operating in ambiguous contexts.
Bonus: experience deploying enterprise‑scale LLMs, DevOps, CI/CD, and IaC, GPU cluster optimization, data engineering, or real‑time ML systems.
Required Education & Certifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field (advanced degree preferred).
- Certifications in cloud platforms (AWS, GCP, Azure), Kubernetes, or Helm are a plus.
Washington, United States
On-site
Senior
28-01-2026