- Company Name: Cloudheight Solutions
- Job Title: ML Engineer
- Job Description:
**Job Title**
ML Engineer (Performance Optimization)
**Role Summary**
Lead end‑to‑end performance engineering for AI/ML foundation models, driving latency, cost, and throughput improvements across training, inference, and deployment pipelines. Design GPU‑accelerated components, implement custom CUDA kernels, and build internal tooling for continuous performance validation.
**Expectations**
- 5+ years of hands‑on experience in ML systems, performance engineering, or advanced software engineering roles (or 3+ years with an MS/PhD in CS, EE, or related field).
- Proven expertise in profiling, debugging, and optimizing deep‑learning workloads for latency, throughput, and cost at scale.
- Deep knowledge of distributed training/inference strategies (data, model, and pipeline parallelism; sharding).
**Key Responsibilities**
- Own performance, scalability, and reliability of foundation models during training, inference, and deployment.
- Profile and optimize the entire ML stack: data pipelines, training loops, inference serving, and deployment workflows.
- Design, implement, and integrate GPU‑accelerated components; develop custom CUDA kernels as needed.
- Reduce latency and cost per inference token while maximizing throughput and GPU utilization.
- Translate product requirements into measurable performance goals (p50/p95/p99 latency, throughput, GPU utilization, memory footprint, cost per token) and technical roadmaps.
- Build and maintain internal benchmarking, evaluation harnesses, and automation for continuous performance validation.
- Contribute to model architecture and system‑design decisions that impact performance, robustness, and operational efficiency.
- Advocate for best practices in performance‑aware development, monitoring, and continuous improvement across the engineering team.
**Required Skills**
- Deep learning frameworks: strong hands‑on PyTorch (core); familiarity with TensorFlow.
- Model export/runtime formats: TorchScript, ONNX, SavedModel.
- CUDA programming: kernel development, GPU memory management, asynchronous execution.
- Performance optimization techniques: mixed precision (FP16/BF16/AMP), quantization (PTQ/QAT; int8 and 4‑bit/8‑bit formats), pruning, distillation, activation checkpointing, operator fusion, and batching/caching strategies.
- Experience with large transformer models, attention kernel optimization, and memory/compute trade‑offs.
- Distributed training/inference: data, model, pipeline, and tensor parallelism; sharding (e.g., ZeRO); Horovod.
- Cloud deployment: AWS, Azure, or GCP; containerized deployments with Docker, Kubernetes.
- Experiment tracking, monitoring, and evaluation pipelines.
- Nice to have: advanced custom CUDA kernel work, integration with low‑level GPU libraries (cuBLAS/cuDNN, NCCL), and inference serving (Triton, TensorRT, FasterTransformer, DeepSpeed).
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related technical field (MS/PhD preferred for candidates with fewer years of experience).