CloudHeight Solutions

ML Engineer

On-site

London, United Kingdom

£150,000/year

Mid-level

Full Time

29-01-2026


Skills

Communication, Python, Rust, Docker, Kubernetes, Monitoring, Research, Training, Architecture, Linux, Machine Learning, PyTorch, TensorFlow, Deep Learning, Programming, Benchmarking, Azure, AWS, Cloud Platforms, C++, GCP

Job Specifications

Required Qualifications

Experience: 5+ years of hands-on experience in machine learning systems, performance engineering, or related software engineering roles focused on model optimization. (Alternatively, 3+ years with a relevant advanced degree such as MS or PhD in Computer Science, Electrical Engineering, or related field.)
Significant hands-on experience optimizing deep learning models for latency, throughput, and cost.
Proven ability to profile and debug performance bottlenecks across the stack (model, framework, runtime, and system-level).
Experience with distributed or large-scale training and inference, including data/model parallelism, pipeline parallelism, sharding, and gradient accumulation.
Practical CUDA development experience and familiarity with GPU programming concepts, tensor cores, memory management, and asynchronous execution.
Deep understanding of at least one major deep learning framework (ideally PyTorch) and experience with model export/runtime formats (TorchScript, ONNX, SavedModel); a minimal export sketch follows this list.
Familiarity with optimization techniques such as mixed precision (FP16/BF16/AMP), quantization (PTQ/QAT, INT8, q4/q8), distillation, pruning (structured/unstructured), activation checkpointing, operator fusion, and caching/batching strategies; see the mixed-precision sketch after this list.
Experience working with and productionizing large models (e.g., transformers), including an understanding of attention mechanisms, attention-kernel optimization, and memory/compute trade-offs.
Experience building and operating ML systems on cloud platforms (AWS, Azure, or GCP) and containerized deployments.
Comfort working with experiment tracking, monitoring, and evaluation pipelines.
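To make the export requirement above concrete, here is a minimal, hypothetical sketch of exporting a toy PyTorch module to both TorchScript and ONNX. The TinyNet model and output file names are illustrative placeholders, not part of this role's actual stack.

import torch
import torch.nn as nn

# A toy model standing in for a real network; purely illustrative.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x):
        return self.linear(x)

model = TinyNet().eval()
example_input = torch.randn(1, 128)

# TorchScript export via tracing.
scripted = torch.jit.trace(model, example_input)
scripted.save("tinynet.pt")

# ONNX export of the same module, naming the graph inputs/outputs.
torch.onnx.export(model, example_input, "tinynet.onnx",
                  input_names=["input"], output_names=["logits"])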
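Likewise, a minimal sketch of mixed-precision training combined with gradient accumulation, assuming PyTorch's torch.autocast and GradScaler APIs; the model, batch sizes, and step counts are placeholder assumptions, not prescriptions.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# GradScaler prevents FP16 gradient underflow; it is a no-op on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4  # one optimizer step per 4 micro-batches

for step in range(8):
    x = torch.randn(32, 512, device=device)
    # Run the forward pass in mixed precision where supported.
    with torch.autocast(device_type=device, dtype=torch.float16,
                        enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    # Average the loss over accumulation steps before backprop.
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)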

Preferred Qualifications

Experience developing custom CUDA kernels, integrating low-level GPU optimizations, or contributing to performance-focused libraries and runtimes (e.g., cuBLAS/cuDNN, cuFFT, NCCL, XLA, TVM, ONNX Runtime).
Prior experience optimizing inference serving systems and cost/latency trade-offs at scale, including use of Triton Inference Server, TensorRT, FasterTransformer, or DeepSpeed inference optimizations.
Familiarity with container orchestration (Kubernetes), serving frameworks, deployment tooling, and continuous delivery for ML models.
Experience with performance benchmarking and load testing (including MLPerf or custom benchmarks), and with building internal tooling/automation for continuous performance validation.
Background in compiler optimizations, kernel fusion, MLIR/XLA, or other systems-level optimizations.
Strong communication skills and a demonstrated ability to translate product requirements into measurable performance goals and SLIs: p50/p95/p99 latency, throughput in tokens/sec, GPU utilization, memory footprint, and cost per token. A minimal percentile-measurement sketch follows this list.
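As a concrete reading of the SLI language above, here is a minimal, hypothetical benchmarking sketch that reports p50/p95/p99 latency for an arbitrary callable. The nearest-rank percentile method and the workload in the usage line are stand-in assumptions, not a real serving harness.

import time

def percentile(sorted_vals, p):
    # Nearest-rank percentile over a pre-sorted list.
    k = max(0, min(len(sorted_vals) - 1,
                   int(round(p / 100 * (len(sorted_vals) - 1)))))
    return sorted_vals[k]

def benchmark(fn, warmup=10, iters=100):
    # Warm up to exclude one-time costs (JIT, cache population).
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}

# Example usage with a stand-in CPU workload in place of model inference.
if __name__ == "__main__":
    print(benchmark(lambda: sum(i * i for i in range(10_000))))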

Education Guidance

Bachelor's degree in Computer Science, Electrical Engineering, or a related technical field is typically expected for this role.
A master's degree or PhD is preferred for candidates with fewer years of industry experience or for roles with heavy research/architecture responsibilities.

Job Description

The role focuses on leading performance optimization across our AI/ML foundation model stack, designing GPU-accelerated components, and delivering measurable reductions in latency and cost while maintaining throughput and reliability.

Key Responsibilities

Own performance, scalability, and reliability for the foundation model during both training and inference, defining success metrics and tracking improvements.
Profile and optimize the end-to-end ML stack, including data pipelines, training loops, inference serving, and deployment workflows (a minimal profiling sketch follows this list).
Design, implement, and integrate GPU-accelerated components; develop custom CUDA kernels when existing libraries are insufficient.
Reduce latency and cost per inference token while maximizing throughput and hardware utilization through software and system-level optimizations.
Translate product requirements into clear, actionable optimization goals and technical roadmaps in close collaboration with the founders and cross-functional teams.
Build and maintain internal tooling, benchmarks, and evaluation harnesses to enable reliable experimentation, debugging, and safe rollouts.
Contribute to model architecture and system design decisions where they impact performance, robustness, and operational efficiency.
Advocate best practices for performance-aware development, monitoring, and continuous improvement across the engineering team.
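One possible starting point for the profiling responsibility above, sketched with torch.profiler; the toy model, input shape, and iteration count are illustrative assumptions rather than anything this posting specifies.

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
x = torch.randn(64, 256)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

# Record operator-level timings to find where inference time actually goes.
with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# Print the most expensive ops; the sort key works for CPU and GPU runs.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))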

Technical Keywords (for Discoverability)

Frameworks & Runtimes: PyTorch, TensorFlow, TorchScript, ONNX, ONNX Runtime, Triton Inference Server, TensorRT, TVM, XLA, MLIR
Libraries & Optimizations: cuBLAS, cuDNN, NCCL, FasterTransformer, DeepSpeed, Hugging Face Transformers, PEFT, OpenVINO
Languages & Platforms: Python, C++, CUDA, Rust; Linux, Docker, Kubernetes
Distributed & Parallelism: Data parallelism, model parallelism, pipeline parallelism, tensor parallelism, ZeRO, sharding, Horovod
Quantization & Precision: FP32/FP16/BF16, mixed precision, INT8, post-training quantization (PTQ), quantization-aware training (QAT)

About the Company

At CloudHeight Solutions, we are your all-in-one partner for building and elevating your online presence. Based in Lisbon, Portugal, we specialize in delivering end-to-end digital solutions designed to help businesses succeed in today’s competitive landscape. With a strong focus on empowering small and medium enterprises (SMEs) to establish a robust digital presence, we provide custom website design, SEO & marketing, professional email services, and more—offering the tools and expertise to launch, enhance, and optimize your ...