Andiamo

Principal Staff Engineer – AI Infrastructure - AI/ML Leader

Hybrid

Toronto, Canada

Senior

Full Time

27-01-2026

Skills

Leadership, Python, Java, Go, CI/CD, Docker, Kubernetes, Monitoring, Resource Allocation, Research, Training, Architecture, Systems Architecture, Machine Learning, PyTorch, TensorFlow, Programming, Azure, AWS, Cloud Platforms, C++, GCP, CI/CD Pipelines, Prometheus, Grafana

Job Specifications

Principal Staff Engineer - AI Infrastructure

About The Role

We are seeking a Principal Staff Engineer to lead the architecture and development of our next-generation AI infrastructure. This role sits at the intersection of large-scale distributed systems and cutting-edge machine learning, powering the platforms that enable researchers and engineers to build, train, and deploy AI models at global scale.

As a senior technical leader, you will define architectural strategy, influence cross-organizational initiatives, and guide the design of highly reliable, efficient, and scalable systems. You’ll balance deep technical execution with strategic vision—mentoring senior engineers, collaborating with AI researchers, and ensuring our infrastructure accelerates innovation while maintaining world-class reliability.

What You’ll Do

Design & Scale AI Infrastructure: Architect and build distributed training, inference, and data pipelines that support large-scale AI workloads across GPUs and heterogeneous environments.
Lead Cloud-Native Innovation: Drive adoption of Kubernetes, Docker, and modern orchestration frameworks to optimize model deployment, resource allocation, and cluster utilization.
Optimize Performance at Scale: Develop high-throughput, low-latency services and memory-efficient systems to support petabyte-scale data and massive model sizes.
Advance Observability & Reliability: Implement monitoring, tracing, and fault-tolerance strategies to ensure resilient AI systems in production.
Collaborate with Research & Product: Partner with ML scientists, product engineers, and platform teams to design infrastructure that accelerates experimentation and model iteration.
Mentor & Inspire: Support the technical growth of senior engineers, fostering a culture of excellence, innovation, and ownership.
Shape Technical Strategy: Define long-term roadmaps for AI infrastructure, balancing near-term delivery with foundational investments in scalability, efficiency, and reliability.

What We’re Looking For

Extensive Experience: 10+ years in distributed systems, large-scale infrastructure, or platform engineering, with experience supporting AI/ML workloads strongly preferred.
Programming Mastery: Deep expertise in Java, Python, or C++, with proven ability to build performant and reliable systems.
AI/ML Infrastructure Knowledge: Familiarity with ML frameworks (TensorFlow, PyTorch, JAX), distributed training strategies, GPU scheduling, and data pipeline optimization.
Modern Infrastructure Skills: Hands-on experience with Kubernetes, Docker, CI/CD pipelines, cloud platforms (AWS/GCP/Azure), and observability tools (Prometheus, Grafana, Datadog).
Systems Design Expertise: Strong foundation in algorithms, concurrency, and systems architecture for high-scale, fault-tolerant environments.
Leadership & Influence: Demonstrated success driving cross-functional initiatives, mentoring senior engineers, and setting engineering-wide standards.
Product Mindset: Ability to balance technical rigor with usability and speed, ensuring infrastructure empowers rapid iteration and impactful outcomes.

About Andiamo

Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.

For over 20 years, we've maintained the status of tier-one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.

Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com

About the Company

We're a different kind of recruiting firm. We employ Research & Data Analysts alongside Recruiters to bring our clients the top 2% of passive technology and go-to-market talent. With unique data partnerships and a revolutionary technology platform, we're mining and curating massive amounts of data to bridge the talent gap. Our data-driven approach has helped us become the top recruiting partners of Amazon.com, HBO, Bloomberg, Goldman Sachs, TripAdvisor, Audible, MasterCard, and others. Andiamo has locations in New York, ...