Job Specifications
Lead DevOps Engineer - HPC / ML Infrastructure / Platform Engineer
We are seeking an exceptional infrastructure engineer to build and operate the HPC backbone for our client's mission. This is a unique opportunity to create the foundational platform that will power complex predictive modeling and process immense, planetary-scale datasets.
You will be the "engineer's engineer," enabling a team of world-class scientists to solve deeply meaningful real-world problems.
Why Join?
A generous salary of up to $230k + equity + 15% bonus + full benefits package.
Mission & Impact: Your work will directly contribute to a mission with significant global importance, building systems that generate critical insights from complex environmental data.
Greenfield Ownership: You will be the principal owner of the computational environment. This isn't about maintaining an existing system; it's about architecting, building, and scaling it from the ground up.
Technical Challenge: This role operates at the intersection of HPC, MLOps, and large-scale data. You will solve challenging problems related to distributed computing, automation, and performance at a massive scale.
Culture & Compensation: Join a small, brilliant team that values direct and honest communication, with the highly competitive salary, equity, bonus, and benefits outlined above.
The Company
My client is a venture-backed startup dedicated to tackling major environmental and scientific challenges. They are leveraging cutting-edge techniques and vast, complex datasets to create a new class of predictive insights. By building robust, scalable technology, they are providing solutions that were previously out of reach.
The Role
As the Principal Software Infrastructure Engineer, you will have ultimate responsibility for the reliability, scalability, and performance of the company's entire computational platform.
Architect, implement, and manage a sophisticated HPC cluster and cloud environment designed for both traditional scientific computing and modern machine learning workflows.
Champion an "Infrastructure-as-Code" philosophy, automating everything from provisioning and configuration to deployment and monitoring.
Build and own the CI/CD pipelines for infrastructure, ensuring the entire environment is reproducible, stable, and secure.
Embody a Site Reliability Engineering (SRE) mindset, proactively identifying and eliminating performance bottlenecks across the stack, from the Linux kernel to the network layer.
Serve as the key technical partner to the science and machine learning teams, understanding their needs and building the robust platform they need to succeed.
This hybrid role requires the ability to work from an office in either the Greater Denver or Greater Boston area at least three days per week. Relocation assistance is available.
The Essential Requirements
This role is for a hands-on builder, not just a user of platforms. You must have:
Demonstrable experience architecting, building, and owning core infrastructure from first principles.
A deep and practical understanding of large-scale, distributed computing environments (e.g., HPC clusters, supercomputing).
Expertise with modern platform technologies, including Kubernetes and Docker, for service orchestration.
Proven proficiency with Infrastructure-as-Code (IaC) tools such as Terraform or Ansible.
Strong, hands-on knowledge of Linux/UNIX systems and proficiency in a systems programming or scripting language (e.g., Python, C++, Go, Rust).
Experience handling and processing massive, multi-terabyte datasets.
Either a professional background in a scientific, research, or mission-driven R&D environment (e.g., Earth observation, aerospace, genomics, physics), or experience from a domain with analogous data challenges, such as high-frequency trading data platforms, large-scale IoT sensor systems, or complex geospatial logistics.
What Will Make You Stand Out
Specific experience building and operating the infrastructure for large-scale AI/ML model training and deployment.
Professional development experience with C++.
Please note: this role is unfortunately not a fit for Data Scientists, ML Scientists, BI Engineers, Data Analysts, Application Engineers, or ML Engineers whose focus is model productionization, feature stores, or high-level MLOps tooling, as the interview process probes deep, low-level systems knowledge.
If you are interested in this role, please apply with your resume through this site.
Disclaimer
No terminology in this advert is intended to discriminate on the grounds of age, sex, race, religion or belief, disability, or pregnancy.