- Company Name
- Nubank
- Job Title
- Staff Machine Learning Engineer (Infrastructure)
- Job Description
**Job Title:** Staff Machine Learning Engineer (Infrastructure)
**Role Summary:**
Design, build, and operate scalable, high‑performance AI/ML infrastructure on cloud platforms. Enable data scientists and ML engineers to train, evaluate, and serve models reliably and cost‑effectively across the organization. Lead technical direction, mentor team members, and ensure infrastructure aligns with product needs and operational excellence.
**Expectations:**
- Demonstrate deep expertise in distributed systems, cloud infrastructure, and production‑grade ML pipelines.
- Deliver robust, observable, and fault‑tolerant solutions for training and inference workloads.
- Own end‑to‑end infrastructure components with a strong product mindset.
- Contribute to architectural decisions, mentor junior engineers, and uphold high engineering standards.
**Key Responsibilities:**
- Architect and implement cloud‑native, multi‑region AI infrastructure (GCP/AWS, Kubernetes, GPU/CPU orchestration).
- Develop and maintain automated pipelines for model training, evaluation, deployment, and monitoring.
- Optimize AI workloads using techniques such as parameter‑efficient fine‑tuning (PEFT), kernel fusion, mixed‑precision training, and pipeline parallelism.
- Build infrastructure‑as‑code (Terraform, Pulumi) and enforce CI/CD practices for reproducibility.
- Implement comprehensive observability (metrics, logging, alerting) for batch and real‑time systems.
- Collaborate with cross‑functional AI teams to gather requirements and ensure platform usability.
- Lead performance tuning, cost‑optimization, and scaling initiatives.
**Required Skills:**
- Strong background in systems and infrastructure engineering (distributed systems, scalability, reliability).
- Proven experience designing, operating, and optimizing production ML pipelines.
- Proficiency in Python and Go (or comparable languages) with clean, testable code practices.
- Hands‑on expertise with cloud platforms (GCP or AWS), Kubernetes, GPU/CPU orchestration, and IaC tools (Terraform, Pulumi).
- Knowledge of AI workload optimizations: PEFT, kernel fusion, mixed‑precision, resource scheduling.
- Solid experience with observability tooling (monitoring, alerting, logging) and with designing fault‑tolerant systems.
- Ability to work in high‑impact, cross‑functional teams and mentor peers.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Computer Engineering, Software Engineering, or a related technical field (Master’s preferred).
- Relevant cloud certifications (e.g., Google Cloud Professional Cloud Architect, AWS Certified Solutions Architect) are a plus but not mandatory.