- Company Name
- Loopio
- Job Title
- Senior Manager, Software Engineering (Infrastructure)
- Job Description
-
Job title: Senior Manager, Software Engineering (Infrastructure)
Role Summary: Lead and grow multiple engineering teams focused on Site Reliability Engineering, Cloud Infrastructure, and MLOps. Own the design, build, and operation of production systems to ensure reliability, scalability, and cost efficiency while enabling advanced AI and agentic workflows across the platform.
Expectations: Deliver high‑performance, resilient infrastructure that serves the engineering organization as a platform‑as‑a‑product. Align technical roadmaps with business objectives, secure senior stakeholder confidence, and continuously improve operational excellence through data‑driven reliability and cost optimisation.
Key Responsibilities:
• Grow and mentor SRE, Cloud Infrastructure, and MLOps teams; coach managers and senior engineers.
• Design, build, and operate scalable, observable, and resilient production infrastructure.
• Define and evolve SLIs, SLOs, error budgets, and incident response processes; lead blameless post‑mortems.
• Ensure ML model inference pipelines and vector databases meet the same reliability standards as core SaaS services.
• Own cloud architecture strategy, capacity planning, disaster recovery, and business continuity.
• Drive the MLOps roadmap for model deployment, monitoring, and scaling, including LLM orchestration and RAG pipelines.
• Lead Cloud FinOps, optimise AI compute costs, and establish IaC, configuration management, and secrets handling standards.
• Partner with Security to implement secure‑by‑default infrastructure, backup, and recovery strategies.
• Communicate risks, trade‑offs, and technical priorities to senior leadership and cross‑functional teams.
• Collaborate with Product Engineering to deliver high‑impact AI features without compromising platform stability.
Required Skills:
• 8+ years in infrastructure, SRE, or cloud engineering; 3+ years leading specialized teams.
• Proficiency in AWS (preferred) and modern IaC tools (e.g., Terraform).
• Expertise in managing large‑scale containerized environments and observability stacks.
• Strong incident‑response background, including blameless post‑mortems.
• MLOps knowledge: GPU orchestration, model serving, data pipeline reliability.
• Experience with cloud budgeting and FinOps, including managing significant cloud costs.
• Excellent strategic communication and stakeholder management.
• Familiarity with AI agentic workflows or autonomous orchestration is a plus.
Required Education & Certifications:
• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent professional experience).
• Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator, Terraform Associate) preferred.