**Company Name**: CarltonOne
**Job Title**: Site Reliability Engineering Manager
**Role Summary**: Lead a team of Site Reliability Engineers to design, build, and operate cloud‑native infrastructure that is reliable, scalable, and secure. Own end‑to‑end incident management, observability, automation, and capacity planning while aligning reliability goals with business objectives across engineering, DevOps, security, and product functions.
**Expectations**:
- Manage and grow a high‑performance SRE team, cultivating ownership, continuous learning, and operational excellence.
- Define SRE strategy, set SLOs/SLIs, and manage error budgets to balance innovation and reliability.
- Drive incident response, blameless post‑mortems, and continuous improvement of reliability metrics.
- Champion automation to reduce toil, optimize cost, and enhance system resilience.
**Key Responsibilities**:
1. **Leadership & Strategy**
- Mentor and develop SREs; lead performance reviews and career development.
- Create and enforce SRE strategy, including SLOs, SLIs, and error‑budget management.
- Collaborate with Engineering, DevOps, Security, and Product to align reliability objectives with business goals.
2. **Incident Management**
- Own the full incident lifecycle: detection, response, post‑mortem, and follow‑up.
- Develop runbooks, playbooks, and escalation procedures.
- Track incident metrics (MTTR, MTTD, incident frequency, and severity) and report trends.
3. **Monitoring & Observability**
- Design the monitoring architecture using Datadog, Grafana, CloudWatch, Prometheus, Rapid7 InsightCloudSec, Wiz, and Cloudflare.
- Set alert thresholds, create dashboards, and ensure actionable visibility into system health.
4. **Automation & Optimization**
- Reduce manual toil via IaC (Terraform, CloudFormation, Helm), CI/CD and automation tooling (Bamboo, Jenkins, Ansible), and scripting.
- Right‑size infrastructure, monitor cost, and drive performance tuning.
5. **Security & Compliance**
- Implement IAM, RBAC, encryption, and vulnerability management best practices.
- Use security monitoring tools and integrate compliance checks.
6. **Disaster Recovery & Capacity Planning**
- Plan, test, and maintain disaster recovery procedures that meet RPO/RTO targets.
- Forecast capacity to support business growth and adjust resources proactively.
**Required Skills**:
- 7+ years of experience in cloud infrastructure, DevOps, or SRE; 2+ years in a leadership role.
- Deep expertise in AWS (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda).
- Strong Kubernetes, container orchestration, and service mesh knowledge.
- IaC proficiency: Terraform, CloudFormation, Helm.
- CI/CD & automation experience: Bamboo, Jenkins, Ansible.
- Monitoring proficiency: Datadog, Grafana, CloudWatch, Prometheus.
- Networking fundamentals: TCP/IP, DNS, load balancing, CDN.
- Incident management and crisis leadership.
- Excellent communication, stakeholder management, and cross‑functional collaboration.
- Strategic thinking with a focus on long‑term reliability and scalability.
**Required Education & Certifications**:
- Bachelor’s degree in Computer Science, Engineering, or related field.
- AWS Certified Solutions Architect or equivalent is preferred.
- SRE Practitioner, CKA, CKAD, or similar certifications are a plus.