- Company Name
- Magnet Forensics
- Job Title
- Technical Manager – Site Reliability Engineering (SRE)
- Job Description
**Job Title**: Technical Manager – Site Reliability Engineering (SRE)
**Role Summary**: Lead the creation and expansion of a central Site Reliability Engineering function for a global SaaS organization. Build and manage a high‑performing SRE team, define reliability practices, and drive operational excellence across cloud‑based production platforms.
**Expectations**:
- Establish a top‑tier SRE organization from scratch.
- Own the end‑to‑end reliability strategy, including incident response, observability, and automation.
- Coach and mentor a team of SREs and engineers, fostering accountability and psychological safety.
- Collaborate cross‑functionally with Engineering, Product, Security, and Operations to align reliability initiatives with business priorities.
- Provide hands‑on technical guidance on infrastructure, code, and architectural decisions.
**Key Responsibilities**:
1. Recruit, onboard, and grow a high‑performing SRE team.
2. Define and continuously evolve SRE standards, processes, and tooling.
3. Drive automation in deployment pipelines, monitoring, and operational workflows.
4. Improve observability, incident response, and reliability of production services.
5. Participate in incident investigations and post‑mortems to reinforce learning cycles.
6. Write production‑grade code (Python or equivalent) for tooling and automation.
7. Partner with Engineering, Product, and Security to balance reliability with feature delivery.
**Required Skills**:
- Proven leadership of SRE or Production Engineering teams at a SaaS company.
- Strong technical foundation in cloud infrastructure (AWS preferred), distributed systems, and DevOps/SRE practices.
- Coding proficiency in Python or a modern high‑level language; experience building automation tools.
- Mastery of observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry) and infrastructure as code (IaC) tooling (Terraform, AWS CDK).
- Deep knowledge of incident management, reliability engineering, and continuous improvement.
- Ability to mentor, coach, and foster an accountable, learning‑oriented culture.
- Bias toward automation, repeatability, and operational excellence.
- Bonus: experience with regulatory compliance or high‑availability requirements.
**Required Education & Certifications**:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator, or equivalent) are a plus.