- Company Name: Red Ventures
- Job Title: Site Reliability Engineer | Growth and Transformation
- Job Description:
**Job Title**
Site Reliability Engineer – Growth and Transformation
**Role Summary**
Design, build, and operate highly available, scalable, and performant multi‑cloud services on AWS and GCP, orchestrated with Kubernetes. Lead automation, observability, and incident response to achieve 99.99% uptime and instill reliability best practices across the engineering organization.
**Expectations**
- Deliver end‑to‑end reliability for mission‑critical services.
- Proactively design and implement reliability into new features.
- Collaborate with developers, platform architects, and operations teams.
- Maintain a culture of continuous learning, experimentation, and operational excellence.
**Key Responsibilities**
- Monitor, troubleshoot, and restore production systems with minimal downtime.
- Build and maintain observability stacks (OpenTelemetry, New Relic, Grafana, log aggregation).
- Automate infrastructure, deployments, and configuration using Terraform and custom tooling.
- Define, manage, and enforce SLOs/SLIs aligned with business SLAs.
- Scale infrastructure capacity to accommodate traffic growth and peak events.
- Manage and optimize Kubernetes clusters on AWS and GCP.
- Participate in architecture reviews with a focus on reliability and scalability.
- Lead incident investigations, root‑cause analysis, and post‑mortem documentation.
- Advocate for reliability practices and embed them in application design and CI/CD pipelines.
**Required Skills**
- 3–5 years of experience in SRE, DevOps, or cloud infrastructure engineering.
- Deep experience with AWS, GCP, and Kubernetes orchestration.
- Proficiency in Infrastructure‑as‑Code (Terraform) and scripting (Python, Bash, Go).
- Strong knowledge of observability and monitoring tools (New Relic, Grafana, OpenTelemetry).
- Familiarity with CI/CD pipelines, automated deployments, and containerized workloads.
- Experience maintaining high‑availability systems (99.9% uptime or better).
- Understanding of distributed systems, microservices, and scalability patterns.
- Proven incident response and troubleshooting skills in production environments.
- Excellent communication and collaboration abilities.
**Bonus/Preferred**
- Certifications: AWS Solutions Architect, GCP Professional Cloud Architect.
- Experience with chaos engineering, resilience testing, or large‑scale load balancing.
- Exposure to Salesforce or Adobe ecosystems, log aggregation tools (ELK, Splunk), cloud security, and multi‑region networking.
- Database performance tuning expertise.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or a related field, *or* equivalent professional experience.
- Professional certifications (AWS Solutions Architect, GCP Professional Cloud Architect) are optional.