- Company Name
- SS&C Technologies
- Job Title
- Site Reliability Engineer, Cloud Incident Response
- Job Description
-
Job title: Site Reliability Engineer, Cloud Incident Response
Role Summary: Lead the design, operation, and continuous improvement of production systems to ensure high availability, low MTTR, and robust observability across Kubernetes, AWS, and IaC pipelines.
Expectations: Deliver rapid incident response, maintain and enhance monitoring dashboards, drive automation, collaborate on CI/CD, and embed reliability best practices across cross‑functional teams.
Key Responsibilities:
- Monitor, troubleshoot, and resolve incidents across cloud services, Kubernetes clusters, and application infrastructure.
- Build and maintain dashboards and alerts in Grafana, Datadog, Splunk, and Prometheus/OpenTelemetry for anomaly detection and root‑cause analysis.
- Operate and harden EKS/Kubernetes clusters: deployments, autoscaling, rollouts, service mesh, ingress, upgrades, and security hardening.
- Design and manage AWS workloads (EC2/EKS, RDS/Aurora, VPC, IAM, ALB/NLB, CloudWatch, S3) with a security‑first approach.
- Codify infrastructure using Terraform (modules, workspaces, remote state, policy as code) and integrate IaC into CI pipelines.
- Partner with product and engineering teams to improve GitHub Actions/Jenkins/Argo CD pipelines, progressive delivery, and change management.
- Define, track, and iterate on SLOs, SLIs, and error budgets; lead blameless post‑mortems and reliability reviews.
- Participate in a defined on‑call rotation, refine runbooks, and improve alert quality for a sustainable on‑call experience.
- Identify systemic reliability gaps and implement durable fixes spanning architecture, capacity planning, caching, resilience patterns, chaos engineering, and rate limiting.
Required Skills:
- 5+ years in SRE, DevOps, or production systems engineering.
- Hands‑on observability expertise with Grafana, Datadog, Splunk, Prometheus, OpenTelemetry, and log‑streaming tools.
- Strong Kubernetes (EKS preferred) knowledge: controllers, networking, storage, HPA/VPA, Helm, and troubleshooting.
- Practical AWS skills covering networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, and security best practices.
- Terraform proficiency: module design, state management, DRY patterns, and CI for IaC.
- Scripting in Python, Go, or Bash; solid knowledge of Linux, networking (DNS, TLS, HTTP, TCP), and Git.
- Proven on‑call experience with effective incident communication and post‑incident follow‑through.
- Strong collaboration and technical documentation skills (runbooks, RFCs, cross‑team influence).
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or related field (equivalent experience acceptable).
- Relevant certifications preferred: AWS Certified Solutions Architect or DevOps Engineer, CKA/CKAD, Terraform Associate.