- Company Name
- TP-Link Systems Inc.
- Job Title
- Senior Site Reliability Engineer
- Job Description
-
Role Summary:
Lead architecture, deployment, and maintenance of microservices on multi‑cloud Kubernetes platforms to deliver secure, highly available, and scalable cloud services. Serve as technical subject matter expert, driving observability, capacity planning, incident response, and compliance across AWS, OCI, Azure, and GCP environments.
Expectations:
- Deliver end‑to‑end reliability for production microservices.
- Own incident response, root‑cause analysis, and post‑mortem documentation.
- Maintain and evolve cloud operations best practices, KPIs (SLA/SLO/SLI), and security compliance (ISO‑27001, SOC‑2, GDPR).
- Mentor junior engineers and lead process improvement initiatives.
- Participate in on‑call rotation, including after‑hours and weekend support.
Key Responsibilities:
- Design, implement, and operate Kubernetes‑based microservice deployments across multi‑cloud environments.
- Collaborate with Cloud, DevOps, and Development teams to deploy services, integrate CI/CD pipelines, and enforce security standards.
- Conduct load, chaos, and performance tests to validate scalability and availability.
- Build and maintain the observability stack (monitoring, tracing, log aggregation, and alerting) for cloud platforms.
- Develop, execute, and maintain disaster‑recovery plans and fail‑over procedures.
- Automate operational tasks using Python, Go, or Bash scripts.
- Define, track, and report KPIs (SLA/SLO/SLI) aligned with business metrics.
- Create and maintain technical documentation: architecture diagrams, design documents, SOPs, and compliance artifacts.
- Enforce security practices, including IAM, network security, and data protection controls.
- Lead post‑incident investigations, identify root causes, and implement mitigations.
- Evaluate new technologies, lead POCs, and recommend tooling or platform enhancements.
- Mentor and train less experienced team members on SRE practices and tools.
Required Skills:
- 5+ years of Site Reliability Engineering experience.
- Proficiency in at least one scripting or programming language (Python, Go, Bash, PowerShell).
- Strong understanding of Kubernetes, container orchestration, and cloud platforms (AWS, OCI, Azure, GCP).
- Experience with observability tools (Prometheus, Grafana, OpenTelemetry, ELK stack, etc.).
- Knowledge of cloud security, IAM, network and application security, and data protection.
- Familiarity with disaster‑recovery planning, chaos engineering, and performance testing.
- Ability to define and measure SLAs/SLOs/SLIs and communicate metrics to stakeholders.
- Excellent problem‑solving, analytical, and communication skills.
- Self‑motivated and capable of working independently and within cross‑functional teams.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Information Technology, or related field.
- Preferred: Expert‑level cloud certifications: AWS Solutions Architect (Professional), Azure Solutions Architect Expert, GCP Professional Cloud Architect.