- Company Name
- Avalara
- Job Title
- Senior Site Reliability Engineer
- Job Description
-
Job Title: Senior Site Reliability Engineer
Role Summary:
Senior Site Reliability Engineer responsible for building and scaling reliable, AI‑driven systems in a global SaaS environment. Designs and implements automation, observability, and self‑healing pipelines that integrate model‑centric and agentic AI, while managing SLOs, SLIs, and incident response.
Expectations:
- 5+ years in large‑scale SaaS or distributed systems.
- Bachelor’s degree in Computer Science, Engineering, or equivalent technical experience.
- Willingness to participate in a rotating on‑call schedule.
- Strong commitment to reducing toil, measuring everything, and building autonomous reliability.
Key Responsibilities:
- Build AI‑powered reliability systems meeting MVR and SMM standards.
- Implement agentic AI workflows (LangChain, n8n, MCP servers, custom agents) for incident analysis, assessment, and resolution.
- Design AI‑driven observability stacks (Prometheus, Grafana, Loki, Tempo, OpenTelemetry) with predictive analytics and ML anomaly detection.
- Orchestrate reliability operations using AI Flow tools (n8n, Airplane.dev, Temporal.io) for alert remediation, data enrichment, and incident collaboration.
- Automate infrastructure provisioning, remediation, and observability pipelines with Go, Python, or Terraform.
- Operate and extend MCP servers to connect AI agents with production telemetry.
- Define and manage SLOs, SLIs, and SLAs; improve signal quality via ML‑based alert noise reduction and event correlation.
- Troubleshoot production systems using AI‑assisted diagnostics, LLM copilots, and pattern recognition on logs, traces, and metrics.
- Work with development teams to integrate AI reliability feedback into CI/CD pipelines.
- Mentor engineers on AIOps practices and contribute to the global AI Reliability Playbook.
Required Skills:
- Agentic AI & AIOps: MCP servers, AI Flow tools, predictive maintenance, anomaly detection, automated root cause analysis.
- Software Engineering: Go, Python, automation frameworks, API integrations.
- Observability & Monitoring: Prometheus, Grafana, Loki, Tempo, OpenTelemetry, ML‑based metric analysis.
- Infrastructure as Code: Terraform or Pulumi; modern CI/CD (GitLab preferred).
- Cloud Platforms: AWS, GCP, Oracle Cloud, or Azure; multi‑cloud reliability focus.
- Container Orchestration: Kubernetes, Docker; understanding of low‑level container internals.
- Linux Administration: hardening, tuning, troubleshooting.
- Networking: OSI model, TCP/IP, DNS, load‑balancing in cloud‑native environments.
- Automation & Workflows: n8n, Airplane.dev, LangChain, custom AI flow builders.
- Documentation & Communication: clear, precise reporting to customers and partners.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or equivalent technical experience.
- Certifications in cloud platforms (AWS, GCP, Azure, Oracle) and IaC tools (Terraform, Pulumi) are advantageous.
---