- Company Name
- Capital on Tap
- Job Title
- Site Reliability Engineer
- Job Description
-
Job Title: Site Reliability Engineer
Role Summary:
Design, implement, and maintain highly available, scalable infrastructure for Capital on Tap’s banking and repayments platform. Drive reliability by setting SLAs, SLIs, and SLOs, automating operations, and collaborating cross‑functionally to reduce toil and improve system visibility.
Expectations:
- Deliver robust, fast, and reliable services for thousands of customers.
- Own end‑to‑end incident response, root‑cause analysis, and post‑incident reviews.
- Lead automation initiatives to increase productivity and reduce manual work.
- Align reliability goals with product objectives and stakeholder expectations.
Key Responsibilities:
- Manage Azure resources, including virtual networks, storage, and compute services.
- Configure and monitor Azure Monitor, Datadog, NGINX, and Cloudflare for performance and security.
- Provision and maintain Kubernetes clusters and serverless workloads; deploy via Helm or CRDs/Crossplane.
- Write and maintain IaC using Terraform, Pulumi, or similar; enforce code review and delivery through CI/CD pipelines.
- Develop and enforce SLI/SLO metrics; integrate with monitoring dashboards and alerting systems.
- Collaborate with Platform, DevOps, and Product teams to automate pipelines (Azure DevOps, Octopus Deploy, Flux).
- Conduct incident investigations, produce runbooks, and implement preventive measures.
- Evaluate and prototype new architecture proposals; recommend improvements to scalability and fault tolerance.
- Mentor junior engineers and share knowledge on observability, chaos engineering, and deployment practices.
Required Skills:
- Proven experience managing Azure environments and working with Azure DevOps, Octopus Deploy, Flux, or equivalent CI/CD tooling.
- Strong Linux and Windows administration background.
- Expertise in IaC (Terraform, Pulumi, or similar) and in orchestration with Kubernetes, Docker, and serverless frameworks.
- Experience with monitoring solutions (Datadog preferred).
- Scripting proficiency in Go, PowerShell, Python, or C#; ability to write automation, monitoring, and deployment scripts.
- Excellent communication, teamwork, and stakeholder‑management skills.
- Familiarity with observability, tracing, and SLI/SLO/SLM concepts.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or related field (preferred).
- Relevant cloud certifications (e.g., Microsoft Certified: Azure Administrator Associate, Azure Solutions Architect Expert) are a plus.
---