Skills

Communication, Leadership, Python, Java, Go, CI/CD, Kubernetes, Architecture, Programming, Azure, AWS, cloud platforms, Recruitment, GCP, Terraform, Prometheus, Grafana, Infrastructure as Code

Job Specifications

Site Reliability Engineer
Contract - 12 months
Inside IR35
Hybrid working
£400-550 per day depending on experience
Job Description
My client is looking for a skilled Senior Site Reliability Engineer to play a key role in improving the reliability, scalability, and operational performance of their production systems. This role works closely with product and engineering teams to enhance system reliability, architecture, deployment safety, and observability.
Role Summary
My client is seeking a Senior Site Reliability Engineer to join a centralised Technical Operations function, where you will lead reliability initiatives and support operations across a range of large-scale, customer-facing digital services.
Operating within a centralised SRE model, you will partner with product and engineering teams while maintaining shared responsibility for production reliability, resilience, and scalability. The role includes participation in an on-call rotation supporting critical services, with shared ownership of overall system health.
You will be responsible for defining reliability standards, influencing architectural improvements, managing complex incidents, and building automation to improve deployment safety and operational efficiency. Your work will directly support high-traffic systems used by a global audience.
Key Responsibilities
Reliability & Risk Engineering
My client is looking for someone who can:

Identify systemic reliability risks and drive long-term preventative improvements

Define and refine SLIs, SLOs, and error budgets aligned with business and customer outcomes

Lead complex incident management, post-incident reviews, and remediation planning

Demonstrate depth in networking fundamentals; troubleshooting network infrastructure is key

Bring experience working as a senior SRE, particularly around AWS

Architecture & Resilience
You will:

Review and influence system architecture to improve scalability, availability, and fault isolation

Design strategies for high availability, graceful degradation, and disaster recovery

Evaluate trade-offs between performance, cost, and operational risk

CI/CD & Deployment Safety
The successful candidate will:

Improve deployment pipelines and implement automation to reduce risk and accelerate delivery

Implement safe deployment strategies such as canary releases and blue/green deployments

Ensure strong rollback and recovery mechanisms

Observability & Performance
You will be expected to:

Build and enhance observability solutions including metrics, logging, and tracing

Work with teams to reduce alert fatigue and improve signal quality

Diagnose performance bottlenecks across infrastructure and applications

Infrastructure & Automation
My client is seeking someone who can:

Design and operate cloud-native, containerised workloads at scale

Use Infrastructure as Code to build and manage resilient platforms

Develop automation to reduce manual effort and operational risk

Cross-Functional Leadership
You will:

Mentor engineers and promote SRE best practices across teams

Collaborate with engineering, product, and security stakeholders to improve system reliability

Required Qualifications
My client is looking for candidates with:

A degree in Computer Science or Engineering, or equivalent practical experience

Strong experience designing and operating CI/CD systems with deployment safety practices

Excellent communication skills with the ability to influence cross-functional teams

7+ years of experience in SRE, production engineering, or systems engineering roles

Strong knowledge of distributed systems concepts, including consistency and failure handling

Hands-on experience with major cloud platforms (e.g., AWS, GCP, Azure), including multi-region environments

Strong experience with Kubernetes and container orchestration at scale

Proficiency in at least one programming language such as Go, Python, or Java

Proven experience managing high-severity incidents and leading remediation efforts

Preferred Qualifications
Ideally, candidates will also have:

Experience with multi-region or multi-cloud architectures

Familiarity with observability tools such as Prometheus, Grafana, or Datadog

Previous mentoring or technical leadership experience

Experience with Infrastructure as Code tools such as Terraform or CloudFormation

Exposure to AI-assisted tooling for incident analysis or operational efficiency

Sphere Digital Recruitment is acting as an Employm

About the Company

Sphere Digital Recruitment is a multi award-winning agency specialising in recruiting marketing, sales, analytics, product & creative talent on a permanent and contract basis in the UK, Europe and North America. It's our job to be the experts and Sphere Digital Recruitment embrace the ever-changing and growing digital world. We're proud to help businesses grow their digital teams and people take the next steps in their career. Above all, we are a firm of passionate recruiters who love to have meaningful and engaging experie...