UK Health Security Agency

Senior Specialist Engineer (SRE)

On site

Sutton at Hone, United Kingdom

Senior

Full Time

02-12-2025

Skills

Incident Response, CI/CD, Monitoring, Problem-solving, Research, Artificial Intelligence, Infrastructure as Code

Job Specifications

This role is being offered as hybrid working based at any of our core HQs.

We offer great flexible working opportunities at UKHSA and operate using a hybrid working model where business needs allow. This provides us with greater flexibility about how and where we work, to get the best from our workforce. As a hybrid worker, you will be expected to spend a minimum of 60% of your contractual working hours (approximately 3 days a week pro rata, averaged over a month) working at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, and London).

Our core HQ offices are modern and newly refurbished, with excellent city centre transport links, and benefit from co-location with other government departments such as the Department of Health and Social Care (DHSC).

Job Summary

The Digital and Data Directorate has primary responsibility for scientific and research computing services and support. The key functions of the Digital Development and Operations unit are to provide and support such platforms required by the staff of The UK Health Security Agency, and to provide the technical capabilities to enable public health services, both within the Organisation and between the Organisation and its customers and stakeholders.

As a Specialist Site Reliability Engineer (SRE) you will:

Remediate infrastructure and operational problems
Leverage automation and Continuous Integration/Continuous Delivery (CI/CD) to ensure our services run reliably, are scalable, and perform optimally
Monitor and manage these aspects while taking responsibility for multiple cloud infrastructure services
Observe systems to prioritise operational service and performance improvements that meet or exceed SLOs (Service Level Objectives)

The role will be responsible to the Principal Specialist Engineer SRE and is part of the High Performance Computing, Site Reliability Engineering, Artificial Intelligence (HPC/SRE/AI) & Research Computing Unit, whose remit is to:

Architect, develop & manage multi-cloud HPC platforms and on-premise infrastructure
Ensure services are highly available, scalable and resilient
Manage performance, capability and capacity planning
Support UKHSA's AI requirements

This role attracts a Market Pay Supplement of up to £5,000.

Working for your organisation

We pride ourselves on being an employer of choice, where Everyone Matters, promoting equality of opportunity and actively encouraging applications from everyone, including groups currently underrepresented in our workforce.

UKHSA's ethos is to be an inclusive organisation for all our staff and stakeholders, and to create, nurture and sustain an inclusive culture where differences drive innovative solutions to meet the needs of our workforce and wider communities. We do this by celebrating and protecting differences, removing barriers, and promoting equity and equality of opportunity for all.

Please visit our careers site for more information https://gov.uk/ukhsa/careers

Job Description

We are seeking a highly motivated and experienced SRE to join our HPC & SRE engineering team. As an SRE, you will play a critical role in ensuring the stability, scalability, and performance of our services. You will combine software engineering and systems engineering to build, improve and run reliable, scalable production systems.

Key Responsibilities

Service Reliability & Performance

Ensure services are stable, scalable, and performant through engineering best practices and system design.
Proactively identify and address system bottlenecks using advanced problem-solving and performance tuning techniques.
Conduct capacity planning and implement solutions to ensure systems can support current and future workloads.

Incident Response & Troubleshooting

Respond swiftly to production incidents, ensuring minimal downtime and quick restoration of services.
Perform root cause analysis and postmortems, implementing lessons learned to prevent recurrence.

Monitoring, Alerting & Observability

Contribute to the design and implementation of effective monitoring and alerting systems using tools and dashboards.
Improve observability of services, ensuring issues are identified and addressed before impacting users.
Continuously refine monitoring practices to reduce alert fatigue and improve response times.

Automation & Tooling

Develop automation to eliminate manual, repetitive tasks and improve operational efficiency.
Write clear, maintainable, and well-tested code to support automation efforts and system tooling.
Drive initiatives to reduce operational toil and improve reliability through Infrastructure as Code (IaC).

Service Level Objectives & Operational Improvements

Contribute to the definition, tracking, and continuous improvement of SLOs, Service Level Indicators (SLIs), and error budgets.
Identify and prioritise operational improvements that align with business goals and user experience.

SRE Best Practices & Advocacy

Helping to evangel

About the Company

The UK Health Security Agency (UKHSA) is an executive agency of the Department of Health and Social Care. UKHSA is responsible for planning, preventing and responding to external health threats, and providing intellectual, scientific and operational leadership at national and local level, as well as on the global stage.