Skills

Communication, Python, Bash, PowerShell, SQL, MySQL, Splunk, CI/CD, Kubernetes, Monitoring, Ansible, Test, Agile, Analytics, OpenShift, Chaos Engineering, Prometheus, Grafana, Infrastructure as Code

Job Specifications

Shift Pattern:

Standard 40 Hour Week (United Kingdom)

Scheduled Weekly Hours:

40

Corporate Grade:

D - Assistant Vice President

Reporting Line:

(UK Division) Information Technology

Location:

UK-London

Worker Type:

Permanent

Overall Purpose of Role

Deliver Level 2 and Level 3 technical support with a strong focus on reliability, resilience, and automation for the LME’s Middle Office, Back Office, and Market Data mission-critical applications. This role blends traditional application support with SRE principles and platform engineering practices to ensure stability, scalability, and continuous improvement across systems serving internal teams and external clients.

Core Responsibilities

Reliability Engineering: Embed SRE best practices into operational workflows, including error budgets, SLIs/SLOs, and proactive monitoring to improve system uptime and performance.
Design, build, migrate, support, optimise and manage our physical, virtual, and containerised OpenShift and Kubernetes environments for scalability, resilience, and operational efficiency, with a focus on ‘five nines’ (99.999%) operational availability.
Deliver technical support within a project-based framework to ensure successful application rollouts.
Support project delivery across Waterfall and Agile frameworks, with an emphasis on Hybrid approaches to ensure both flexibility and efficiency.
Prioritise and resolve incidents across the full application suite, ensuring rapid recovery and root cause analysis.
Identify and implement service improvements with measurable outcomes, focusing on automation and reducing toil.
Manage day-to-day production incidents and validate changes through QA and automated testing pipelines.
Troubleshoot issues across network, database, infrastructure, and application layers.
Actively contribute to incident, change, and problem management processes.

Key Accountabilities:

Provide support and maintenance for mission-critical applications across Pre- and Post-Trade business units.
Support delivery of regulatory, market growth, infrastructure, and security projects.
Monitor and optimise application performance using observability tools and proactive tuning.
Maintain and support both test and production environments for stability and readiness.
Champion automation, CI/CD, and self-healing systems to reduce manual intervention.
Oversee end-to-end release management, ensuring smooth deployments with minimal risk.
Drive continuous improvement by evaluating and enhancing support processes.
Maintain up-to-date documentation for all supported systems and platforms.
Lead operational resiliency exercises, including disaster recovery and chaos engineering tests.
Identify, manage, and remediate security vulnerabilities across systems and applications.

Technical Responsibilities

Maintain and regularly test disaster recovery procedures.
Recommend and implement standards to enhance environment efficiency and resilience.
Validate system builds against operational and reliability requirements.
Respond promptly to production issues, ensuring resolution and stakeholder communication.
Support 24/7/365 availability of production systems. This involves working flexible shift patterns (07:00–16:00 and 10:00–19:00) and participating in an on-call and weekend rota for out-of-hours coverage, alongside the HK team.
CI/CD Pipeline Management: Design, implement, and maintain pipelines using Bamboo alongside Bitbucket.
Infrastructure as Code (IaC): Champion IaC and help to build and manage our application and environment releases via Ansible Tower.
Platform Management and Availability: Build, migrate, support, optimise and manage our physical, virtual, and containerised OpenShift and Kubernetes environments for scalability, resilience, and operational efficiency, with a focus on ‘five nines’ (99.999%) operational availability.
Monitoring & Observability: Design, implement, and maintain observability stacks with a primary focus on Grafana and Prometheus for real-time metrics and dashboards, complemented by Splunk for log analytics and incident investigation. Define and track SLIs/SLOs to ensure reliability and performance across platforms.
Implement and integrate “offensive” (proactive) monitoring rules within the Grafana stack to predict potential system failures or performance degradation, enabling early intervention and improved service resilience.
Automation: Automate repetitive tasks using Python, Bash, or PowerShell.

Working with others:

Internal production teams
Business stakeholders
Project teams
Risk, Security, and GRC teams
External vendors and auditors

PERSON SPECIFICATION:

Qualifications

Degree in Computer Science or a related discipline OR 5+ years equivalent professional experience

Preferred Experience

Strong SQL and database expertise (MySQL, Oracle, Liquibase).
Experience with CI/CD, IaC, containerisation, and orchestration tools.
Strong e

About the Company

The London Metal Exchange (LME) is the world centre for industrial metals trading. Most of the world's global non-ferrous futures business is conducted on our markets, totalling US$15.2 trillion of notional value, 3.1 billion tonnes and 145 million lots last year. Participants from non-ferrous, ferrous, electric vehicle and financial communities use the LME to mitigate or take on risk using globally trusted prices discovered on our markets. LME futures contracts are underpinned by a world-wide network of warehouses ensur...