Skills

Communication, Python, Bash, PowerShell, SQL, MySQL, Splunk, CI/CD, Kubernetes, Monitoring, Ansible, Test, Agile, Analytics, OpenShift, Chaos Engineering, Prometheus, Grafana, Infrastructure as Code

Job Specifications

Shift Pattern:

Standard 40 Hour Week (United Kingdom)

Scheduled Weekly Hours:

40

Corporate Grade:

D - Assistant Vice President

Reporting Line:

(UK Division) Information Technology

Location:

UK-London

Worker Type:

Permanent

Overall Purpose of Role

Deliver Level 2 and Level 3 technical support with a strong focus on reliability, resilience, and automation for the LME’s Middle Office, Back Office, and Market Data mission-critical applications. This role blends traditional application support with SRE principles and platform engineering practices to ensure stability, scalability, and continuous improvement across systems serving internal teams and external clients.

Core Responsibilities

Reliability Engineering: Embed SRE best practices into operational workflows, including error budgets, SLIs/SLOs, and proactive monitoring to improve system uptime and performance.
Design, build, migrate, support, optimise and manage our physical, virtual, and containerised OpenShift and Kubernetes environments for scalability, resilience, and operational efficiency, targeting ‘five nines’ (99.999%) operational availability.
Deliver technical support within a project-based framework to ensure successful application rollouts.
Support project delivery across Waterfall and Agile frameworks, with an emphasis on Hybrid approaches to ensure both flexibility and efficiency.
Prioritise and resolve incidents across the full application suite, ensuring rapid recovery and root cause analysis.
Identify and implement service improvements with measurable outcomes, focusing on automation and reducing toil.
Manage day-to-day production incidents and validate changes through QA and automated testing pipelines.
Troubleshoot issues across network, database, infrastructure, and application layers.
Actively contribute to incident, change, and problem management processes.

Key Accountabilities:

Provide support and maintenance for mission-critical applications across Pre- and Post-Trade business units.
Support delivery of regulatory, market growth, infrastructure, and security projects.
Monitor and optimise application performance using observability tools and proactive tuning.
Maintain and support both test and production environments for stability and readiness.
Champion automation, CI/CD, and self-healing systems to reduce manual intervention.
Oversee end-to-end release management, ensuring smooth deployments with minimal risk.
Drive continuous improvement by evaluating and enhancing support processes.
Maintain up-to-date documentation for all supported systems and platforms.
Lead operational resiliency exercises, including disaster recovery and chaos engineering tests.
Identify, manage, and remediate security vulnerabilities across systems and applications.

Technical Responsibilities

Maintain and regularly test disaster recovery procedures.
Recommend and implement standards to enhance environment efficiency and resilience.
Validate system builds against operational and reliability requirements.
Respond promptly to production issues, ensuring resolution and stakeholder communication.
Support 24/7/365 availability for production systems. This involves working flexible shift patterns (07:00–16:00 and 10:00–19:00).
Participate in the on-call and weekend rota to provide out-of-hours coverage alongside the HK team.
CI/CD Pipeline Management: Design, implement, and maintain pipelines using Bamboo alongside Bitbucket.
Infrastructure as Code (IaC): Champion IaC and help to build and manage our application and environment releases via Ansible Tower.
Platform Management and Availability: Build, migrate, support, optimise and manage our physical, virtual, and containerised OpenShift and Kubernetes environments for scalability, resilience, and operational efficiency, targeting ‘five nines’ (99.999%) operational availability.
Monitoring & Observability: Design, implement, and maintain observability stacks with a primary focus on Grafana and Prometheus for real-time metrics and dashboards, complemented by Splunk for log analytics and incident investigation. Define and track SLIs/SLOs to ensure reliability and performance across platforms.
Implement and integrate proactive monitoring rules within the Grafana stack to predict potential system failures or performance degradation, enabling early intervention and improved service resilience.
Automation: Automate repetitive tasks using Python, Bash, or PowerShell.

Working with others:

Internal production teams
Business stakeholders
Project teams
Risk, Security, and GRC teams
External vendors and auditors

PERSON SPECIFICATION:

Qualifications

Degree in Computer Science or a related discipline OR 5+ years equivalent professional experience

Preferred Experience

Strong SQL and database expertise (MySQL, Oracle, Liquibase).
Experience with CI/CD, IaC, containerisation, and orchestration tools.
Strong e

About the Company

HKEX Group is a global exchange group, operating dynamic and integrated financial markets in Asia and Europe. From our home in the financial hub of Hong Kong and an additional base in London, we provide world-class facilities for trading and clearing securities and derivatives in Equities, Commodities, Fixed Income and Currency. Uniquely positioned at the intersection of Chinese and international capital flows, Hong Kong has long been Connecting China with the World. With the accelerated opening-up of China's capital marke...