Job Specifications
About The Opportunity
We are an established player in Financial Technology and Enterprise Cloud Infrastructure, delivering resilient, high-throughput systems that support mission-critical institutional workloads. We operate large-scale distributed services and are investing in reliability, observability, and automation to meet aggressive SLAs across the business.
Location: United States (On-site)
Role & Responsibilities
Lead and grow a high-performing Site Reliability Engineering team responsible for production availability, incident response, and operational excellence.
Define and own SLIs, SLOs, SLA frameworks, and a reliability roadmap; translate business requirements into measurable reliability targets.
Drive incident management and postmortem culture: lead major incidents, coordinate cross-functional response, and implement corrective actions to eliminate repeat failures.
Architect and implement observability, monitoring, and alerting solutions that provide actionable signals (metrics, logs, traces) and reduce MTTD/MTTR.
Improve platform scalability and resilience through automation, CI/CD pipelines, infrastructure-as-code, capacity planning and performance testing.
Partner with Engineering, Security, and Product teams to influence architecture, deploy robust runbooks, and bake reliability into the development lifecycle.
Skills & Qualifications
Must-Have
Kubernetes
Docker
Prometheus
Grafana
Terraform
AWS
Preferred
Go
Python
Jenkins
Qualifications & Experience
Proven experience leading SRE/Platform teams in production, with a track record of owning reliability for distributed systems.
Strong understanding of incident management, postmortem discipline, capacity planning, and on-call rotations.
Hands-on experience with cloud-native architectures, IaC, and CI/CD practices; able to both lead strategy and contribute technically.
Benefits & Culture Highlights
Opportunity to shape reliability for large-scale, mission-critical systems with measurable business impact.
Collaborative engineering culture that prioritizes automation, continuous improvement, and transparent postmortems.
On-site team environment focused on mentorship, career growth, and technical leadership.
We seek a strategic SRE leader who combines deep operational expertise with people leadership to drive measurable uptime and velocity improvements. If you are passionate about observability, incident prevention, and building reliable cloud platforms, we want to hear from you.
Skills: Kubernetes, Docker, Prometheus, Grafana, Terraform, AWS, CI/CD, cloud, reliability
About the Company
Black Rock Groups Inc | Elevating Human Resource Solutions in the U.S.
At Black Rock Groups Inc, we specialize in providing top-tier human resource services tailored to meet the evolving needs of businesses across the United States. Our expertise spans talent acquisition, workforce management, employee engagement, compliance, and strategic HR consulting.
We empower organizations by delivering customized HR solutions that drive efficiency, productivity, and long-term growth. Whether you're a startup looking to build a stron...