CarltonOne

carltonone.com

1 Job

203 Employees

About the Company

CarltonOne offers the world's most powerful eCommerce and Engagement platform for creating B2B employee recognition, customer loyalty, rewards, and sales/channel incentive programs. Recognized as one of the top 50 most inspiring workplaces in North America, CarltonOne helps our partners and clients operate programs with over 10 million rewards in over 185 countries. Every transaction on our platform fuels our Evergrow sustainability mission to fight climate change with a unique eco-action business model that is funding the planting of millions of trees every year.

Listed Jobs

Company Name
CarltonOne
Job Title
Site Reliability Engineering Manager
Job Description
**Job Title**: Site Reliability Engineering Manager

**Role Summary**: Lead a team of Site Reliability Engineers to design, build, and operate cloud‑native infrastructure that is reliable, scalable, and secure. Own end‑to‑end incident management, observability, automation, and capacity planning while aligning reliability goals with business objectives across engineering, DevOps, security, and product functions.

**Expectations**:
- Manage and grow a high‑performance SRE team, cultivating ownership, continuous learning, and operational excellence.
- Define SRE strategy, set SLOs/SLIs, and manage error budgets to balance innovation and reliability.
- Drive incident response, blameless post‑mortems, and continuous improvement of reliability metrics.
- Champion automation to reduce toil, optimize cost, and enhance system resilience.

**Key Responsibilities**:
1. **Leadership & Strategy**
   - Mentor and grow SREs, leading performance reviews and career development.
   - Create and enforce SRE strategy, including SLOs, SLIs, and error‑budget management.
   - Collaborate with Engineering, DevOps, Security, and Product to align reliability objectives with business goals.
2. **Incident Management**
   - Own incident lifecycle: detection, response, post‑mortem, and follow‑up.
   - Develop runbooks, playbooks, and escalation procedures.
   - Track metrics (MTTR, MTTD, frequency, severity) and report trends.
3. **Monitoring & Observability**
   - Design monitoring architecture using Datadog, Grafana, CloudWatch, Prometheus, Rapid7 InsightCloudSec, Wiz, Cloudflare.
   - Set alert thresholds, create dashboards, and ensure actionable visibility into system health.
4. **Automation & Optimization**
   - Reduce manual toil via IaC (Terraform, CloudFormation, Helm), CI/CD pipelines (Bamboo, Jenkins, Ansible), and scripting.
   - Right‑size infrastructure, monitor cost, and drive performance tuning.
5. **Security & Compliance**
   - Implement IAM, RBAC, encryption, and vulnerability management best practices.
   - Use security monitoring tools and integrate compliance checks.
6. **Disaster Recovery & Capacity Planning**
   - Plan, test, and maintain RPO/RTO objectives.
   - Forecast capacity to support business growth and adjust resources proactively.

**Required Skills**:
- 7+ years in cloud infrastructure, DevOps, or SRE; 2+ years in leadership.
- Deep expertise in AWS (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda).
- Strong Kubernetes, container orchestration, and service mesh knowledge.
- IaC proficiency: Terraform, CloudFormation, Helm.
- CI/CD & automation experience: Bamboo, Jenkins, Ansible.
- Monitoring proficiency: Datadog, Grafana, CloudWatch, Prometheus.
- Networking fundamentals: TCP/IP, DNS, load balancing, CDN.
- Incident management and crisis leadership.
- Excellent communication, stakeholder management, and cross‑functional collaboration.
- Strategic thinking with focus on long‑term reliability and scalability.

**Required Education & Certifications**:
- Bachelor’s degree in Computer Science, Engineering, or related field.
- AWS Certified Solutions Architect or equivalent is preferred.
- SRE Practitioner, CKA, CKAD, or similar certifications are a plus.
Markham, Canada
Hybrid
Senior
15-11-2025