Skills

Communication, Python, Bash, Incident Response, GitHub, CI/CD, DevOps, Docker, Kubernetes, Monitoring, Configuration Management, Jenkins, Ansible, Problem-solving, Networking, Linux, Windows, Azure, AWS, Cloud Platforms, GCP, CI/CD Pipelines, Chaos Engineering, Terraform, Prometheus, Grafana, Infrastructure as Code, Microservices, GitHub Actions

Job Specifications

ELLKAY started out providing connectivity solutions to laboratories and, within a few years, grew to also provide data management solutions to ambulatory organizations. ELLKAY is now a trusted data management partner in five healthcare segments. ELLKAY’s solutions continue to serve laboratories and ambulatory practices and have expanded to empower hospitals and health systems, healthcare IT vendors, health plans, and other healthcare organizations with cutting-edge technologies and solutions that drive their growth and interoperability strategies.

Today, ELLKAY remains true to our core values, building strong partner relationships and offering unparalleled service and support while providing innovative, scalable solutions to the challenges our customers face in today’s data-rich world.

ELLKAY's experience, customer-focused approach, and reputation for innovation, speed, and accuracy set us apart as a premier partner for your interoperability needs and data management strategy.

Job Description

We are looking for an Application Site Reliability Engineer (SRE) with strong DevOps experience to improve the reliability, scalability, and performance of our applications.

The Application Site Reliability Engineer will serve as a technical contact responsible for driving the reliability, performance, and operational maturity of our application ecosystem. This role works across multiple teams to support scalable systems, establish reliability standards, improve observability, and implement automation that reduces operational effort. The SRE will lead complex incident responses, guide engineering teams on best practices, and influence architectural decisions to ensure resilient, high-quality software delivery.

You will help define reliability standards, reduce operational toil, and ensure smooth production operations while enabling faster and safer releases.

Essential Duties & Responsibilities

Own application reliability, availability, performance, and scalability in production and non-production environments
Design, build, and maintain CI/CD pipelines for application deployments
Automate infrastructure provisioning and configuration using Infrastructure as Code
Monitor application health using metrics, logs, and traces; define SLIs, SLOs, and error budgets
Lead incident response and root-cause analysis (RCA), ensuring corrective and preventive actions are completed and communicated
Improve system resilience through capacity planning, system tuning, and fault tolerance
Partner with development teams to ensure services meet reliability, performance, and scalability objectives
Reduce manual operational effort through automation and self-healing solutions
Serve as a point of contact for critical Sev1/Sev2 incidents, leading incident command when required

Qualifications

Strong experience as an SRE, DevOps Engineer, or Production Support Engineer
Solid understanding of Windows and Linux/Unix systems and of networking fundamentals
7 years of experience as an SRE
Hands-on experience with cloud platforms such as AWS, Azure, or GCP
Experience with containerization and orchestration tools like Docker and Kubernetes
Proficiency in CI/CD tools such as Jenkins, GitHub Actions, or similar
Experience with Infrastructure as Code tools like Terraform, CloudFormation, or ARM
Strong scripting skills in Python, Bash, or similar languages
Experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.)
Understanding of reliability concepts such as SLAs, SLOs, and incident management

Preferred Qualifications

Experience supporting microservices-based architectures
Knowledge of security best practices in cloud and DevOps environments
Experience with configuration management tools (Ansible, Chef, or Puppet)
Exposure to chaos engineering or resilience testing practices

Soft Skills

Strong problem-solving and troubleshooting skills
Ability to work calmly during incidents and high-pressure situations
Clear communication and collaboration with cross-functional teams
Ownership mindset with a focus on continuous improvement

What We Offer

Opportunity to work on highly available, business-critical applications
Collaborative engineering culture with strong DevOps and SRE practices
Competitive compensation and benefits
Learning and growth opportunities in cloud, automation, and reliability engineering

Benefits

ELLKAY offers a comprehensive and competitive benefits package that starts on day one!

Including

Medical, Dental, and Vision benefits
Employer-paid Life and LTD
401(k) with matching, once eligibility is met
Work/life balance
Paid Volunteer Program
Flexible working hours
Generous FTO
Remote work options
Employee Discounts
Parental Leave

Our Awesome Culture Includes

Working with talented, collaborative, and friendly people who love what they do
Professional growth within
Innovative environment
Free daily lunches on site at HQ

Additional Information

At ELLKAY, we are committed t

About the Company

ELLKAY is a trusted enterprise data management partner, driving innovation and connectivity across the healthcare ecosystem. Since 2002, ELLKAY has empowered hospitals, laboratories, payers, healthcare IT vendors, and more with unmatched data management expertise. With connections to over 58,000 practices and interoperability with 750+ EHR/PM systems across 1,100+ versions, ELLKAY delivers solutions that streamline data exchange, fuel value-based care, and drive smarter decision making. Join us virtually August 5-6 https://...