
Site Reliability Engineer (SRE)


Job Description

Our company specializes in the development of animal health management solutions. We are a multidisciplinary product company, a diverse team of ~450 closely collaborating scientists, AI experts, software, hardware, and mechanical engineers… working alongside veterinarians and other animal experts. Our passion? Shaping the future of animal health and well-being (for the better!).

Our products and platforms identify trends and predict the likelihood of health outcomes for HUNDREDS of MILLIONS of animals each year, from pets to poultry, farm animals, and even fish. We provide actionable insights for veterinarians, farmers, and producers, changing the way people care for animals in 150 markets.

So, if you’re looking to work in a company that combines pioneering science and technology, dedicated colleagues, and animals, you’ll find it all here – come join us!

We are looking for an exceptional Senior Site Reliability Engineer (SRE) to help establish and lead the technical practices of SRE within our CloudOps team. This is a hands-on role for an experienced professional who can implement SRE principles, build frameworks and tools to ensure system reliability, and mentor others in adopting these practices.

If you are passionate about operational excellence, love solving complex technical challenges, and thrive in highly collaborative environments, this is the role for you.

What You’ll Do:

Define and Build the SRE Function

· Help define and implement SRE principles and practices.

· Partner with development and DevOps teams to create Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.

· Advocate for and implement system architectures that prioritize reliability, scalability, and fault tolerance.

Develop Automation and Resilience

· Build automation tools to reduce toil, streamline operations, and improve reliability using Infrastructure as Code (IaC) tools like Terraform and Crossplane.

· Implement self-healing systems, automate incident detection and response, and integrate chaos engineering practices to test system resilience.

Drive Observability and Monitoring Excellence

· Create and maintain advanced observability systems with tools like Datadog, Prometheus, and Grafana to ensure uptime and system health.

· Develop efficient alerting and monitoring strategies, including synthetic tests and automated anomaly detection.

· Apply strong, proven experience with AWS services and with IaC using Terraform.

· Analyze system logs and telemetry data to detect patterns, identify issues, and optimize system performance.

Incident Response and Problem Solving

· Take ownership of incident response processes, ensuring swift recovery of services and conducting thorough Root Cause Analysis (RCA) for long-term improvements.

· Document incident learnings and collaborate with teams to enhance on-call processes and system documentation.

Contribute to Continuous Improvement

· Improve deployment pipelines (CI/CD) using tools like GitHub Actions, Azure DevOps, or ArgoCD, ensuring smooth and reliable releases.

· Continuously evaluate and refine operational processes to reduce manual effort and increase efficiency.

Requirements:

Technical Expert

· 5+ years of hands-on experience in Site Reliability Engineering.

· Proven expertise in AWS services, with experience working with distributed, event-driven architectures and microservices.

· Experience with GitOps workflows and tools.

· Advanced skills in automation tools like Terraform and proficiency in scripting or programming languages (e.g., Python, Go, Bash).

Problem Solver and Collaborator

· Exceptional problem-solving skills and a proactive approach to identifying and addressing technical challenges.

· Effective communicator and collaborator with the ability to work across teams to deliver operational excellence.

· Strong analytical skills, especially in troubleshooting and optimizing complex systems.

Preferred

· Familiarity with chaos engineering tools like Gremlin or LitmusChaos.

MDAHTL

Current Employees apply HERE

Current Contingent Workers apply HERE

Search Firm Representatives Please Read Carefully
Merck & Co., Inc., Rahway, NJ, USA, also known as Merck Sharp & Dohme LLC, Rahway, NJ, USA, does not accept unsolicited assistance from search firms for employment opportunities. All CVs / resumes submitted by search firms to any employee at our company without a valid written search agreement in place for this position will be deemed the sole property of our company. No fee will be paid in the event a candidate is hired by our company as a result of an agency referral where no pre-existing agreement is in place. Where agency agreements are in place, introductions are position specific. Please, no phone calls or emails.

Employee Status: Regular

Relocation:

VISA Sponsorship:

Travel Requirements:

Flexible Work Arrangements: Hybrid

Shift:

Valid Driving License:

Hazardous Material(s):

Required Skills:

Artificial Intelligence (AI), Automation, Automation Solutions, Availability Management, Capacity Management, Change Controls, Design Applications, High Performance Computing (HPC), Incident Management, Information Management, Information Technology (IT) Infrastructure, Infrastructure As Code (IaC), IT Service Management (ITSM), Microsoft Azure DevOps, Operational Excellence, Release Management, Reliability Engineering, SLA Management, Software Development, Software Development Life Cycle (SDLC), Solution Architecture, System Administration, System Designs, Systems Architecture {+ 4 more}

Preferred Skills:

Job Posting End Date: 05/25/2025

*A job posting is effective until 11:59:59PM on the day BEFORE the listed job posting end date. Please ensure you apply to a job posting no later than the day BEFORE the job posting end date.

