Job Description
Our company specializes in the development of animal health management solutions. We are a multidisciplinary product company, a diverse team of ~450 closely collaborating scientists, AI experts, and software, hardware, and mechanical engineers, working alongside veterinarians and other animal experts. Our passion? Shaping the future of animal health and well-being (for the better!).
Our products and platforms identify trends and predict the likelihood of health outcomes for HUNDREDS of MILLIONS of animals each year, from pets to poultry, farm animals, and even fish. We provide actionable insights for veterinarians, farmers, and producers, changing the way people care for animals in 150 markets.
So, if you’re looking to work in a company that combines pioneering science and technology, dedicated colleagues, and animals, you’ll find it all here – come join us!
We are looking for an exceptional Senior Site Reliability Engineer (SRE) to help establish and lead the technical practices of SRE within our CloudOps team. This is a hands-on role for an experienced professional who can implement SRE principles, build frameworks and tools to ensure system reliability, and mentor others in adopting these practices.
If you are passionate about operational excellence, love solving complex technical challenges, and thrive in highly collaborative environments, this is the role for you.
What You’ll Do:
Define and Build the SRE Function
· Help define and implement SRE principles and practices.
· Partner with development and DevOps teams to create Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.
· Advocate for and implement system architectures that prioritize reliability, scalability, and fault tolerance.
Develop Automation and Resilience
· Build automation tools to reduce toil, streamline operations, and improve reliability using Infrastructure as Code (IaC) tools like Terraform and Crossplane.
· Implement self-healing systems, automate incident detection and response, and integrate chaos engineering practices to test system resilience.
Drive Observability and Monitoring Excellence
· Create and maintain advanced observability systems with tools like Datadog, Prometheus, and Grafana to ensure uptime and system health.
· Develop efficient alerting and monitoring strategies, including synthetic tests and automated anomaly detection.
· Strong, proven experience with AWS services and with IaC using Terraform.
· Analyze system logs and telemetry data to detect patterns, identify issues, and optimize system performance.
Incident Response and Problem Solving
· Take ownership of incident response processes, ensuring swift recovery of services and conducting thorough Root Cause Analysis (RCA) for long-term improvements.
· Document incident learnings and collaborate with teams to enhance on-call processes and system documentation.
Contribute to Continuous Improvement
· Improve deployment pipelines (CI/CD) using tools like GitHub Actions, Azure DevOps, or Argo CD, ensuring smooth and reliable releases.
· Continuously evaluate and refine operational processes to reduce manual effort and increase efficiency.
Requirements:
Technical Expert
· 5+ years of hands-on experience in Site Reliability Engineering.
· Proven expertise in AWS services, with experience working with distributed, event-driven architectures and microservices.
· Experience with GitOps workflows and tools.
· Advanced skills in automation tools like Terraform and proficiency in scripting or programming languages (e.g., Python, Go, Bash).
Problem Solver and Collaborator
· Exceptional problem-solving skills and a proactive approach to identifying and addressing technical challenges.
· Effective communicator and collaborator with the ability to work across teams to deliver operational excellence.
· Strong analytical skills, especially in troubleshooting and optimizing complex systems.
Preferred
· Familiarity with chaos engineering tools like Gremlin or LitmusChaos.
Search Firm Representatives Please Read Carefully
Merck & Co., Inc., Rahway, NJ, USA, also known as Merck Sharp & Dohme LLC, Rahway, NJ, USA, does not accept unsolicited assistance from search firms for employment opportunities. All CVs / resumes submitted by search firms to any employee at our company without a valid written search agreement in place for this position will be deemed the sole property of our company. No fee will be paid in the event a candidate is hired by our company as a result of an agency referral where no pre-existing agreement is in place. Where agency agreements are in place, introductions are position specific. Please, no phone calls or emails.
Employee Status: Regular
Flexible Work Arrangements: Hybrid
Required Skills:
Artificial Intelligence (AI), Automation, Automation Solutions, Availability Management, Capacity Management, Change Controls, Design Applications, High Performance Computing (HPC), Incident Management, Information Management, Information Technology (IT) Infrastructure, Infrastructure As Code (IaC), IT Service Management (ITSM), Microsoft Azure DevOps, Operational Excellence, Release Management, Reliability Engineering, SLA Management, Software Development, Software Development Life Cycle (SDLC), Solution Architecture, System Administration, System Designs, Systems Architecture (+ 4 more)
Job Posting End Date: 05/25/2025
*A job posting is effective until 11:59:59 PM on the day BEFORE the listed job posting end date. Please ensure you apply to a job posting no later than the day BEFORE the job posting end date.
Our purpose: We use the power of leading-edge science to save and improve lives around the world