This job has expired. Jobs are automatically marked as expired after 180 days of inactivity.
Job details

Software Engineer - Resilience

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—trillions of data points per day—allowing for seamless collaboration and problem-solving among Dev, Ops and Security teams globally for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.

 

The Team:

The Resilience Engineering group at Datadog focuses on improving resilience in our software and staff. We work on defining our on-call tooling and incident response process for the entire company, constantly iterating on it through the lessons we learn from production. We help out during the most complex production incidents - our Resilience Engineers excel in troubleshooting and have a passion for problem solving and efficiency. We also build the Chaos Platform and Tooling so that engineers can use a measured approach to break and test for system resilience and reproduce past bugs/incidents to verify their remediation. Being prepared to deal with unknown failures from both a technical and an organizational standpoint is the core work of Chaos Engineers.

 

Location:

We are a globally distributed team with US Offices in New York (HQ), Boston, and Denver and International Offices in Paris, Dublin, London, Madrid, the Netherlands, and Singapore. About 33% of our engineering team are remote.

Datadog values people from all walks of life. We understand that not everyone will meet these requirements on day one. If you’re passionate about reliability engineering and want to grow these skills but don’t meet all of these qualifications, we encourage you to apply.

 

You Will:

  • You will help review complex issues in production and write postmortems in partnership with other engineering teams.
  • You will get to contribute to the development of our self-service chaos platform, implemented on top of Kubernetes.
  • You will get to help define for the whole company how we respond to incidents and build tooling along the way to streamline that process. You will also help train our on-call staff, preparing newcomers for their on-call responsibilities and refreshing the rest of the staff on what we’ve learnt from past incidents.

 

You Are:

  • You have 5+ years of software or reliability engineering experience.
  • You have an analytical mindset and a willingness to dive into unfamiliar code bases and find obscure bugs.
  • You have architected, built, and operated distributed systems to solve problems at high scale in cloud-based environments.
  • You have been on-call for critical systems and you have experience handling incidents using a formal organizational process.
  • You want to work in a fast-paced, high-growth environment that respects its engineers and customers.

 

Bonus Points:

  • You've worked on chaos or resilience engineering projects before.
  • You’ve been an Incident Commander or have contributed to defining an incident response process.
  • You have Linux/Kubernetes experience.
  • You have experience in cross-team, cross-functional projects.

This is a remote position

 



Equal Opportunity at Datadog:

Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.

 

Your Privacy:

Any information you submit to Datadog as part of your application will be processed in accordance with Datadog’s Applicant and Candidate Privacy Notice.

Datadog (NYSE: DDOG) is a prominent global SaaS provider that uniquely balances growth and profitability. It offers cloud-scale monitoring and security by combining metrics, traces, and logs within one platform.

BADGES
Diversity Champion
Future Maker
Office Vibes
Future Unicorn
Rapid Growth
CULTURE VALUES
Customer-Centric
Rapid Growth
Diversity of Opinions
Reward & Recognition
Friends Outside of Work
Inclusive & Diverse
Empathetic
Feedback Forward
Work/Life Harmony
Casual Dress Code
Startup Mindset
Collaboration over Competition
Fast-Paced
Growth & Learning
Open Door Policy
Rise from Within
BENEFITS & PERKS
Maternity Leave
Paternity Leave
Flex-Friendly
Family Coverage (Insurance)
Medical Insurance
Dental Insurance
Vision Insurance
Mental Health Resources
Life insurance
Disability Insurance
Health Savings Account (HSA)
Flexible Spending Account (FSA)
401K Matching
Paid Holidays
Paid Sick Days
Paid Time-Off
DATE POSTED
January 13, 2022
