Job details

Staff Site Reliability Engineer

The Wikimedia Foundation is seeking a Staff Site Reliability Engineer focused on Machine Learning Infrastructure to enhance our ML systems in collaboration with diverse teams around the world.

Skills

7+ years of SRE or related experience
Expertise in on-premises ML infrastructure
Proficiency with automation and configuration management tools
Experience with monitoring and logging for ML systems
Familiarity with Python-based ML frameworks

Responsibilities

Design and implement robust ML infrastructure for training and deployment
Improve reliability, availability, and scalability of ML infrastructure
Collaborate with ML engineers, product teams, and researchers
Monitor and optimize system performance and security
Provide guidance and documentation for using ML infrastructure
Mentor team members on operational excellence and reliability

Education

Bachelor's degree in Computer Science or related field

Benefits

Competitive salaries aligned with values and culture
Diverse and inclusive workplace
Remote-first organization with flexible work arrangements

Average salary estimate

$165,085.50 / year (est.)

Min: $129,347
Max: $200,824

If an employer mentions a salary or salary range in their job posting, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Staff Site Reliability Engineer, Wikimedia Foundation

The Wikimedia Foundation is looking for a Staff Site Reliability Engineer (SRE) focused on Machine Learning Infrastructure to join our remote team. You will collaborate with colleagues across time zones, from the Eastern Americas to Europe and Africa, and report directly to our Director of Machine Learning, Chris Albon.

In this role, you’ll lead the design, development, maintenance, and scaling of the infrastructure that our Machine Learning Engineers and Researchers rely on to train, deploy, and monitor machine learning models. Your day-to-day work will include designing robust ML infrastructure, improving the reliability and scalability of our systems, and working with cross-functional teams to streamline operational processes. We’re seeking someone who proactively monitors system performance and security and shares what they learn through collaboration and documentation. Mentoring fellow team members is also a significant part of the role.

You’ll bring 7+ years of experience in SRE, DevOps, or infrastructure engineering, along with significant expertise in managing production-grade machine learning systems. If you thrive in an open-source environment and value teamwork within diverse, remote teams, this opportunity at the Wikimedia Foundation may be a strong fit. Join us in making knowledge freely accessible; we believe that together we can contribute to a world where everyone benefits from shared knowledge.

Frequently Asked Questions (FAQs) for Staff Site Reliability Engineer Role at Wikimedia Foundation

What are the responsibilities of a Staff Site Reliability Engineer at the Wikimedia Foundation?

At the Wikimedia Foundation, a Staff Site Reliability Engineer (SRE) is primarily responsible for designing, developing, maintaining, and scaling our ML infrastructure. This includes implementing systems for machine learning model training, deployment, and monitoring; collaborating closely with other teams to improve system reliability and performance; and providing documentation and guidance on best practices for using the ML infrastructure.