
Senior Site Reliability Engineer

About Us

We believe AI will fundamentally transform how people live and work. CentML's mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.


Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.


About the Position

As a Senior Site Reliability Engineer, you will play a pivotal role in shaping the infrastructure and reliability practices at CentML. You will be responsible for leading complex projects, mentoring other SREs, and collaborating with cross-functional teams to ensure our systems meet the highest standards of reliability, performance, and security. This is a senior-level position, ideal for individuals with deep technical expertise and leadership experience in SRE.



What you’ll do:


Leadership & Strategy:

- Design, implement, and operate highly reliable, scalable, and secure ML infrastructure.

- Develop and drive SRE best practices across the organization, setting the standards for operational excellence.


Technical Excellence:

- Architect and build large-scale, distributed systems that support complex ML workloads, ensuring high availability and fault tolerance.

- Lead efforts in automation, configuration management, and infrastructure-as-code, minimizing manual operations and ensuring consistency.

- Optimize the performance and scalability of our systems, identifying and addressing bottlenecks before they impact users.


Incident Management & Response:

- Lead incident response efforts, including real-time troubleshooting, root cause analysis, and postmortem reviews.

- Develop and maintain comprehensive monitoring, alerting, and logging systems that provide deep visibility into system health and performance.


Continuous Improvement & Innovation:

- Drive continuous improvement in system reliability, performance, and scalability through the adoption of new technologies, tools, and methodologies.

- Stay current with industry trends and innovations in SRE and ML infrastructure, bringing new ideas and approaches to the team.



What you’ll need to be successful
  • 5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with significant experience in leading and managing large-scale infrastructure projects.
  • Proven track record of building and operating highly reliable, scalable, and secure systems in a production environment.
  • Deep expertise in cloud platforms (e.g., AWS, GCP, Azure), containerization (e.g., Docker, Kubernetes), and infrastructure-as-code (e.g., Terraform, Ansible).
  • Advanced proficiency in scripting and automation using languages such as Python, Bash, or similar.
  • Strong understanding of distributed systems, networking, and storage solutions, with the ability to architect complex systems from the ground up.
  • Demonstrated experience in leading technical teams and projects, with the ability to mentor and develop other engineers.
  • Excellent problem-solving skills, with a proactive approach to identifying and resolving issues before they impact the business.
  • Strong communication and collaboration skills, with the ability to work effectively across different teams and stakeholders.
  • Ability to operate effectively in a fast-paced, dynamic startup environment, with a focus on delivering results.


Benefits & Perks

- An open and inclusive work environment

- Employee stock options

- Best-in-class medical and dental benefits

- Parental Leave top-up

- Professional development budget

- Flexible vacation time to promote a healthy work-life blend


We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation.


CentML strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees.


Inclusiveness is core to our culture at CentML, and we strive to ensure you get the most from your interview experience. CentML makes reasonable accommodations for applicants with disabilities. If a reasonable accommodation is needed to participate in the job application or interview process, please reach out to the Talent team.


Average salary estimate

$150,000 per year (est.), with an estimated range of $120,000 to $180,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Senior Site Reliability Engineer, CentML

Join CentML as a Senior Site Reliability Engineer, where you’ll be at the forefront of transforming how we build and deploy machine learning models. With a mission to democratize AI, CentML is focused on reducing the complexities and costs associated with ML processes. In this pivotal role, you’ll design and implement robust, scalable infrastructures, ensuring that our systems uphold the highest standards of reliability and security. You’ll lead complex projects and foster a culture of excellence by implementing SRE best practices across the organization. Your expertise in cloud platforms like AWS, GCP, or Azure, along with deep knowledge in containerization technologies such as Docker and Kubernetes, will be essential for optimizing performance and troubleshooting incidents. You’ll collaborate with cross-functional teams, mentor fellow engineers, and drive continuous improvement initiatives, all while staying on top of the latest innovations within the SRE field. If you’re passionate about creating seamless ML experiences and have a knack for problem-solving, we're excited to see how you can contribute to our team!

Frequently Asked Questions (FAQs) for Senior Site Reliability Engineer Role at CentML
What qualifications do I need to apply for the Senior Site Reliability Engineer position at CentML?

To apply for the Senior Site Reliability Engineer role at CentML, you’ll need at least 5 years of experience in Site Reliability Engineering or related fields. We’re looking for candidates with a proven track record in managing large-scale infrastructure projects and a deep understanding of cloud platforms like AWS, GCP, or Azure, alongside expertise in container technologies such as Docker and Kubernetes.

What are the key responsibilities of a Senior Site Reliability Engineer at CentML?

As a Senior Site Reliability Engineer at CentML, your key responsibilities include designing and implementing scalable and secure ML infrastructures, leading incident response efforts, and developing monitoring systems to ensure system health. You will also mentor other engineers and work collaboratively across teams to establish best practices in SRE.

What technical skills are essential for the Senior Site Reliability Engineer role at CentML?

Essential technical skills for the Senior Site Reliability Engineer role at CentML include advanced proficiency in scripting using languages like Python and Bash, expertise in infrastructure-as-code tools such as Terraform and Ansible, and a strong understanding of distributed systems and networking. Familiarity with automation tools and a proactive approach to problem-solving are also crucial.

How does CentML support the professional development of its Senior Site Reliability Engineers?

CentML supports the professional development of its Senior Site Reliability Engineers through a dedicated budget for learning and growth. This means you can attend workshops, conferences, or training sessions to expand your skills and stay up-to-date with the latest industry developments in SRE and machine learning infrastructure.

Can you describe the work culture for a Senior Site Reliability Engineer at CentML?

The work culture for a Senior Site Reliability Engineer at CentML is open, inclusive, and focused on innovation. We highly value diversity and encourage collaboration across different teams. With flexible vacation time and a strong commitment to work-life balance, we aim to create an environment where everyone can thrive and contribute their best work.

Common Interview Questions for Senior Site Reliability Engineer
What excites you about the Senior Site Reliability Engineer position at CentML?

When answering this question, focus on your passion for machine learning and infrastructure reliability. Discuss how CentML's mission resonates with your career aspirations and how you are eager to contribute to shaping innovative solutions that democratize AI.

Can you explain your experience with incident management in a production environment?

Highlight a specific incident you managed, describing your role in the troubleshooting process. Emphasize your approach to root cause analysis and how you ensured that similar issues would be avoided in the future, focusing on continuous improvement initiatives.

How do you approach designing scalable systems for complex workloads?

Explain your architectural design processes, emphasizing your understanding of both horizontal and vertical scaling concepts, and mention the technologies you use to achieve scalability. Providing a real-world example can demonstrate your hands-on experience and thought process.

What tools do you use for monitoring system performance?

Discuss the specific tools you are familiar with, such as Prometheus, Grafana, or ELK Stack, and elaborate on how you utilize these tools to gain insights into system performance, detect issues, and ensure high availability and reliability.

Can you describe your experience with automation and configuration management?

Detail the automation tools you have employed, like Terraform or Ansible, and describe a project where automation significantly reduced manual oversight, increased deployment speed, or minimized errors, showcasing your proactive approach to infrastructure.

How do you prioritize tasks when working on multiple projects?

Share your strategies for time management and prioritization, such as using agile methodologies or task management tools. Emphasize the importance of clear communication with your team and stakeholders to align priorities with overall business goals.

What has been your experience leading technical teams and mentoring others?

Reflect on specific instances where you took the lead on a project or mentored junior engineers. Discuss your approach to fostering a collaborative environment and encouraging skill development among your team members.

How do you stay updated with the latest trends in Site Reliability Engineering?

Mention your commitment to continuous learning through industry publications, attending webinars, following thought leaders on social media, or participating in professional organizations. Express how this knowledge can positively influence your role at CentML.

What strategies do you implement to enhance system reliability and security?

Talk about your experience in performing regular security assessments and your use of best practices such as implementing least privilege access and using automated security scanning tools. Discuss how you proactively identify vulnerabilities.

Why do you believe communication is important for a Site Reliability Engineer?

Illustrate how clear communication fosters collaboration among technical teams and helps bridge gaps between engineering and operations. Highlight examples where effective communication improved project outcomes and incident responses.

Employment type: Full-time, remote
Date posted: April 11, 2025
