Job details

AI Infrastructure Operations Engineer

Cerebras Systems is seeking an experienced AI Infrastructure Operations Engineer to manage and operate large-scale AI compute infrastructure that supports transformative AI applications. The role calls for a background in machine learning or high-performance computing environments.

Skills

  • Linux systems
  • Python scripting
  • Docker and Kubernetes
  • Monitoring systems
  • Strong communication skills

Responsibilities

  • Manage and operate multiple advanced AI compute infrastructure clusters.
  • Monitor and oversee cluster health and resolve potential issues.
  • Maximize compute capacity through optimization and efficient resource allocation.
  • Deploy and configure container-based services using Docker.
  • Provide 24/7 monitoring and support, troubleshooting as needed.
  • Collaborate with cross-functional teams to solve complex technical challenges.

Education

  • Bachelor's degree in Computer Science, Engineering or related field

Benefits

  • Non-corporate work culture
  • Job stability with startup vitality
  • Opportunity to publish cutting-edge research

Average salary estimate

$120,000 / year (est.)
min: $100,000
max: $140,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About AI Infrastructure Operations Engineer, Cerebras Systems

Cerebras Systems is on the lookout for a talented AI Infrastructure Operations Engineer to join our innovative team located in Sunnyvale, CA or Toronto, Canada. As the creators of the world's largest AI chip, we take immense pride in our unrivaled wafer-scale architecture that brings the compute power of dozens of GPUs into a single device. In this exciting role, you will manage and operate advanced AI compute infrastructure clusters, directly impacting the efficiency and performance of our cutting-edge technology. Your expertise will allow you to monitor cluster health and keep our infrastructure in top shape, ultimately maximizing compute capacity to support groundbreaking AI applications. With responsibilities ranging from troubleshooting distributed systems to deploying container-based services using Docker, you will play a vital part in our company’s mission. The ideal candidate brings 6-8 years of relevant experience in complex compute infrastructure, particularly in machine learning or high-performance computing. At Cerebras, we cultivate a collaborative environment where innovation thrives alongside a commitment to customer success. Join us at the forefront of AI advancements and help shape the future of machine learning technology by applying your skills and passion in a fast-paced, supportive, and growth-oriented culture.

Frequently Asked Questions (FAQs) for AI Infrastructure Operations Engineer Role at Cerebras Systems
What are the primary responsibilities of an AI Infrastructure Operations Engineer at Cerebras Systems?

As an AI Infrastructure Operations Engineer at Cerebras Systems, you will be responsible for managing and operating our advanced AI compute infrastructure clusters. This involves monitoring cluster health, ensuring performance, maximizing compute capacity, and resolving technical challenges. You will also deploy and debug container-based services using Docker, provide around-the-clock support, and contribute to process improvement related to monitoring and support.

What qualifications are necessary for an AI Infrastructure Operations Engineer at Cerebras Systems?

To be considered for the AI Infrastructure Operations Engineer position at Cerebras Systems, you typically need 6-8 years of experience in managing complex compute infrastructure. Key qualifications include strong Linux proficiency, containerization experience (especially with Docker and Kubernetes), and Python scripting skills. Expertise in troubleshooting distributed systems and a solid understanding of monitoring tools is essential as well.

How does the work environment at Cerebras Systems support AI Infrastructure Operations Engineers?

Cerebras Systems promotes a dynamic and non-corporate work culture that encourages individual beliefs and ideas. As an AI Infrastructure Operations Engineer, you will find a supportive team focused on continuous learning and problem-solving. The collaborative environment at Cerebras allows you to thrive while tackling complex challenges in a rapidly evolving industry, ensuring that your contributions to AI infrastructure development are recognized and valued.

What tools and technologies will I use as an AI Infrastructure Operations Engineer at Cerebras?

In the role of AI Infrastructure Operations Engineer at Cerebras Systems, you will utilize various tools and technologies such as Docker for container deployment, Kubernetes for orchestration, monitoring systems for infrastructure health, and Python for automation tasks. Familiarity with cloud platforms like AWS, GCP, or Azure, along with networking technologies like TCP/IP, is also beneficial.

What opportunities for growth and innovation exist for AI Infrastructure Operations Engineers at Cerebras?

At Cerebras Systems, AI Infrastructure Operations Engineers are encouraged to innovate and contribute to all aspects of AI advancements. With constant exposure to cutting-edge technologies and a supportive learning environment, you’ll have the opportunity to develop your skills, tackle unprecedented challenges in AI compute infrastructure, and make significant contributions to projects at the forefront of the industry, reinforcing your role as a key driver of technological progress.

Common Interview Questions for AI Infrastructure Operations Engineer
Can you explain your experience managing complex compute infrastructures in previous roles?

When answering this question, provide clear examples of your previous experience managing compute infrastructures. Highlight any specific projects that involved large-scale AI systems and detail the challenges you faced and how you overcame them, emphasizing your proactive problem-solving approach.

How do you ensure the reliability and performance of machine learning compute clusters?

To answer this question effectively, discuss the monitoring tools you utilize and the specific metrics you track to ensure reliable performance. Explain how you identify and respond to potential issues before they impact operations, illustrating your proactive maintenance strategies.

Describe your experience with containerization technologies like Docker.

Highlight your experiences deploying applications within Docker containers. Provide examples of how you have used Docker for resource efficiency and ease of deployment, including any relevant orchestration experience with tools like Kubernetes.

Can you detail a time when you resolved a complex technical issue in your infrastructure management?

Share a compelling story where you encountered a significant technical issue, the steps you took to troubleshoot it, and the outcome. Emphasize your analytical problem-solving skills and your ability to collaborate with different teams to find a resolution.

What do you find most exciting about working with AI technologies?

Your answer should reflect your passion for AI and your belief in its potential to transform industries. Discuss any specific aspects of AI technology that inspire you, such as breakthroughs in machine learning or opportunities to innovate with powerful hardware solutions like those at Cerebras.

How would you prioritize multiple tasks in a high-pressure environment?

Describe your organizational strategies and time management techniques. Explain how you assess the urgency and impact of tasks, ensuring critical systems remain operational while still accomplishing project-driven goals.

What strategies do you use to keep up with the latest developments in AI infrastructure technology?

Discuss your commitment to continuous learning through online courses, webinars, and industry publications. Mention how you apply this knowledge to drive innovations within your work environment, showcasing your dedication to staying current in the field.

How do you foster collaboration when working with cross-functional teams?

Convey your communication style and the steps you take to ensure all team members are aligned on objectives and challenges. Highlight any tools or practices you use to promote transparency and cooperation among team members.

Explain how you would optimize compute resources in an AI infrastructure.

Provide a systematic approach to optimizing compute resources, such as load balancing, scaling strategies, and efficient allocation of resources based on workload demands. Illustrate your methodology with examples from your past experiences.
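
One way to make the load-balancing part of such an answer concrete is a small greedy placement sketch. The example below is illustrative only: the function name `assign_jobs`, the job names, and the demand numbers are hypothetical and do not come from the job posting, and the sketch ignores per-node capacity limits that a real scheduler would enforce.

```python
import heapq


def assign_jobs(job_demands: dict[str, int], node_names: list[str]) -> dict[str, list[str]]:
    """Greedy load balancing: place the largest jobs first,
    each on the node with the smallest total demand so far."""
    # Min-heap of (current_load, node_name); every node starts empty.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    placement: dict[str, list[str]] = {name: [] for name in node_names}

    # Sort by descending demand; ties are broken by job name for determinism.
    for job, demand in sorted(job_demands.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        placement[node].append(job)
        heapq.heappush(heap, (load + demand, node))
    return placement


# Hypothetical workload: demand is in arbitrary units (for example, accelerator-hours).
jobs = {"train": 8, "eval": 3, "etl": 5, "infer": 2}
print(assign_jobs(jobs, ["n1", "n2"]))
```

With this input both nodes end up with similar total demand (10 and 8 units). In an interview answer, a sketch like this is a starting point for discussing what it leaves out, such as capacity limits, priorities, and scaling the node pool up or down.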

What do you consider key metrics for monitoring the health of AI compute clusters?

Identify key metrics such as CPU and memory usage, network traffic, storage performance, and job completion times. Discuss how you would interpret these metrics to diagnose performance issues and ensure optimal functioning of the clusters.
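
To show how such metrics might be interpreted, here is a minimal threshold-check sketch. All metric names, threshold values, and node names below are assumptions chosen for illustration; they are not taken from the job posting or from any particular monitoring system, and real limits depend on the cluster.

```python
# Hypothetical alert thresholds; tune these for the actual hardware and workload.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "network_mbps": 9000.0,
    "disk_latency_ms": 20.0,
    "job_runtime_ratio": 1.5,  # observed job runtime / expected runtime
}


def find_breaches(sample: dict[str, float],
                  thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the sorted names of metrics in `sample` that exceed their threshold.
    Metrics missing from the sample are treated as healthy."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


def unhealthy_nodes(samples: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """Map each node with at least one breached metric to its breached metric names."""
    return {node: breaches
            for node, sample in samples.items()
            if (breaches := find_breaches(sample))}


# Hypothetical readings from two nodes.
readings = {
    "node-a": {"cpu_percent": 95.0, "memory_percent": 40.0, "job_runtime_ratio": 2.0},
    "node-b": {"cpu_percent": 10.0, "memory_percent": 30.0},
}
print(unhealthy_nodes(readings))
```

Static thresholds are the simplest option; a fuller answer could mention trends over time or comparing a node against its peers, since a single reading rarely tells the whole story.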

Salary range: $100,000/yr - $140,000/yr
Employment type: Full-time, on-site
Date posted: March 18, 2025