
SRE Manager

Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

As the SRE Manager, you will lead the creation and operation of a 24/7 Site Reliability Engineering team. Your primary goal is to ensure continuous availability and optimal performance of our cloud infrastructure, providing customers with uninterrupted access to their GPUs. You will design and implement advanced alerting and monitoring systems, manage incident response, and drive system improvements. Collaborating with remote teams across time zones, you will prioritize projects and streamline workflows to achieve rapid results. This role offers the opportunity to significantly impact the reliability of our cutting-edge cloud services and drive the success of our team.

A Day in the Life:

As a Site Reliability Engineering Manager at Crusoe Energy Systems, your day is a blend of people management and operational oversight. Your morning starts with one-on-one meetings and team stand-ups, focusing on guidance, support, and aligning daily goals. You'll spend about 40% of your time on team development, strategic planning, and fostering a collaborative environment.

The remaining 60% is dedicated to operational tasks, such as reviewing performance metrics, overseeing incident responses, and driving automation projects. You keep SLIs healthy and SLOs met while resolving technical issues and optimizing processes. By day's end, you review project progress and plan the next steps, maintaining a high-performing, customer-centric SRE organization.

You Will Thrive In This Role If:

  • You have at least 3 years of experience with building and managing a 24/7 technical support team in a cloud operations environment.

  • You have a strong background in Linux, containerization technologies, and Kubernetes. You understand virtualization and cloud computing concepts.

  • You have worked with Prometheus, VictoriaMetrics, and exporters against bare-metal endpoints.

  • You have some experience with infrastructure as it relates to data center operations.

  • You’re interested in playing a key role in talent acquisition and retention. This includes diligent performance management and coaching/developing your team according to their individual needs.

  • You’ve developed training programs for new hires and ongoing professional development opportunities for your team members.

  • You like the idea of serving as a technical escalation point and ensuring the highest quality of support. You have experience with implementing quality assurance measures.

  • You have supported, monitored, and handled Service Level Agreements (SLAs) for a variety of categories that enable an end customer.

  • You have used technologies such as RabbitMQ, Kafka, Temporal, and NATS.

  • You can produce solid solutions in Golang or Python.

  • You’re strategic about tracking and reporting KPIs, with a focus on team performance and customer satisfaction. You’ve played a big part in the strategic planning for a team’s growth and scalability.

  • You like the idea of working with other departments to align on technical escalations, live incidents, customer needs, and feedback.

  • Leadership & Communication: Demonstrated leadership ability and excellent communication skills.

  • Problem-Solving & Adaptability: Robust problem-solving skills and adaptability in a fast-paced environment.

  • Project Management: Experience with project management tools and methodologies.

  • You embody the company values.

Benefits: 

  • Hybrid work schedule

  • Competitive Paid Time Off

  • Industry competitive pay

  • Retirement benefits

  • Healthcare benefits including Medical, Dental, and Vision

  • Short and Long-Term Disability Insurance

  • Life Insurance

  • Paid Parental Leave

  • Subscription to Calm App

Compensation Range

Compensation will be paid as salary. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Crusoe Glassdoor company review: 3.4 out of 5 stars
CEO of Crusoe: Chase Lochmiller

Average salary estimate

$140,000 / yearly (est.)
Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About SRE Manager, Crusoe

At Crusoe, an innovative leader in AI-first cloud infrastructure, we're on a quest to build the World's Favorite cloud solutions, and we need a talented SRE Manager in San Francisco to join our dynamic team. You will play a pivotal role in leading a dedicated Site Reliability Engineering team that operates 24/7, ensuring our clients enjoy seamless access to our cutting-edge AI-driven services. Your mission is to create and maintain high-quality, reliable cloud infrastructure that powers the most advanced AI applications, trusted by Fortune 500 companies. In this role, you'll dive into designing top-notch monitoring and alerting systems while managing incident responses. You're not just keeping the lights on; you're driving innovation and making a tangible impact while collaborating with cross-functional teams. You will spend your days balancing team management and operational oversight, nurturing your team's growth while maintaining our SLIs and SLOs. Your technical expertise in Linux, Kubernetes, and cloud concepts will shine as you create strategic plans that foster collaboration across departments. Plus, with Crusoe's commitment to sustainable technology, you’ll be part of a mission that harmonizes the future of computing with the climate. If you're passionate about building and leading high-performing teams, and you relish the challenge of transforming cloud infrastructure, this is your chance to thrive and be part of the AI revolution at Crusoe!

Frequently Asked Questions (FAQs) for SRE Manager Role at Crusoe
What are the main responsibilities of the SRE Manager at Crusoe?

The SRE Manager at Crusoe is responsible for leading a 24/7 Site Reliability Engineering team, ensuring continuous availability and performance of cloud infrastructure. This includes designing monitoring systems, managing incident responses, and driving system improvements to meet customer needs.

What qualifications are required for the SRE Manager role at Crusoe?

To qualify for the SRE Manager position at Crusoe, candidates should have at least three years of experience managing a technical support team in a cloud operations environment, along with a solid background in Linux, containerization technologies, Kubernetes, and experience with monitoring tools such as Prometheus.

How does the SRE Manager contribute to team development at Crusoe?

The SRE Manager plays a crucial role in team development by providing mentorship, creating training programs for new hires, and offering ongoing professional development opportunities. They also manage performance to align with individual team member needs.

What technologies should an SRE Manager be familiar with for Crusoe?

An SRE Manager at Crusoe should be familiar with technologies such as RabbitMQ and Kafka, have some experience with infrastructure as it relates to data center operations, and be capable of producing effective solutions in Golang or Python.

What is the team culture like for the SRE Manager role at Crusoe?

The culture for the SRE Manager position at Crusoe is collaborative and innovative, focusing on maintaining high performance and customer satisfaction. Regular team stand-ups, strategic planning, and open communication are key components of creating a supportive environment.

Common Interview Questions for SRE Manager
Can you describe your experience with incident management as an SRE Manager?

Emphasize your hands-on experience managing incidents, discussing specific challenges you faced, how you coordinated the response, and the ultimate outcomes. Share examples of improving processes to prevent similar situations in the future.

How do you ensure high SLIs and SLOs in your teams?

Discuss your strategic approach to monitoring and measuring service performance. Explain how you engage your team in setting realistic goals and how you track progress to consistently meet or exceed those expectations.

What strategies do you use for team coaching and development?

Highlight your methods for assessing team members' skills, creating personalized development plans, and fostering growth through training, mentorship, and performance feedback that aligns with the company's goals.

Describe a challenge you faced in a previous SRE role and how you overcame it.

Provide a detailed narrative of a specific challenge, outlining the steps you took to address the issue—including collaboration with other departments—and discuss the successful resolution and lessons learned.

How do you approach cross-functional team collaboration?

Share your experiences working with various departments, detailing how you build relationships, communicate effectively, and align technical actions to meet customer needs while ensuring clarity and teamwork.

What tools do you use for monitoring and alerting in your SRE operations?

Mention the specific tools you have experience with, such as Prometheus or VictoriaMetrics, and explain how you've implemented them to enhance system performance and respond efficiently to alerts.

How do you handle performance under pressure?

Discuss your strategies for staying calm in high-pressure situations, managing priorities, and leveraging your team's strengths to resolve issues quickly and efficiently.

What is your approach to quality assurance in SRE?

Explain your process for implementing quality assurance measures within your team, including establishing best practices, reviewing performance metrics, and ensuring that your team's outputs consistently meet quality standards.

Can you elaborate on your experience with cloud infrastructure?

Share your background in cloud infrastructure by discussing specific projects you've led, technologies you've used, and how you’ve contributed to improving systems and customer experiences within that domain.

How do you track and report KPIs for your SRE team?

Discuss the KPIs you focus on, the tools you use for tracking them, and how you report progress to stakeholders, emphasizing the importance of this data in driving your team's performance and improving customer satisfaction.


We’re on a mission to align the future of computation with the future of the climate.

Employment type: Full-time, hybrid
Date posted: December 7, 2024
