Staff Site Reliability Engineer

Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a pivotal role in ensuring the reliability and performance of our infrastructure. SRE at Crusoe is dedicated to detecting, analyzing, and preventing issues to maintain high Service Level Agreement (SLA) compliance through Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through automation and proactive remediation, our SREs not only resolve common errors automatically but also advise engineering teams on building resilient code. We prioritize anticipating and resolving issues before they impact our customers, conducting thorough post-mortems, and driving continuous improvement. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. Join us to help build and maintain the robust systems that power Crusoe's innovative solutions.

A Day in the Life:

As a Site Reliability Engineer at Crusoe Energy Systems, your day begins with a review of overnight alerts and system performance metrics to ensure everything is running smoothly. You will collaborate with your team in a morning stand-up meeting to discuss ongoing projects, recent incidents, and priorities for the day. Your tasks might include automating routine processes, analyzing system logs, and developing tools to enhance our monitoring capabilities. You'll spend part of your day working closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment. Regularly, you will engage in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Throughout the day, you will stay focused on maintaining high SLIs and SLOs, ensuring that our infrastructure remains robust and reliable for our customers. By day's end, you will document your work, share insights with your team, and plan for the next day's challenges, always with a customer-centric mindset.


You Will Thrive In This Role If:

  • 8+ years of professional SRE experience

  • 8+ years of experience contributing to architecture and design (architecture, design patterns, reliability and scaling) of new and current systems

  • Bachelor's Degree in Computer Science or related field, or 10+ years relevant work experience

  • Solid understanding of infrastructure design, including the operational trade-offs of various designs

  • Experience writing high quality code with at least one programming language (Python, Go, or similar)

  • Experience building with modern infrastructure tools such as Docker, Kubernetes, Ansible, CloudFormation, Terraform

  • Experience building with modern CI/CD practices and build systems, such as GitLab CI/CD, CircleCI, GitHub Actions

  • Experience with logging, monitoring and alerting systems and tools

  • Experience with Unix/Linux environments

  • Experience with TCP/IP and network programming

  • Experience with information security best practices

  • Excellent communication skills

  • Must be able to pass a background check

  • Embody the Company values

Benefits:

  • Hybrid work schedule

  • Industry competitive pay

  • Restricted Stock Units in a fast growing, well-funded technology company

  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

  • Employer contributions to HSA accounts 

  • Paid Parental Leave 

  • Paid life insurance, short-term and long-term disability 

  • Teladoc 

  • 401(k) with a 100% match up to 4% of salary

  • Generous paid time off and holiday schedule

  • Cell phone reimbursement

  • Tuition reimbursement

  • Subscription to the Calm app

  • MetLife Legal

  • Company-paid commuter benefit: $50 per pay period

Compensation Range:

Compensation will be paid up to $250,000 base salary. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Crusoe Glassdoor Company Review: 3.4 out of 5 stars
Crusoe DE&I Review: No rating
CEO of Crusoe: Chase Lochmiller

Average salary estimate

$225,000 / yearly (est.)
Min: $200,000
Max: $250,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Staff Site Reliability Engineer, Crusoe

At Crusoe Energy Systems, we’re on the frontline of the AI cloud revolution, and we’re searching for a Staff Site Reliability Engineer to join our vibrant team in San Francisco. If you have a passion for robust, reliable systems and an eye for proactive solutions, this role could be perfect for you! As a key player within our Site Reliability Engineering team, your expertise will drive the reliability and performance that Fortune 500 companies depend on. This isn’t just about fixing problems; it’s about preventing them before they affect our customers. You’ll find your days filled with collaboration, creativity, and continuous improvement. After kicking off the day by reviewing performance metrics and alerts, you’ll work with talented engineers, automating routine processes and enhancing system monitoring tools. Your deep understanding of system architecture and coding, especially in languages like Python or Go, will guide various teams in developing resilient applications. You'll relish being part of incident response drills and post-mortems, striving for excellence while fostering a culture of learning. Crusoe values its team members and offers competitive pay, hybrid work options, and comprehensive benefits including restricted stock units, health insurance, and generous paid time off. Join us as we redefine AI cloud infrastructure and contribute to a sustainable future of computing!

Frequently Asked Questions (FAQs) for Staff Site Reliability Engineer Role at Crusoe
What responsibilities does a Staff Site Reliability Engineer have at Crusoe Energy Systems?

The Staff Site Reliability Engineer at Crusoe Energy Systems is responsible for ensuring the reliability and performance of our cloud infrastructure. This role involves detecting, analyzing, and preventing issues, and maintaining high Service Level Agreement (SLA) compliance through Service Level Indicators and Service Level Objectives. SREs at Crusoe automate the resolution of common errors, advise engineering teams on building resilient code, and conduct post-mortems for continuous improvement in our systems.

What qualifications are needed for the Staff Site Reliability Engineer position at Crusoe?

To qualify for the Staff Site Reliability Engineer role at Crusoe Energy Systems, candidates should have 8+ years of professional SRE experience, a strong background in system architecture and design, and excellent programming skills in languages like Python or Go. A bachelor’s degree in Computer Science or related field, or at least 10 years of relevant experience, is also required, along with solid understanding of infrastructure design and modern CI/CD practices.

What is the work culture like for a Staff Site Reliability Engineer at Crusoe Energy Systems?

At Crusoe Energy Systems, the work culture for a Staff Site Reliability Engineer is collaborative and innovative. Team members engage in daily stand-ups to discuss projects and incidents while prioritizing a customer-centric approach. The environment encourages automation and proactive problem solving, fostering continuous improvement and learning through regular post-mortems and incident response drills, making it an ideal place for professionals dedicated to reliability.

How does Crusoe Energy Systems support professional development for Staff Site Reliability Engineers?

Crusoe Energy Systems actively supports professional development for Staff Site Reliability Engineers through various avenues, including tuition reimbursement, access to the Calm app for mental well-being, and a company culture that encourages learning and growth. The hybrid work schedule also allows for remote learning opportunities while ensuring team members can engage collaboratively when necessary.

What benefits does Crusoe Energy Systems offer to Staff Site Reliability Engineers?

Crusoe Energy Systems offers an extensive benefits package for Staff Site Reliability Engineers, including competitive pay, hybrid work flexibility, restricted stock units, comprehensive health insurance options, and a 401(k) plan with a company match. They also foster well-being through paid parental leave, generous paid time off, and reimbursement for cell phone and tuition expenses, making it a rewarding workplace.

Common Interview Questions for Staff Site Reliability Engineer
Can you explain a recent incident you handled as a Site Reliability Engineer?

When discussing a recent incident you've managed, be sure to describe the situation thoroughly: what the incident was, how you identified it, the steps you took for remediation, and what the outcome was. Emphasize your analytical mindset and teamwork in addressing the issue, and don’t forget to mention lessons learned and any preventive measures you implemented afterward.

How do you prioritize tasks as a Staff Site Reliability Engineer?

Prioritize tasks by first assessing the impact and urgency of each issue. Discuss your methods for evaluating SLIs and SLOs and how you use these metrics to inform priorities. It’s essential to explain how you communicate with your team and stakeholders to ensure the most critical problems are addressed promptly while maintaining regular workflows.

What tools do you typically use for monitoring and logging?

Highlight your familiarity with various monitoring and logging tools you've used, such as Prometheus, Grafana, ELK Stack, or Splunk. Provide specific examples of how you’ve used these tools to enhance system reliability, track performance metrics, and identify potential issues proactively in previous roles.

Can you walk us through your process for conducting a post-mortem?

Refer to your structured approach for conducting post-mortems. Outline how you gather data, identify root causes, and involve team members in discussions to ensure everyone’s insights are considered. Highlight that you focus on actionable outcomes and a culture of learning so that the team grows and prevents similar incidents from occurring.

What is your experience with automation in SRE?

Discuss specific automation tools you’ve utilized, such as Terraform or Ansible, and describe instances where automation significantly improved reliability or efficiency. Be prepared to explain how you identify processes to automate and the impact that automation had on your team’s performance.

How do you ensure security best practices are followed in system design?

Talk about your proactive approach to security within infrastructure design. Mention specific practices like threat modeling, regular security audits, and adhering to secure coding standards. Illustrate your experience in collaborating with security teams to ensure all components fulfill compliance requirements while maintaining system performance.

Describe your experience with Docker and Kubernetes.

Outline your familiarity with Docker and Kubernetes, including containerization practices and orchestration. Share specific projects where you implemented these technologies and describe the benefits they provided in terms of scalability and reliability for applications managed within those environments.

What programming languages are you most comfortable with?

When asked about programming languages, specify which languages you are proficient in, such as Python or Go. Provide examples of projects or tasks where you've utilized these languages effectively, emphasizing your understanding of both development and operational trade-offs in SRE.

How do you handle conflicts within your team as a Site Reliability Engineer?

Explain your approach to conflict resolution, which often involves open communication and understanding differing perspectives. Share examples of how you’ve facilitated discussions that led to shared solutions, while maintaining respect for all team members and focusing on how to improve operations together.

What do you think is the most important quality for a Staff Site Reliability Engineer to possess?

Identify qualities like problem-solving skills, attention to detail, and strong communication as key attributes. Discuss how a strong SRE not only needs technical expertise but also the ability to collaborate effectively, anticipate issues, and drive initiatives continuously to improve system reliability and performance.


We’re on a mission to align the future of computation with the future of the climate.

Employment type: Full-time, hybrid
Date posted: January 8, 2025
