
Senior Site Reliability Engineer

About

We're the San Francisco Compute Company. We're building the first real-time trading platform for compute. Everyone from startups to enterprises to research labs and individuals can buy and sell compute, from 1 to 1000+ nodes, for anywhere from an hour to over a year. With our liquid market to resell unused compute hours, buyers no longer need to worry about contract lock-in and providers have fewer idle nodes. Over the next decade, we anticipate thousands of enterprises, governments, startups, and labs will be training and serving large models, and we’re building a team to scale our market.

About the Role

ML training clusters are some of the highest-performance computers on the planet. Even relatively small clusters would have been in the TOP500 five years ago. Our supercomputing team is responsible for keeping our compute clusters running smoothly, monitoring hardware health, and fixing things when they go wrong. We believe strongly in automation: code is the only reliable way to manage hardware at scale. As we scale, this will become a more data-driven role, focused on predicting failures before they happen. We’re a small team, so you’ll also spend time talking to customers.

About You

  • You’ve managed at least one GPU training cluster in the past (ideally one with >1k GPUs, but this is not required)

  • You appreciate and value good documentation

  • You have experience provisioning and managing Kubernetes clusters

  • You deeply understand Linux, networking fundamentals, CUDA, NCCL, and InfiniBand

  • You enjoy creating large self-correcting systems that keep hardware humming

  • You meet at least two of the nice-to-haves below

Some Nice to Haves

  • Experience with Go or Rust (>2 years)

  • Experience with distributed storage systems (Weka, VAST, Ceph, etc.)

  • Experience with HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.)

  • Experience with Linux virtualization (KVM, QEMU, libvirt, etc.)

  • Experience with performance optimization of machine learning kernels

Benefits

Unlimited office book budget

You can buy as many books for the office as you want. You’re encouraged to spend time during the workday reading!

Generous equity grant

Team members are offered a competitive salary along with equity in the company

Retirement matching

We match 401(k) plans up to 4%

Medical, dental & vision

We offer competitive medical, dental, and vision insurance for employees and dependents and cover 100% of premiums

Time off

We offer unlimited paid time off as well as 10+ observed holidays

Parental leave

We offer biological, adoptive, and foster parents paid time off to spend quality time with family

Daily lunch

We cover lunch daily for employees

Visa Sponsorships

Yes, we sponsor visas and work permits

The San Francisco Compute Company is committed to maintaining a workplace free from discrimination and harassment.

We make employment decisions based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, belief, national origin, social or ethnic origin, age, physical, mental, or sensory disability, sexual orientation, gender identity or expression, marital status, civil union or domestic partnership status, past or present military service, HIV status, family medical history or genetic information, family or parental status including pregnancy, or any other status protected by law.

We welcome the opportunity to consider qualified applicants with prior arrest or conviction records. Our commitment to diversity includes hiring talented individuals regardless of their criminal history, in accordance with local, state, and federal laws, including San Francisco’s Fair Chance Ordinance and California’s ban-the-box laws.

If you require reasonable accommodation for any reason, please reach out to us at team@sfcompute.com.

Average salary estimate

$140,000 per year (est.)
Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Senior Site Reliability Engineer, The San Francisco Compute Company

At San Francisco Compute Company, we're on the cutting edge of technology, ready to revolutionize the computing market with our innovative real-time trading platform. As a Senior Site Reliability Engineer, you'll play a pivotal role in maintaining the pristine performance of our ML training clusters, which are among the most powerful on the planet. This isn’t just about keeping the lights on; it's about ensuring our supercomputing team runs like a well-oiled machine, making full use of automation to enhance the management of hardware at scale. We're looking for someone who has hands-on experience managing high-performance GPU training clusters, ideally with a robust understanding of Linux, networking fundamentals, and technologies like CUDA and NCCL. You'll love our collaborative environment where talking to customers is as vital as the technical work you do. With responsibilities that lean into predictive analytics and troubleshooting, this role is perfect for someone who thrives on creating systems that not only perform well but can self-correct, minimizing downtime and maximizing efficiency. Beyond the tech, you’ll be part of a company that believes in your growth. We offer unlimited office book budgets, generous equity grants, a 401(k) match, comprehensive health benefits, and a culture that values your time off. Come join us in transforming the computing landscape for enterprises and individuals alike, all while enjoying a work environment that champions continuous learning and inclusivity.

Frequently Asked Questions (FAQs) for Senior Site Reliability Engineer Role at The San Francisco Compute Company
What responsibilities does a Senior Site Reliability Engineer have at San Francisco Compute Company?

As a Senior Site Reliability Engineer at San Francisco Compute Company, you'll be responsible for maintaining the high performance of our ML training clusters. This includes monitoring hardware health, troubleshooting issues, automating hardware management, and communicating directly with customers. Your role will evolve to become more data-driven as you work to predict failures before they occur, ensuring our systems operate smoothly and efficiently.

What qualifications are necessary to apply for the Senior Site Reliability Engineer position at San Francisco Compute Company?

To apply for the Senior Site Reliability Engineer position at San Francisco Compute Company, candidates should have experience managing GPU training clusters, preferably clusters exceeding 1,000 GPUs. A strong understanding of Linux, Kubernetes, and networking fundamentals is essential. Additionally, familiarity with technologies like CUDA, NCCL, and InfiniBand will be advantageous. Nice-to-haves include experience with Go or Rust and distributed storage systems.

How does San Francisco Compute Company support the professional growth of Senior Site Reliability Engineers?

San Francisco Compute Company is committed to the professional growth of its Senior Site Reliability Engineers. We offer an unlimited office book budget, ensuring that you always have access to the latest knowledge in your field. Additionally, our generous equity grants and comprehensive benefits package, including retirement matching and paid time off, reflect our dedication to your long-term success and well-being.

What technologies should a Senior Site Reliability Engineer be proficient in at San Francisco Compute Company?

A Senior Site Reliability Engineer at San Francisco Compute Company should be proficient in managing Kubernetes clusters, understand Linux deeply, and know networking fundamentals. Familiarity with GPU computing technologies like CUDA, NCCL, and InfiniBand is crucial. Additional proficiency in distributed storage systems and Linux virtualization can benefit your role significantly.

What is the work culture like for a Senior Site Reliability Engineer at San Francisco Compute Company?

The work culture at San Francisco Compute Company is collaborative and innovative, emphasizing continuous learning and inclusivity. Senior Site Reliability Engineers are encouraged to communicate regularly with customers and seek out resources that enhance their technical expertise. The company fosters a supportive environment where your ideas are valued and you can enjoy workplace perks like unlimited time off and daily lunches.

Common Interview Questions for Senior Site Reliability Engineer
What experience do you have managing GPU training clusters?

When answering this question, share specific details about the GPU clusters you’ve managed, including the size and technology used. Highlight any challenges you faced and how you overcame them, focusing on your problem-solving abilities and your understanding of the complexities involved in cluster management.

How do you ensure the reliability of high-performance compute clusters?

Discuss your approach to reliability, including monitoring tools you use, best practices for maintenance, and your strategies for automation. It's important to convey your proactive mindset and how you prioritize minimizing downtime while optimizing performance.

Can you describe your experience with Kubernetes?

In your answer, provide an overview of your hands-on experience with Kubernetes, detailing how you’ve used it for deploying and managing applications in production environments. Cite specific examples that illustrate your understanding of orchestration, scaling, and troubleshooting.

What steps would you take to diagnose a failure in a compute cluster?

Outline a methodical approach to diagnosing cluster failures. Include steps such as checking logs, analyzing system metrics, and using monitoring tools to identify bottlenecks. Emphasize your analytical skills and experience with similar situations to provide reassurance to the interviewer.

Describe a project where you implemented automation to improve performance.

Explain a specific project where your automation efforts led to measurable improvements in performance or reliability. Describe the tools used, the challenges you faced, and how your solution made an impact, demonstrating your ability to innovate within the role.

How do you keep up with the latest technologies in site reliability engineering?

Share your methods for staying current with industry trends, such as attending conferences, participating in online forums, or taking courses. Emphasize your passion for continuous learning and how it enhances your ability to contribute to the team.

What have you learned from managing documentation for infrastructure?

Discuss the importance of good documentation in ensuring clear communication and effective operations within technical teams. Share specific examples where documentation played a critical role in onboarding, troubleshooting, or improving processes.

What role does customer communication play in your position?

Emphasize the significance of customer communication in understanding their needs and improving system reliability. Share examples of how direct feedback has led to enhancements in your previous roles, showcasing your customer-centric approach.

Can you provide an example of performance optimization you conducted?

Be prepared to discuss a specific instance where you successfully optimized performance, detailing the problem, the steps you took, and the results achieved. This demonstrates your technical skills and your commitment to achieving excellence.

What is your experience with distributed storage systems?

If applicable, share your experience with different distributed storage systems, detailing your role in implementing or managing them. Discuss the challenges you faced and how you effectively addressed them to enhance system reliability and performance.

A large, low-cost H100 cluster you can rent by the hour

Employment type: Full-time, on-site
Date posted: December 26, 2024
