Job details

Senior Site Reliability Engineer

About

We're the San Francisco Compute Company. We're building the first real-time trading platform for compute. Everyone from startups to enterprises to research labs and individuals can buy and sell compute, from 1 to 1,000+ nodes, for anywhere from an hour to over a year. With our liquid market for reselling unused compute hours, buyers no longer need to worry about contract lock-in and providers have fewer idle nodes. Over the next decade, we anticipate thousands of enterprises, governments, startups, and labs will be training and serving large models, and we’re building a team to scale our market.

About the Role

ML training clusters are some of the highest-performance computers on the planet. Even relatively small clusters would have ranked in the TOP500 five years ago. Our supercomputing team is responsible for keeping our compute clusters running smoothly, monitoring hardware health, and fixing things when they go wrong. We believe strongly in automation: code is the only reliable way to manage hardware at scale. As we scale, this will become a more data-driven role focused on predicting failures before they happen. We’re a small team, so you’ll also spend time talking to customers.

About You

You’ve managed at least one GPU training cluster in the past (ideally one with more than 1,000 GPUs, though this is not required)
You appreciate and value good documentation
You have experience provisioning and managing Kubernetes clusters
You deeply understand Linux, networking fundamentals, CUDA, NCCL, and InfiniBand
You enjoy creating large self-correcting systems that keep hardware humming
You meet at least two of the nice-to-haves below

Some Nice to Haves

Experience with Go or Rust (>2 years)
Experience with distributed storage systems (Weka, VAST, Ceph, etc.)
Experience with HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.)
Experience with Linux virtualization (KVM, QEMU, libvirt, etc.)
Experience with performance optimization of machine learning kernels

Benefits

Unlimited office book budget

You can buy as many books for the office as you want. You’re encouraged to spend time during the workday reading!

Generous equity grant

Team members are offered a competitive salary along with equity in the company

Retirement matching

We match 401(k) contributions up to 4%

Medical, dental & vision

We offer competitive medical, dental, and vision insurance for employees and dependents, and we cover 100% of premiums

Time off

We offer unlimited paid time off as well as 10+ observed holidays

Parental leave

We offer biological, adoptive, and foster parents paid time off to spend quality time with family

Daily lunch

We cover lunch daily for employees

Visa Sponsorships

Yes, we sponsor visas and work permits

The San Francisco Compute Company is committed to maintaining a workplace free from discrimination and harassment.

We make employment decisions based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, belief, national origin, social or ethnic origin, age, physical, mental, or sensory disability, sexual orientation, gender identity or expression, marital status, civil union or domestic partnership status, past or present military service, HIV status, family medical history or genetic information, family or parental status including pregnancy, or any other status protected by law.

We welcome the opportunity to consider qualified applicants with prior arrest or conviction records. Our commitment to diversity includes hiring talented individuals regardless of their criminal history, in accordance with local, state, and federal laws, including San Francisco’s Fair Chance Ordinance and California’s ban-the-box laws.

If you require reasonable accommodation for any reason, please reach out to us at team@sfcompute.com.

Average salary estimate

$140,000 / year (est.)

min: $120,000

max: $160,000


What You Should Know About Senior Site Reliability Engineer, The San Francisco Compute Company

At the San Francisco Compute Company, we're building a real-time trading platform for compute. As a Senior Site Reliability Engineer, you'll play a central role in keeping our ML training clusters, which are among the most powerful computers on the planet, running at full performance. This isn’t just about keeping the lights on; it's about helping our supercomputing team use automation to manage hardware at scale. We're looking for someone who has hands-on experience managing high-performance GPU training clusters and a deep understanding of Linux, networking fundamentals, and technologies like CUDA and NCCL. We're a small team, so talking to customers is part of the job alongside the technical work. With responsibilities that lean toward failure prediction and troubleshooting, this role suits someone who enjoys building systems that self-correct, minimizing downtime and maximizing efficiency.

Beyond the tech, we offer an unlimited office book budget, a generous equity grant, a 401(k) match, medical, dental, and vision coverage, and unlimited paid time off. Come join us in building the compute market for enterprises and individuals alike, in a work environment that encourages continuous learning and inclusivity.

Frequently Asked Questions (FAQs) for Senior Site Reliability Engineer Role at The San Francisco Compute Company

What responsibilities does a Senior Site Reliability Engineer have at the San Francisco Compute Company?

As a Senior Site Reliability Engineer at the San Francisco Compute Company, you'll be responsible for keeping our ML training clusters running smoothly. This includes monitoring hardware health, troubleshooting issues, automating hardware management, and communicating directly with customers. As the company scales, the role will become more data-driven, with a focus on predicting failures before they occur.