Job details

Site Reliability Engineer

Voltage Park’s mission is to make AI infrastructure accessible to all. Today, we own 24,000+ H100s and operate 7+ data centers across the US. We serve customers of all sizes, from small research labs to large enterprises. As part of this effort, we’re hiring a Site Reliability Engineer to build out and operate our core infrastructure, including bare metal provisioning, telemetry, storage, and container / VM orchestration.

To succeed in this role, you will need to be comfortable owning the care and feeding of thousands of GPU servers and related support infrastructure, including logging, analytics, automations, testing, and SOPs. You’ll play a pivotal role as a member of the team, responsible for bringing a substantial amount of infrastructure online across multiple data centers. You’ll also have an important role in defining the company’s culture and ensuring mission success.

This is a fully remote role; however, some overlap with core PST work hours is required. You must be located in the United States, and we are unable to provide visa sponsorship at this time.

Responsibilities

  • At the direction of the Manager of Site Reliability Engineering, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.

  • Deploy updates and improvements to support both Voltage Park’s internal and end customer use cases.

  • Collaborate with colleagues in network engineering, software development, and customer support in a flat organization.

  • Participate in the SRE on-call rotation (1 week on, 5+ weeks off).

Qualifications

  • 8+ years working with Linux as a server / hosting platform. Extra points for Ubuntu experience.

  • 5+ years experience with AWS.

  • 2+ years experience with Kubernetes and strong container fundamentals.

  • 2+ years experience with Terraform and Ansible.

  • 2+ years with network-attached storage management (via NFS, Ceph, or other protocols). Extra points for experience with VAST storage systems.

  • Experience working in a Slack-first, asynchronous remote work environment.

  • Experience with monitoring systems (Prometheus, ELK stack).

  • Familiarity with the GitOps workflow.

  • Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs together.

  • Deep networking fundamentals. Extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.

  • Experience architecting, building, and delivering complex systems from 0 to 1.

  • Adept at balancing pragmatic development and ideal architectures. Effective at navigating tradeoffs between design, risk, cost, and outcomes.

  • Comfortable with navigating ambiguity.

  • Strong written and oral communication.

Ideal Experiences

  • Experience with bare metal hardware troubleshooting and provisioning. Extra points for working with Dell hardware.

  • Experience with GPU servers, either in bare metal form or under virtualization.

  • Deep experience with network switches, routers, and firewalls, particularly SONiC switches and Palo Alto firewalls.

  • Experience with VAST storage systems.

Culture

  • You enjoy working with a small group of friendly, highly motivated, execution-focused colleagues.

  • You’re comfortable with a high degree of autonomy. We expect you to independently prioritize your work and understand how it maps to the overall needs and goals of the company.

  • You’re knowledgeable in your domain but also enjoy wearing multiple hats and venturing outside of your comfort zone when the need arises.

  • You value the ability to write well and understand the importance of good documentation.


Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Average salary estimate

$140,000 / year (est.)
Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer, Voltage Park

At Voltage Park, located in vibrant San Francisco, we're on a mission to make AI infrastructure accessible to all! We currently operate 24,000+ H100 GPUs across more than 7 data centers in the U.S., serving both small research labs and large enterprises alike. We're looking for a talented Site Reliability Engineer to join our dynamic team and take charge of building and maintaining our core infrastructure. In this role, you'll dive into bare metal provisioning, telemetry, storage, and managing container/VM orchestration. If you’re passionate about ensuring that thousands of GPU servers run smoothly, including related analytics and automation processes, then this is the job for you! Your input will directly shape our infrastructure and team culture as we scale up our operations. While this role is fully remote, we do require some overlap with Pacific Standard Time work hours. Please note that candidates must be based in the United States as we cannot provide visa sponsorship at this time. Get ready to contribute significantly to Voltage Park—bringing new platforms online and collaborating with a flat organization of passionate professionals dedicated to customer satisfaction. Join us and be part of something groundbreaking in the AI realm!

Frequently Asked Questions (FAQs) for Site Reliability Engineer Role at Voltage Park
What does a Site Reliability Engineer do at Voltage Park?

As a Site Reliability Engineer at Voltage Park, you'll design and operate the core infrastructure, including dealing with bare metal provisioning and providing robust support for thousands of GPU servers. Your role involves minimizing incidents, deploying updates, and collaborating closely with network engineers and software developers.

What qualifications are needed for the Site Reliability Engineer position at Voltage Park?

Voltage Park seeks a candidate with at least 8 years of Linux experience, 5 years with AWS, and 2+ years with Kubernetes, Terraform, and Ansible. Familiarity with network attached storage, monitoring systems, and software development languages such as Python or Go is also important.

What is the work culture like for a Site Reliability Engineer at Voltage Park?

The culture at Voltage Park is one of collaboration and autonomy. You’ll work with a small, motivated team and be encouraged to prioritize your work independently, all while maintaining strong communication and documentation practices.

Is the Site Reliability Engineer job at Voltage Park remote?

Yes, the Site Reliability Engineer position is fully remote. However, candidates are required to have some overlap with Pacific Standard Time work hours as part of our collaborative environment.

What kind of projects will a Site Reliability Engineer work on at Voltage Park?

In this role, you will work on significant projects aimed at deploying and maintaining new infrastructure platforms. This includes building automation processes, managing analytics, and ensuring the reliability of customer-facing and internal features across our extensive GPU infrastructure.

What tools and technologies should a Site Reliability Engineer be familiar with at Voltage Park?

A Site Reliability Engineer at Voltage Park should be adept in tools like Kubernetes, Terraform, and monitoring systems like Prometheus and the ELK stack. Experience with network attached storage and programming languages for automation—such as Python or Go—is also essential.

Does Voltage Park offer any training or career development for Site Reliability Engineers?

Absolutely! Voltage Park values growth and encourages its Site Reliability Engineers to explore opportunities for professional development, whether through training sessions, new projects, or collaborative team learning.

Common Interview Questions for Site Reliability Engineer
Can you describe your experience with Linux servers?

When discussing your experience with Linux servers, focus on specific distributions you've worked with, particularly Ubuntu, and describe any challenges you faced or notable projects where you improved server performance or security.

How have you utilized AWS in your previous projects?

Provide examples of how you've managed cloud resources in AWS, including the services you used, any automation you've implemented, and how these experiences relate to supporting Voltage Park's infrastructure needs.

Explain your familiarity with Kubernetes and container orchestration.

Detail your experience with deploying applications in Kubernetes environments, including any specific issues you've overcome and how you maintain service reliability in container-based applications.

What is your process for troubleshooting performance issues in a data center?

Outline a systematic approach to performance troubleshooting, including how you collect metrics, analyze data, and what tools you employ. You may also want to share a specific instance where your troubleshooting led to a significant resolution.

Discuss your experience with automation tools like Terraform and Ansible.

Focus on specific scenarios where you effectively utilized Terraform or Ansible for infrastructure automation. Mention how these tools improved efficiency and reduced manual errors in your workflows.

How do you ensure good documentation in your SRE workflows?

Emphasize the importance of documentation and detail the specific practices and tools you use to create and maintain comprehensive documentation as part of your routine.

What steps do you take to remain updated with new technologies in site reliability engineering?

Share how you keep yourself informed about the latest trends and technologies in the industry, such as attending webinars, participating in relevant forums, and continuously experimenting with new tools.

How do you prioritize workloads in a remote environment?

Discuss effective time management strategies that enable you to work independently while still aligning your priorities with team goals, perhaps through tools like priority matrices.

What experience do you have with monitoring and alerting systems?

Highlight the monitoring tools you’ve used, such as Prometheus or the ELK stack, and share how you’ve set up alerts for critical metrics and how you respond when alerts are triggered.

How do you approach collaborating with cross-functional teams?

Illustrate your collaborative style by sharing examples where teamwork led to successful project outcomes, highlighting communication strategies you find effective in a flat organization.


Voltage Park is building a new class of cloud infrastructure from the ground up. Join us, we're hiring!

Employment type: Full-time, remote
Date posted: November 27, 2024
