
Site Reliability Engineer

Be Part of Building the Future

Dremio is the unified lakehouse platform for self-service analytics and AI, serving hundreds of global enterprises, including Maersk, Amazon, Regeneron, NetApp, and S&P Global. Customers rely on Dremio for cloud, hybrid, and on-prem lakehouses to power their data mesh, data warehouse migration, data virtualization, and unified data access use cases. Based on open source technologies, including Apache Iceberg and Apache Arrow, Dremio provides an open lakehouse architecture enabling the fastest time to insight and platform flexibility at a fraction of the cost.  Learn more at www.dremio.com.

About the role

Dremio’s SREs ensure that internal and externally visible services have reliability and uptime appropriate to users' needs, along with a fast rate of improvement. You will join a small team of experienced SREs helping to deliver a world-class experience to Dremio Cloud customers. Our systems, like many, are joint cognitive systems made up of both people and software: complex and therefore intrinsically hazardous. We understand and expect that catastrophe is always just around the corner.

What you’ll be doing

  • Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm.
  • Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure.
  • Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews.
  • Help define and instrument Service Level Indicators and Objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams.
  • Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product.
  • Debug and optimize code written by others, and automate routine tasks. You recognize complexity, are familiar with multiple techniques to manage it, and recognize the folly of complete rewrites.
  • Evangelize and advocate for resilience engineering and reliability practices across our organization.
  • Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Join an on-call rotation for systems and services that the SRE team owns.
  • Practice sustainable incident response and post-incident investigation and analysis.
  • Drive the cultural, technical, and process changes to move towards a true continuous delivery model within the company. 

What we’re looking for

  • 3+ years of relevant experience in the following areas: SRE, DevOps, Distributed Systems, Cloud Operations, Software Engineering.
  • Familiarity with Kubernetes, Istio, Terraform, and ArgoCD/Flux.
  • Familiarity with software-defined networking infrastructure: dedicated and partner interconnects, VPNs, BGP.
  • Excellent command of cloud services on GCP/AWS/Azure and of CI/CD pipelines.
  • Moderate-to-advanced experience in Python/Go, and at least a reading knowledge of Java.
  • You are interested in designing, analyzing and troubleshooting large-scale distributed systems.
  • You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, drive, and determination.
  • You have a great ability to debug and optimize code and automate routine tasks.

Bonus points if you have

  • Hands-on experience with large-scale production Kubernetes clusters (<=1000 nodes). 
  • You have developed SLIs/SLOs for production systems.

Return to Office Philosophy

Workplace Wednesdays - to break down silos, build relationships and improve cross-team communication. Lunch catering or meal credits are provided in the office, and local socials are aligned with Workplace Wednesdays. In general, Dremio will remain a hybrid work environment. We will not be implementing a 100% (5 days a week) return-to-office policy for all roles.


What we value 

At Dremio, we hold ourselves to high standards when it comes to People, Thinking, and Action. Our Gnarlies (that's what we call our employees) communicate with clarity, drive accountability, and are respectful towards each other. We confront brutal facts and focus on results while operating with a sense of urgency and building a "flywheel". People who like to jump in and drive momentum will thrive in our #GnarlyLife.

Dremio is an equal opportunity employer supporting workforce diversity. We do not discriminate on the basis of race, religion, color, national origin, gender identity, sexual orientation, age, marital status, protected veteran status, disability status, or any other unlawful factor.

Dremio is committed to providing any necessary accommodations for individuals with disabilities within our application and interview process. To request accommodation due to a disability, please inform your recruiter.

Dremio has policies in place to protect the personal information that employees and applicants disclose to us. Please click here to review the privacy notice. 

Important Security Notice for Candidates

At Dremio, we uphold trust and transparency as paramount values in all our interactions with customers, partners, employees, and the general public. We have been targeted by individuals creating fake domains similar to ours to scam prospects and candidates. Please note that all official communications from us will be from an @dremio.com domain. If you suspect you've been targeted by a scam, it's imperative to report the incident to your local law enforcement agencies. For more information about this type of scam, please refer to Dremio's official statement here.

Dremio is not responsible for any fees related to unsolicited resumes and will not pay fees to any third-party agency or company that does not have a signed agreement with the Company.

Dremio Glassdoor Company Review: 3.8 / 5
Dremio DE&I Review (Glassdoor): 3.5 / 5
CEO of Dremio: Billy Bosworth
Approve of CEO

Average salary estimate

$70,000 / year (est.)
Range: $60,000 (min) to $80,000 (max)

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer, Dremio

Join Dremio as a Site Reliability Engineer in Hyderabad, Telangana, where you'll be a pivotal part of building a future-proof unified lakehouse platform for self-service analytics and AI! We’re trusted by major global enterprises like Maersk and Amazon, powered by cutting-edge open source technologies including Apache Iceberg and Apache Arrow. Our SRE team ensures that our services are reliable and meet the fast-paced needs of users. You’ll work closely with experienced engineers to enhance our systems' resiliency and performance. Your role will involve driving improvements in Kubernetes and GitOps deployment, collaborating with engineering to ensure services are ready for launch, and supporting our networking needs across major cloud platforms. With your expertise in coding and automation, you’ll contribute to debugging and optimizing our systems, aiming for seamless scalability. If you’re passionate about reliability engineering and eager to push for impactful changes, join us on this journey at Dremio, where your skills in system architecture and incident response will shine. You’ll also be part of a culture that values collaboration, transparency, and high standards, all while enjoying a flexible hybrid work environment. Let’s innovate together and redefine the data management landscape!

Frequently Asked Questions (FAQs) for Site Reliability Engineer Role at Dremio
What are the responsibilities of a Site Reliability Engineer at Dremio?

As a Site Reliability Engineer at Dremio, you'll drive continuous improvements in our Kubernetes usage, collaborate with engineering on service readiness, and develop automation strategies. You'll also establish Service Level Indicators and Objectives (SLIs/SLOs), participate in incident response, and promote resilience engineering best practices across the organization.

What qualifications are required for the Site Reliability Engineer position at Dremio?

To qualify for the Site Reliability Engineer role at Dremio, you should have at least 3 years of experience in SRE, DevOps, or similar fields. Familiarity with tools and technologies like Kubernetes, Istio, Terraform, and cloud services such as GCP, AWS, and Azure is crucial, along with a strong command of programming languages like Python or Go.

How does Dremio support employees in a hybrid work environment?

Dremio embraces a hybrid work model, allowing employees to work from home while also fostering team collaboration through Workplace Wednesdays. These in-office days promote relationship-building and communication, complemented by provided meal credits and local social events to enhance community.

What is the on-call rotation like for Site Reliability Engineers at Dremio?

Site Reliability Engineers at Dremio participate in an on-call rotation to support the systems and services they manage. This entails responding to incidents, conducting post-incident analyses, and implementing improvements for future reliability, ensuring a healthy work-life balance through structured management of on-call duties.

How does Dremio promote a culture of continuous improvement in reliability engineering?

At Dremio, the culture of continuous improvement in reliability engineering is ingrained within the SRE team. Engineers are encouraged to advocate for reliability practices, perform post-mortem analyses, and lead initiatives that drive technical and process changes aimed at enhancing service reliability and operational efficiency.

Common Interview Questions for Site Reliability Engineer
Can you explain your experience with Kubernetes and how it relates to the Site Reliability Engineer role at Dremio?

Speak to specific projects where you utilized Kubernetes, detailing how you managed deployments, scaled applications, or resolved issues. Highlight your understanding of best practices in Kubernetes environments and how you might apply these experiences to improve Dremio’s service reliability.

What strategies do you use for incident response as a Site Reliability Engineer?

Discuss your systematic approach to incident response, highlighting processes such as defining escalation paths, conducting post-incident reviews, and improving documentation for faster resolution times. Emphasize teamwork, communication, and how you incorporate lessons learned into ongoing practices.

How do you establish and monitor SLIs and SLOs in cloud environments?

Explain your experience identifying key metrics based on user experience and service performance. Share methodologies you've used for tracking SLIs/SLOs, setting targets, and how monitoring tools enable proactive management of service reliability.

Describe your experience debugging and optimizing code. Can you provide an example?

Share a specific instance where you identified and resolved issues in a codebase, discussing the tools and methodologies you employed. Highlight your troubleshooting skills and your thought process in making optimizations that resulted in improved system performance.

What does resilience engineering mean to you, and how would you promote it at Dremio?

Define resilience engineering in your own words, emphasizing its importance in maintaining uptime and reliability. Offer ideas on fostering a culture of resilience, such as leading workshops, sharing case studies, or proposing frameworks to embed resilience into engineering processes.

How familiar are you with GitOps and its practices?

Explain the principles of GitOps and your experience applying them. Share examples where you have implemented GitOps workflows to automate deployments, enhance collaboration among teams, and ensure consistency and traceability in infrastructure as code.

Can you give an example of a complex problem you solved in a distributed system?

Detail a specific scenario involving a challenge in a distributed system, describing how you approached the problem with a methodical analysis and the steps taken to find a solution. Emphasize teamwork and outcomes that improved system reliability.

What tools do you find most effective for observability and monitoring?

Discuss the observability tools you have found most effective, such as Prometheus, Grafana, or New Relic, and explain how you've used them to track system health, analyze performance data, and increase reliability within past roles.

How do you ensure smooth collaboration with engineering teams?

Talk about your communication strategies and practices to ensure meaningful collaboration with engineering teams. Highlight the importance of setting clear expectations, conducting regular meetings, and ensuring alignment on service readiness prior to deployment.

What makes you a strong candidate for the Site Reliability Engineer position at Dremio?

Summarize your relevant experience, demonstrating how it aligns with Dremio’s goals. Discuss your passion for reliability engineering, your collaborative spirit, and your commitment to continuous improvement as key aspects that would contribute to the SRE team’s success.

Similar Jobs
  • Dremio, Remote (Portugal), posted 7 days ago
  • Inetum, Remote (Paris, France), posted 12 days ago
  • Veolia Environnement SA, Hybrid (461 From Rd, Paramus, NJ 07652, USA), posted 14 days ago

Dremio revolutionizes analytics by offering a user-friendly and open data lakehouse that merges data warehouse capabilities with the flexibility of data lakes, enhancing self-service analytics and speeding up insights across all data sources.

52 jobs
BADGES
Changemaker, Diversity Champion, Flexible Culture, Global Citizen
CULTURE VALUES
Inclusive & Diverse
Collaboration over Competition
Growth & Learning
Fast-Paced
Transparent & Candid
BENEFITS & PERKS
Medical Insurance
Dental Insurance
Vision Insurance
401K Matching
Disability Insurance
Paid Time-Off
Paid Volunteer Time
Flex-Friendly
Maternity Leave
Paternity Leave
Paid Holidays
EMPLOYMENT TYPE
Full-time, hybrid
DATE POSTED
March 27, 2025
