Job details

Senior Site Reliability Engineer

Be Part of Building the Future

Dremio is the unified lakehouse platform for self-service analytics and AI, serving hundreds of global enterprises, including Maersk, Amazon, Regeneron, NetApp, and S&P Global. Customers rely on Dremio for cloud, hybrid, and on-prem lakehouses to power their data mesh, data warehouse migration, data virtualization, and unified data access use cases. Based on open source technologies, including Apache Iceberg and Apache Arrow, Dremio provides an open lakehouse architecture enabling the fastest time to insight and platform flexibility at a fraction of the cost.  Learn more at www.dremio.com.

About the role

Dremio’s SREs ensure that internal and externally visible services have reliability and uptime appropriate to users' needs and a fast rate of improvement. You will be joining a small but mighty team of experienced SREs helping to deliver a world-class experience to Dremio Cloud customers. Our systems, like many, are joint-cognitive, made up of both people and software: complex and therefore intrinsically hazardous. We understand and expect that catastrophe is always just around the corner.

What you’ll be doing

  • Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm.
  • Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure.
  • Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews.
  • Help define and instrument Service Level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams.
  • Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product.
  • Debug and optimize code written by others and automate routine tasks. You recognize complexity and are familiar with multiple techniques to manage it, but you also recognize the folly of complete rewrites.
  • Evangelize and advocate for resilience engineering and reliability practices across our organization.
  • Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Join an on-call rotation for systems and services that the SRE team owns.
  • Practice sustainable incident response and post-incident investigation and analysis.
  • Drive the cultural, technical, and process changes to move towards a true continuous delivery model within the company. 

What we’re looking for

  • 10+ years of relevant experience in the following areas: SRE, DevOps, Distributed Systems, Cloud Operations, Software Engineering.
  • Expertise in Kubernetes, Istio, Terraform, Terragrunt, ArgoCD/Flux.
  • Expertise with software defined networking infrastructure: dedicated and partner interconnects, VPNs, BGP.
  • Excellent command of cloud services on GCP/AWS/Azure, CI/CD pipelines.
  • Moderate to advanced experience in Python or Go, and at least a reading knowledge of Java.
  • You are interested in designing, analyzing and troubleshooting large-scale distributed systems.
  • You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, drive, and determination.
  • You have a great ability to debug and optimize code and automate routine tasks.
  • You have a solid background in software development and architecting resilient and reliable applications.

Bonus points if you have

  • Hands-on experience with large-scale production Kubernetes clusters (>=1000 nodes).
  • Experience developing SLIs/SLOs for production systems.

Return to Office Philosophy

Workplace Wednesdays are intended to break down silos, build relationships and improve cross-team communication. Lunch catering or meal credits are provided in the office, and local socials align with Workplace Wednesdays. In general, Dremio will remain a hybrid work environment. We will not be implementing a 100% (5 days a week) return to office policy for all roles.


What we value 

At Dremio, we hold ourselves to high standards when it comes to People, Thinking, and Action. Our Gnarlies (that's what we call our employees) communicate with clarity, drive accountability, and are respectful towards each other. We confront brutal facts and focus on results while operating with a sense of urgency and building a "flywheel". People who like to jump in and drive momentum will thrive in our #GnarlyLife.

Dremio is an equal opportunity employer supporting workforce diversity. We do not discriminate on the basis of race, religion, color, national origin, gender identity, sexual orientation, age, marital status, protected veteran status, disability status, or any other unlawful factor.

Dremio is committed to providing any necessary accommodations for individuals with disabilities within our application and interview process. To request accommodation due to a disability, please inform your recruiter.

Dremio has policies in place to protect the personal information that employees and applicants disclose to us. Please click here to review the privacy notice. 

Important Security Notice for Candidates

At Dremio, we uphold trust and transparency as paramount values in all our interactions with customers, partners, employees, and the general public. We have been targeted by individuals creating fake domains similar to ours to scam prospects and candidates. Please note that all official communications from us will be from an @dremio.com domain. If you suspect you've been targeted by a scam, it's imperative to report the incident to your local law enforcement agencies. For more information about this type of scam, please refer to Dremio's official statement here.

Dremio is not responsible for any fees related to unsolicited resumes and will not pay fees to any third-party agency or company that does not have a signed agreement with the Company.

Dremio Glassdoor Company Review: 3.8 out of 5 stars
Dremio DE&I Review: 3.5 out of 5 stars
CEO of Dremio: Billy Bosworth

Average salary estimate

$135,000 / year (est.)
Min: $120,000
Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Senior Site Reliability Engineer, Dremio

Join Dremio as a Senior Site Reliability Engineer in Hyderabad, Telangana, where you'll be part of an exciting journey in building the future of data analytics. At Dremio, we’re on a mission to provide organizations with a unified lakehouse platform that empowers their self-service analytics and AI pursuits. As an SRE, you’ll become part of a dynamic team dedicated to enhancing the reliability and performance of our services. You'll have the opportunity to drive continuous improvements in Kubernetes deployment and collaborate with engineering teams to ensure our services are flawlessly delivered. Your expertise will extend into networking, service mesh, and cloud connectivity, allowing you to shape how services interact across platforms like GCP, AWS, and Azure. You’ll also play a pivotal role in defining and implementing Service Level Indicators (SLIs) and Objectives (SLOs), ensuring high service quality and reliability. If you’re passionate about resilience engineering and improving efficiency through automation, this is the role for you. By maintaining a blend of strategic insight and deep technical skills, you’ll not only enhance our systems but also advocate for best practices across the organization. We believe in a hybrid work environment that promotes collaboration, allowing you to thrive whilst remaining flexible. If you’re ready to take the next step in your career and make a real impact, we’d love to talk to you!

Frequently Asked Questions (FAQs) for Senior Site Reliability Engineer Role at Dremio
What are the key responsibilities of a Senior Site Reliability Engineer at Dremio?

As a Senior Site Reliability Engineer at Dremio, your key responsibilities include driving improvements to Kubernetes and service mesh systems, collaborating with engineering teams to ensure production readiness, and developing SLO-based on-call strategies. You will also be involved in incident response, observability improvements, and advocating for reliability practices throughout the organization.

What qualifications are needed for the Senior Site Reliability Engineer position at Dremio?

The ideal candidate for the Senior Site Reliability Engineer position at Dremio should have over 10 years of relevant experience in SRE, DevOps, or Cloud Operations. Expertise in Kubernetes, Terraform, and cloud services like GCP, AWS, and Azure is essential. Proficiency in programming languages, particularly Python or Go, is necessary, along with strong problem-solving and communication skills.

How does Dremio support professional development for Senior Site Reliability Engineers?

Dremio values the growth of its employees, including Senior Site Reliability Engineers, by fostering a collaborative environment that supports continuous learning. You will have access to various resources, including mentorship opportunities, training programs, and the chance to work on innovative projects that not only enhance your skills but also contribute significantly to the company's success.

What is the work culture like for a Senior Site Reliability Engineer at Dremio?

The work culture for a Senior Site Reliability Engineer at Dremio is built on collaboration, accountability, and open communication. The team enjoys a supportive atmosphere where ideas are freely shared. Dremio encourages a hybrid work model, allowing flexibility while promoting teamwork through events like Workplace Wednesdays, aimed at strengthening cross-team relationships.

What tools and technologies will I work with as a Senior Site Reliability Engineer at Dremio?

In the role of Senior Site Reliability Engineer at Dremio, you will be working with a range of cutting-edge tools and technologies, including Kubernetes, Istio, Terraform, ArgoCD, and various cloud services. Your role will also involve using monitoring and observability tools to enhance the reliability and performance of our systems within the cloud environment.

Common Interview Questions for Senior Site Reliability Engineer
Can you explain how your experience with Kubernetes aligns with the Senior Site Reliability Engineer role at Dremio?

To effectively answer this question, discuss specific projects where you utilized Kubernetes to manage applications. Highlight your experience with deployments, scaling, and troubleshooting within Kubernetes, and how these skills can contribute to ensuring reliability at Dremio.

What strategies do you use for incident response as a Senior Site Reliability Engineer?

When responding to this question, emphasize your approach to incident response, such as maintaining clear logs, using monitoring tools, and conducting post-incident analyses. Sharing real-life examples of how you've improved response times or reduced downtime can also make your answer stand out.

How do you define and implement Service Level Objectives (SLOs) in production systems?

Discuss your understanding of SLOs and the importance of aligning them with customer needs. Explain a step-by-step approach you’ve taken to define SLOs based on user feedback and service reliability, and how you monitor them using performance metrics.
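
As one illustration of the monitoring step described above, the sketch below computes an availability SLI from request counts and compares it with an SLO target. It is a minimal Python sketch, not Dremio's method: the function name `evaluate_slo`, the `SloReport` type, the 99.9% target, and the request counts are all hypothetical, and the error budget calculation follows common SRE convention rather than anything stated in this posting.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Result of comparing a measured SLI against an SLO target (hypothetical type)."""
    sli: float               # measured fraction of good events
    target: float            # SLO target, e.g. 0.999
    budget_remaining: float  # fraction of the error budget left; negative if overspent

    @property
    def met(self) -> bool:
        return self.sli >= self.target


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compute an availability-style SLI and the share of error budget remaining."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the number of bad events the SLO tolerates in this window.
    allowed_bad = (1.0 - target) * total_events
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, target=target, budget_remaining=budget_remaining)


# Assumed example window: 1,000,000 requests, 500 of them failed, 99.9% target.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, target=0.999)
print(f"SLI={report.sli:.4%} met={report.met} budget_left={report.budget_remaining:.0%}")
```

In an interview answer, a calculation like this can anchor the discussion of how often the SLI is measured, which events count as "good", and what the team does when the remaining budget approaches zero.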

What is your experience with automation in SRE practices?

Here, focus on specific automation tools and scripts you’ve implemented to reduce manual monitoring and deployments. Explain how automation improved system reliability or operational efficiency in your previous roles, providing metrics if possible.

Can you describe a challenging problem you solved in a previous SRE role?

Share a detailed account of a significant challenge you faced, illustrating the problem, your thought process, and the eventual outcome. Highlight the technical skills you employed and the impact of your solution on system performance.

In what ways do you ensure effective collaboration with engineering teams?

As you answer, highlight specific communication tools and methodologies you employ to build relationships with engineers. Discuss your experience in leading cross-functional meetings, promoting joint ownership of projects, and sharing insights to improve system design.

How do you stay updated with the latest trends in site reliability engineering?

Mention your proactive approach to learning, such as attending industry conferences, subscribing to relevant blogs, engaging in online communities, and participating in certification programs. This shows your dedication to remaining informed and enhancing your expertise.

What are the key metrics you analyze in a cloud environment to ensure reliability?

Discuss the critical performance metrics you track, such as uptime, response times, error rates, and resource utilization. Explain how analyzing these metrics helps you identify potential issues and improve the overall performance of cloud services.

Can you share your thoughts on resilience engineering and its importance?

Elaborate on the concept of resilience engineering, emphasizing its significance in preparing systems for unexpected challenges. Share examples of methodologies you’ve employed to enhance resilience within infrastructure and services, and why this matters for both customers and business continuity.

How do you manage workload during on-call rotations?

Describe your strategies for managing stress and workload during on-call rotations, such as setting priorities, maintaining documentation, and using tools for efficient incident management. Share an instance where effective management made a significant difference in your response.


Dremio revolutionizes analytics by offering a user-friendly and open data lakehouse that merges data warehouse capabilities with the flexibility of data lakes, enhancing self-service analytics and speeding up insights across all data sources.

BADGES
Changemaker
Diversity Champion
Flexible Culture
Global Citizen
CULTURE VALUES
Inclusive & Diverse
Collaboration over Competition
Growth & Learning
Fast-Paced
Transparent & Candid
BENEFITS & PERKS
Medical Insurance
Dental Insurance
Vision Insurance
401K Matching
Disability Insurance
Paid Time-Off
Paid Volunteer Time
Flex-Friendly
Maternity Leave
Paternity Leave
Paid Holidays
EMPLOYMENT TYPE
Full-time, hybrid
DATE POSTED
March 15, 2025
