We are seeking a seasoned Site Reliability Engineering (SRE) Manager to lead a team of SRE staff supporting a Network Automation team in a follow-the-sun support model. The team manages critical applications and infrastructure, both cloud and on-premises, for datacenter deployment and automated operations. This role is pivotal not only in maintaining resilient, scalable systems but also in owning the general architecture of distributed systems, including a new DNS architecture and a distributed source-of-truth sync.

This role goes beyond an understanding of standard best practices and operations. We're combining technical knowledge with development chops to come up with solutions at cloud scale. This is a new team operating in an exciting, groundbreaking environment, and we want you to help us shape it.

What you will be doing:
• Team Leadership: Manage and mentor a team of Automation SREs, fostering a culture of collaboration, innovation, and excellence in execution.
• Technical Guidance: Own technical decisions for the team, ensuring alignment with developers and employing industry-standard methodologies.
• Operational Excellence: Implement and maintain robust operational practices, including incident management, monitoring, alerting, and capacity planning.
• Shift Scheduling: Coordinate follow-the-sun support across global time zones, ensuring 24/7 coverage and efficient handovers.
• Project Management: Lead initiatives related to the design, deployment, and maintenance of critical infrastructure components.
• Release Management: Oversee release processes and ensure smooth deployments, minimizing downtime and impact on users.
• Root Cause Analysis: Conduct thorough post-incident reviews, identifying root causes and implementing preventive measures.

What we need to see:
• 8+ years of experience in the industry, with a focus on Site Reliability Engineering and a strong background in cloud service providers, ISPs, or similar service-oriented networking companies.
• Technical Skills: Proficiency in managing distributed web infrastructures, designing scalable and resilient systems, and implementing network automation.
• Leadership: Proven track record of managing technical teams, including performance management, career development, and hiring (2+ years of management experience).
• Problem Solving: Demonstrated ability to conduct detailed root cause analysis and drive improvements based on findings.
• Communication: Excellent verbal and written communication skills, with experience presenting technical information to diverse audiences.
• Education: Bachelor's degree in Computer Science, Engineering, or a related technical field, or relevant industry experience.

If you are a strategic problem solver with a passion for leading high-performance teams in a dynamic and technically challenging environment, we encourage you to apply. Join us in shaping the future of our distributed systems and network automation infrastructure.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you are creative and autonomous, we want to hear from you!

The base salary range is 164,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.