Tech Lead, Site Reliability Engineering (SRE)

At Edge & Node, we’re focused on building The Graph, a decentralized protocol for accessing and organizing the world’s knowledge and information. Subgraphs, a core technology developed by Edge & Node to access blockchain data, are widely used across web3 to power decentralized applications.

We’re a tight-knit, efficient team with a bias for action and a strong sense of ownership. Our teams have autonomy, low ego, and are trusted to drive projects end to end. We care deeply about building infrastructure for web3 use cases and collaborate across disciplines to make that happen. If you’re passionate about infrastructure that has a real impact on our users, enjoy solving hard problems, and thrive in a fast-paced environment, you’ll feel right at home.

The Engineering Operations team, including Site Reliability, works closely with engineering teams across Edge & Node to ensure the services we operate are reliable, performant, predictable, and secure. We are on a mission to take our service delivery to the next level.


What You'll Do

  • Lead by example as a hands-on technical contributor, participating in on-call rotations, incident response, and the day-to-day work of the SRE team

  • Partner with engineering and product leadership to shape roadmaps, define team priorities, and plan work that improves reliability, performance, and scalability across the stack

  • Team with and support other SREs, leveraging your leadership and soft skills to foster a culture of continuous learning, blameless retrospectives, and technical excellence

  • Own the incident lifecycle, including root cause analysis and follow-up remediation, and work to make our systems increasingly self-healing

  • Drive SRE team strategy, advocating for industry best practices, standardization, and secure and optimized infrastructure

  • Architect and improve core infrastructure services, with an eye toward high availability, fault tolerance, performance, and end-to-end observability

  • Work across teams to challenge assumptions, fundamentally overhaul our systems, and improve documentation

  • Collaborate with external partners and vendors as needed to ensure the health of critical services

What We’re Looking For

  • Proven experience as a senior or lead SRE or DevOps engineer, ideally having led large-scale reliability initiatives or infrastructure transformation projects

  • Strong project or technical leadership skills, with a track record of guiding teammates and setting technical direction while still remaining hands-on

  • Deep knowledge of the SRE/DevOps domain, including incident response, security awareness, maintaining SLAs and uptime guarantees, observability, supporting internal development teams, project and capacity planning, and/or system architecture

  • Experience with both cloud and on-prem core infrastructure, ideally with Google Cloud Platform (GCP), bare-metal infrastructure, and Kubernetes (or similar orchestration tools)

  • Fluency in infrastructure as code, Terraform, automation tooling, CI/CD pipelines, and system monitoring solutions such as Grafana

  • Excellent interpersonal, leadership, and communication skills, with the ability to align stakeholders and motivate and unblock team members

  • Experience in web3, crypto, or blockchain is a plus (but not required)

About The Graph

The Graph is the indexing and query layer of the decentralized internet. As the first open data marketplace to introduce and standardize subgraphs, The Graph is a flagship solution for accessing blockchain data across web3.

Since The Graph launched in 2018, tens of thousands of developers have built subgraphs to power dapps across 90+ blockchains. As demand for web3 data grows, The Graph is evolving to support a broader range of data services and query languages, expanding what’s possible with decentralized infrastructure, now and in the future.

Discover more about how The Graph is shaping the future of decentralized physical infrastructure networks (DePIN) by following The Graph on X, LinkedIn, Instagram, Facebook, Reddit, and Medium. Join the community on The Graph’s Telegram, and join technical discussions on The Graph’s Discord.

Average salary estimate

$150,000 / year (est.)
Min: $120,000
Max: $180,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Tech Lead, Site Reliability Engineering (SRE), Edge & Node

As a Tech Lead for Site Reliability Engineering (SRE) at Edge & Node, you will join a team focused on building The Graph, a decentralized protocol for accessing and organizing the world's knowledge and information. Your expertise will be crucial in strengthening the infrastructure that supports web3 applications. In this hands-on role, you'll lead by example in incident response and on-call rotations while driving reliability efforts and technical excellence across our organization. You will help shape roadmaps alongside engineering and product leadership so that we can deliver reliable and performant services. You will support other SREs with your technical insight and champion a culture of learning and continuous improvement, where blameless retrospectives help us grow stronger. Our SREs architect for high availability, fault tolerance, and observability, and strive to make our systems self-healing. We’re looking for someone who is knowledgeable in SRE domains such as incident response and system architecture, adept at project leadership, and skilled with tools like Terraform and Kubernetes. If you’re excited about making a real impact and fostering collaboration across teams, you’ll be a strong fit at Edge & Node.

Frequently Asked Questions (FAQs) for Tech Lead, Site Reliability Engineering (SRE) Role at Edge & Node
What are the primary responsibilities of a Tech Lead, Site Reliability Engineering (SRE) at Edge & Node?

As a Tech Lead, Site Reliability Engineering (SRE) at Edge & Node, your primary responsibilities include leading the SRE team in managing incident response, driving service reliability and performance improvements, and architecting core infrastructure with an emphasis on fault tolerance. You're expected to collaborate closely with engineering teams, influence technical decisions, and support other SREs in their professional development.

What qualifications are required for the Tech Lead, Site Reliability Engineering (SRE) position at Edge & Node?

To qualify for the Tech Lead, Site Reliability Engineering (SRE) position at Edge & Node, candidates should have proven experience as a senior SRE or DevOps engineer, along with strong technical leadership skills. Knowledge of the SRE/DevOps domain, cloud infrastructure (especially Google Cloud Platform), and observability tools, together with a solid understanding of infrastructure as code, is essential. Experience in web3 or blockchain technologies is beneficial but not mandatory.

What does a typical day look like for a Tech Lead, Site Reliability Engineering (SRE) at Edge & Node?

A typical day for a Tech Lead, Site Reliability Engineering (SRE) at Edge & Node consists of a mix of hands-on technical contributions, daily stand-ups with the SRE team, incident management, and strategic planning with product leadership. You’ll also focus on mentoring your team, driving initiatives for improving service delivery, and engaging in cross-team collaboration to challenge existing assumptions and improve infrastructure.

How does the Tech Lead, Site Reliability Engineering (SRE) role contribute to Edge & Node's mission?

The Tech Lead, Site Reliability Engineering (SRE) at Edge & Node plays a vital role in ensuring the reliability and performance of our decentralized protocols and applications. By leading teams to implement best practices and architect resilient infrastructure, you directly contribute to Edge & Node's mission of building The Graph, which is essential for navigating the future of decentralized knowledge and information access.

Is experience in web3 or blockchain necessary for the Tech Lead, Site Reliability Engineering (SRE) role at Edge & Node?

While experience in web3 or blockchain is a plus for the Tech Lead, Site Reliability Engineering (SRE) role at Edge & Node, it is not strictly necessary. What’s most important is proven experience in SRE or DevOps practices, technical leadership skills, and expertise in relevant tools and technologies. However, an interest in the decentralized ecosystem will certainly enhance your impact in this role.

Common Interview Questions for Tech Lead, Site Reliability Engineering (SRE)
Can you explain what Site Reliability Engineering (SRE) means and how it differs from DevOps?

Site Reliability Engineering (SRE) combines software engineering principles with IT operations to create scalable and reliable software systems. Unlike DevOps, which focuses more broadly on collaboration between development and operations teams, SRE emphasizes quantifiable reliability targets, such as service level objectives (SLOs), SLAs, and error budgets, with the aim of improving both the performance and the reliability of services.
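
The error budgets mentioned in the answer above come down to simple arithmetic, which a short sketch can make concrete. This is illustrative only: the 99.9% availability target and the 30-day window are assumed values rather than figures from the posting, and `error_budget_minutes` and `budget_remaining` are hypothetical helper names.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) allowed by an availability target.

    `target` is a fraction such as 0.999 (assumed example value).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


def budget_remaining(target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Assumed example: 99.9% availability over a 30-day window.
    print(round(error_budget_minutes(0.999), 1))    # about 43.2 minutes
    print(round(budget_remaining(0.999, 21.6), 2))  # about half the budget left
```

A team might use the remaining fraction to decide whether to keep shipping changes or to pause and prioritize reliability work; where that threshold sits is a policy choice, not something the arithmetic dictates.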

What strategies would you use to improve system reliability and availability?

To improve system reliability and availability, I would start by implementing comprehensive monitoring to track key performance indicators. Establishing automated incident response procedures, conducting post-mortems that feed into self-healing systems, and defining clear SLAs all improve how the team communicates about reliability. Additionally, regularly reviewing the architecture and adding redundancy can mitigate single points of failure.

How do you handle incidents and what steps do you take for root cause analysis?

In handling incidents, my approach begins with clear communication to stakeholders, followed by triaging the issue to restore service as quickly as possible. After resolution, I conduct a thorough root cause analysis to identify the underlying factors, document the findings, and determine corrective actions to prevent recurrence. A blameless retrospective then helps the team learn from the incident and become more resilient.

What tools and technologies are you familiar with in the SRE domain?

I have extensive experience with tools such as Terraform for infrastructure as code, Kubernetes for orchestration, and various CI/CD tools for automation. Additionally, I am skilled in monitoring and alerting solutions like Grafana and Prometheus, which aid in observability and incident response. Mastery of cloud services, particularly Google Cloud Platform, is also part of my technical toolkit.

Describe a successful project you've led that improved system performance.

In a previous role, I led a project that involved re-architecting our core microservices for improved load balancing and failover capabilities. By leveraging containerization and optimizing deployment processes, we were able to reduce downtime by 40% and increase system responsiveness, resulting in enhanced user experience and satisfaction.

How would you encourage a culture of continuous learning within the SRE team at Edge & Node?

Encouraging a culture of continuous learning within the SRE team at Edge & Node involves creating an environment that supports knowledge sharing, mentorship, and frequent retrospectives. Initiatives like lunch-and-learns, encouraging team members to pursue professional certifications, and fostering discussions around emerging technologies will empower the SRE team to grow together.

What is your experience with cloud infrastructure and which platforms do you prefer?

I have significant experience with various cloud platforms, particularly Google Cloud Platform, where I've managed large-scale deployments and utilized services such as Kubernetes Engine and Cloud Monitoring. My preference for GCP is due to its robust features and integrations, but I'm also adept with AWS and Azure as needed for specific project requirements.

How do you stay updated with industry trends and best practices in SRE?

I stay updated with industry trends and best practices in SRE by actively following thought leaders in the field on platforms like LinkedIn and Twitter, participating in webinars, and attending conferences. Additionally, I am part of online communities and forums where I can engage in discussions and learn from peers about the latest in site reliability practices.

What measures would you take to ensure security in a site reliability context?

To ensure security in a site reliability context, I would integrate security practices into our development life cycle, ensuring regular vulnerability assessments and applying security patches promptly. Building a culture of security awareness within the SRE team and collaborating with security specialists promote a holistic approach to safeguarding our infrastructure.

Can you describe a challenging problem you solved in a previous SRE role?

In my last role, we faced a significant latency issue that threatened our service uptime. By conducting performance profiling, we identified suboptimal database queries as the culprit. I spearheaded a series of optimizations that involved indexing and restructuring queries, ultimately reducing response times by 60% and improving overall system performance.

Employment type: Full-time, remote
Date posted: April 2, 2025
