Job details

Site Reliability Engineer - Observability

Job Description

This would not be possible without a state-of-the-art core platform as the backbone, managed by the dedicated SME teams in our Core Group. One such team is the Performance and Observability Team, which, as the name implies, has two sub-tracks.

We are looking to expand our team in the Observability sub-track. The position is ideally suited to experienced candidates from Software Engineering / SRE / DevOps backgrounds with a deep focus on the observability stack and best practices.

Our team currently has four members, each having a diverse background and perspective. Together we own an observability stack that handles billions of metrics, traces and log entries each month. Our customers are all the engineers at Wolt who use this stack to understand the health of their services / infrastructure at scale.

As of today, we manage a complex observability stack, covering a wide scope from application instrumentation and telemetry data collection to visualization and alerting, spanning both backend and client-facing applications. Our daily responsibilities ensure that this ecosystem operates seamlessly. In parallel, we're building the next-generation observability platform, re-architecting our stack and pipelines in collaboration with our counterparts at DoorDash. This partnership provides an unparalleled opportunity to drive high-impact initiatives across the observability domain, offering empowerment and involvement in cutting-edge projects.

Qualifications

What you’ll do:

Be responsible for building and improving our observability platform and tooling, used by all Wolt engineers.
Contribute to initiatives focused on architecting, building, and maintaining our observability stack to efficiently handle increasing telemetry data with greater reliability.
Champion observability best practices, guiding and supporting other Woltians in this space.
Take ownership of key initiatives to improve the quality, efficiency, and reliability of our observability stack.
Apply your expertise in SRE culture and practices to ensure observability has a meaningful impact on our business.
Participate in the on-call rotation to address incidents and outages, resolving reliability issues efficiently.
Help standardize observability resources by building tools and documentation that enhance productivity and developer experience.
Triage and resolve production issues within the observability scope.
Contribute to open-source efforts by sharing some of our internal tools with the broader community.

Qualifications:

Proven experience in Software Engineering, SRE, or a similar role with a focus on observability, reliability, and scaling large systems.
You have experience with OpenTelemetry, which is a key foundation for much of the infrastructure and tooling the team is converging on as part of our future observability strategy.
Strong foundation in computer science principles and engineering fundamentals.
Proficient in development, particularly in Go (preferred) or Python, with experience building automation tools and software for large-scale, distributed systems.
Hands-on experience with observability tooling such as Datadog, Prometheus, Mimir, Elasticsearch, Grafana, Jaeger, and other tracing systems.
Expertise in cloud platforms like AWS, GCP, or Azure, with experience managing cloud infrastructure using Kubernetes and containers (Docker).
Deep knowledge of building and maintaining reliable, high-performance, and scalable distributed systems.
Solid understanding of SRE principles, incident response, and designing fault-tolerant architectures.
Experience with infrastructure-as-code tools like Terraform or Ansible for managing cloud environments.
Familiarity with CI/CD pipelines, automated testing, and continuous delivery practices.
Strong analytical and problem-solving skills, with experience troubleshooting complex distributed systems.
Excellent communication and collaboration skills, with the ability to work cross-functionally to enhance platform observability and reliability.
Experience working directly with development teams, with a willingness to dive into application code for observability-related topics, even when unfamiliar with the application code.
Solid experience with Docker and Kubernetes, coupled with a strong foundation in Unix systems and networking concepts.
Open to feedback, recognizing that no one is perfect—including us. We see feedback as an opportunity to learn and grow together.

Nice to Haves:

You have experience handling data and running monitoring infrastructure at scale, such as managing petabyte-scale Elasticsearch clusters or similar databases.
You have experience operating distributed event streaming platforms at scale, e.g. Apache Kafka.
Open-source contributions in observability, cloud, or platform engineering are a strong plus.

Additional Information

📍This role can be based in one of our tech hubs in Helsinki, Berlin or Stockholm, or you can work remotely anywhere in Finland, Sweden, Germany, Denmark, and Estonia. If you live outside of these countries, not to worry! We provide relocation support to help you make your way to Finland, Germany or Sweden.

The position will be filled as soon as we find the right people, so feel free to apply as soon as you feel like hearing more about the position and potentially joining Wolt & DoorDash!

Wolt Glassdoor Company Review: 3.7

Wolt DE&I Review: 3.6

CEO of Wolt: Miki Kuusi

Average salary estimate: $75,000 / year (est.)

Min: $60,000

Max: $90,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer - Observability, Wolt

If you're an experienced Site Reliability Engineer with a passion for observability, come join the team at Wolt! Our Performance and Observability Team is looking for people to help manage and enhance our observability platform. In this role, you'll work on an observability stack that handles billions of metrics, traces, and log entries each month, ensuring our engineers have the data they need to keep services running smoothly. You'll take ownership of initiatives that shape the architectural direction of our observability system, collaborating closely with counterparts at DoorDash. Your expertise will help champion observability best practices and support everyone at Wolt in improving their services. As a Site Reliability Engineer focused on observability, you'll not only enhance existing tools but also have the opportunity to contribute to open-source projects and resolve incidents through the on-call rotation. With a collaborative environment that values feedback and personal growth, this role suits individuals eager to apply deep technical skills in SRE, cloud computing, and distributed systems. The role can be based in Helsinki, Berlin or Stockholm, or done remotely from Finland, Sweden, Germany, Denmark, or Estonia. We can't wait for you to bring your skills and perspectives to Wolt!

Frequently Asked Questions (FAQs) for Site Reliability Engineer - Observability Role at Wolt

What are the responsibilities of a Site Reliability Engineer - Observability at Wolt?

As a Site Reliability Engineer focusing on Observability at Wolt, your primary responsibilities will include enhancing our observability platform and tools, ensuring they meet the growing needs of our users. You'll work on architecting, building, and maintaining our observability stack, focusing on reliability and efficiency. Additionally, you'll be responsible for championing observability best practices, participating in on-call rotations for incident resolution, and contributing to open-source projects.