
Site Reliability Engineer (SRE)

 

Empowering every employee.

Our mission is to be the world's most used employee app by changing the way frontline employees work.

At Flip, we have a clear goal: to revolutionize the world for frontline workers and give them a voice. Become a Flip Game Changer and work with an unbeatable team to ensure that all employees, no matter where they work, have access to their company's internal information. If you're ready to make an impact and shape the work lives of millions of people, then you've come to the right place!

 

Your tasks:

You...

  • ensure the availability, performance and scalability of our infrastructure
  • promote practices such as CI/CD, observability and developer experience within our organization
  • shape our goals with us, such as scalable systems, observability and much more
  • enable scaling - further expand our cloud infrastructure and our Kubernetes cluster
  • ensure resilience & safety - among other things through zero-downtime rollouts and rollback mechanisms
  • create observability - by further developing our LGTM (Loki, Grafana, Tempo, Mimir) stack, you optimize our SLOs
  • design, develop and optimize infrastructure as code with Pulumi in Go

 

Qualifications

You...

  • have experience in operating and scaling cloud infrastructures (Azure, AWS, GCP)
  • have deep knowledge of Kubernetes and container solutions
  • are interested in observability (e.g. Prometheus, VictoriaMetrics, Mimir, Loki, ELK) and are familiar with terms such as SLO, error budget and Apdex
  • have good knowledge of software development (e.g. Go, Python, Kotlin)
  • are business-fluent in English; German is a plus
  • have experience with infrastructure as code (e.g. Pulumi, OpenTofu) and automation tools (e.g. Ansible, Chef)

What we offer you

  • Home Office Friendly: Decide for yourself where you want to work every day.
  • Modern city offices in Stuttgart and Berlin or 100% remote: Both are possible with us, because we at Flip have a hybrid work model.
  • Work-Life-Balance: We don't want you to put down roots at your desk chair. That's why we cover the costs of your E-Gym-Wellpass membership and offer job bike leasing.
  • Celebrating success: Expect highly motivated and committed people in a relaxed working atmosphere.
  • Be part of something bigger: You actively shape Flip in your role. Along the way, you help drive the rapid growth of a young tech company and grow towards your own goals. Fun is guaranteed.
  • Happy to be a Flipster: Stay tuned for regular team events and culture days that bring us together as Flipsters.
  • Working abroad: At Flip you can also work abroad in the European Union. Let's talk about remote work in the interview.

 

At Flip, everyone is welcome - no matter what gender you identify as or how old you are. Sexual identity, origin, religion, world view and disabilities do not influence your potential job at Flip. The most important thing is that YOU fit in!

 

Average salary estimate

$75,000 / year (est.)
Min: $60,000
Max: $90,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer (SRE), Flip GmbH

As a Site Reliability Engineer (SRE) at Flip, you'll be joining a dynamic and innovative team dedicated to transforming the workplace for frontline employees. Our mission is to create the world's most utilized employee app, and that requires a robust and reliable infrastructure that you will help maintain. In this remote role, especially within Europe, you will ensure the availability, performance, and scalability of our cloud environments. Your expertise with Kubernetes and observability tools will be essential as you enable further expansion of our infrastructure and contribute to zero-downtime rollouts. With your knowledge of infrastructure as code using Pulumi in Go, you'll help us reach new heights in our goals for scalable systems and optimized service-level objectives. We believe that a good work-life balance is crucial, which is why we offer flexibility in working arrangements that allow you to work from anywhere, including the option for hybrid setups in our Stuttgart and Berlin offices. You’ll not only contribute to a pioneering tech company but also partake in a culture where every team member's voice is valued. If you're passionate about making a difference and eager to grow your skills, then Flip is the perfect place for you. So, if you're ready to embark on a rewarding journey that promises fun and fulfillment, join us and become a Flip Game Changer!

Frequently Asked Questions (FAQs) for Site Reliability Engineer (SRE) Role at Flip GmbH
What are the main responsibilities of a Site Reliability Engineer at Flip?

As a Site Reliability Engineer at Flip, your primary responsibilities will include ensuring the availability and performance of our infrastructure, promoting CI/CD practices, and optimizing our cloud systems. You'll play a key role in scaling our cloud infrastructure, managing our Kubernetes cluster, and enhancing our observability tools. Engaging with technologies like Pulumi and creating infrastructure as code will also be part of your day-to-day tasks, all while collaborating with a motivated team dedicated to reforming the way frontline employees experience their work.

What qualifications do I need to apply for the Site Reliability Engineer position at Flip?

To be a strong candidate for the Site Reliability Engineer role at Flip, you should have experience in managing cloud infrastructures such as AWS, Azure, or GCP, along with deep knowledge of Kubernetes and container solutions. Familiarity with observability frameworks and software development skills in languages like Go, Python, or Kotlin are also essential. If you have experience with automation tools and infrastructure as code using Pulumi or similar technologies, it will be a significant advantage.

What is the work culture like for a Site Reliability Engineer at Flip?

The work culture at Flip for a Site Reliability Engineer emphasizes collaboration, fun, and flexibility. You will find a relaxed atmosphere with highly motivated colleagues who are committed to making a difference. We prioritize work-life balance and ensure that our employees are happy and engaged in their roles. Regular team events and culture days foster camaraderie while inspiring personal and professional growth, making Flip not just a workplace but a community.

Is the Site Reliability Engineer position at Flip fully remote?

Yes, the Site Reliability Engineer position at Flip offers the flexibility of a fully remote setup, particularly for candidates located within Europe. You have the option to work from home, choose a hybrid style in our modern offices in Stuttgart or Berlin, or even work abroad within the European Union. This autonomy helps our team members maintain a balance between their professional and personal lives.

What tools and technologies will a Site Reliability Engineer use at Flip?

As a Site Reliability Engineer at Flip, you will engage with a variety of cutting-edge tools and technologies. Your primary focus will be on cloud platforms like AWS, Azure, or GCP, as well as container orchestration solutions like Kubernetes. Additionally, you will work with observability tools such as Grafana, Prometheus, and Loki to create and maintain system visibility. Familiarity with infrastructure as code using Pulumi, automation tools like Ansible or Chef, and modern programming languages like Go will also be crucial to your success in this role.

Common Interview Questions for Site Reliability Engineer (SRE)
How did you ensure system reliability in your previous projects as a Site Reliability Engineer?

To ensure system reliability in my previous projects, I focused on several key areas: implementing rigorous monitoring and alerting systems, conducting regular chaos engineering exercises, and emphasizing automation for deployment processes. By continuously measuring service-level objectives and adjusting our approach based on observed performance, I was able to effectively increase system resilience.

Can you describe your experience with Kubernetes in the context of site reliability engineering?

In my experience with Kubernetes, I've managed container orchestration, optimized resource allocation, and ensured high availability of services. I've used Helm for package management and implemented monitoring solutions for clusters. My role involved troubleshooting issues and performing upgrades, always with minimal downtime to maintain reliability and performance.

What methods do you use for continuous integration and continuous deployment (CI/CD)?

I utilize Jenkins and GitLab CI/CD for managing CI/CD pipelines, which automate the building, testing, and deployment processes. Integrating unit tests and performance tests in the pipeline ensures that only well-tested code gets promoted to production. Additionally, I advocate for blue-green deployments and canary releases to mitigate risks during production updates.

How do you handle a situation where a service goes down?

When a service goes down, my immediate priority is to assess the impact and initiate an incident response plan. This includes identifying the root cause through logs and metrics, communicating with stakeholders about the issue, and working collaboratively with the team to implement a fix as quickly as possible. Post-incident, I conduct a blameless retrospective to understand what went wrong and refine our processes to prevent recurrence.

What is your experience with infrastructure as code (IaC)?

I have extensive experience with infrastructure as code, primarily using Pulumi and Terraform to manage and provision cloud resources. This practice allows for more consistent environments, version control, and easier replication of infrastructure. I focus on modularizing code, enabling easy updates, and promoting best practices across teams to enhance collaboration and efficiency.

How do you approach monitoring and observability in your projects?

I advocate for creating an observability culture within teams by integrating logging, metrics, and tracing into our applications. Using tools like Prometheus for metrics collection and Grafana for visualization enables us to monitor system performance effectively and quickly diagnose issues. Setting up alerts based on error budgets or SLO targets helps maintain a resilient system.

What programming languages are you proficient in, and how have you used them in SRE?

I am proficient in several programming languages, including Go and Python. I’ve used Go to develop microservices that are both scalable and efficient, while Python has been instrumental in scripting automation tasks and leveraging APIs for monitoring tools. This combination allows me to build robust applications that align with SRE principles.

How do you prioritize tasks when managing multiple incidents?

When managing multiple incidents, I prioritize tasks based on severity and potential impact on users. Creating a triage system that categorizes issues by urgency helps me address the most critical ones first. Furthermore, clear communication with stakeholders ensures everyone is informed, which aids in managing expectations throughout the incident lifecycle.

What is your understanding of service-level objectives (SLOs)?

Service-level objectives (SLOs) define the target level of reliability for a service, typically expressed as a percentage of uptime or success rate. They are crucial for guiding developers, and as an SRE, I ensure regular assessments of SLOs to align with business objectives, adjusting as necessary based on risks and feedback from incident postmortems.
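
As a rough illustration of the arithmetic behind an availability SLO, the sketch below converts a target percentage into an error budget, meaning the downtime the target still permits. The function names, the 30-day window, and the choice of minutes as the unit are assumptions made for this example; they are not taken from Flip or from any particular tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime in minutes that an availability SLO allows over the window.

    The 30-day default window is an assumption for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - downtime_minutes / budget
```

For example, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, so an incident lasting about 21.6 minutes would use up around half of that budget.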

Can you describe a challenging problem you solved in an SRE capacity?

A challenging problem I encountered was managing traffic spikes during a major event. We implemented auto-scaling in conjunction with a load-balancer to handle sudden increases in user activity. By closely monitoring system metrics, we ensured that resources were allocated dynamically, which ultimately maintained service availability and user satisfaction despite the high load.
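
The answer does not say which autoscaler was used. As one possible illustration, the sketch below applies the proportional scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed metric to its target, then round up. The function name, the parameter names, and the replica bounds are hypothetical, and the sketch leaves out details such as tolerance and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling: ceil(current_replicas * current / target).

    The result is clamped to [min_replicas, max_replicas]. The bounds
    and their defaults are assumptions for this example.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 90% CPU against a 60% target, the rule asks for 6 replicas; once load drops to 30%, it asks for 2.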

Employment type: Full-time, remote
Date posted: December 4, 2024
