
DevOps Engineer

Atria is a membership-based preventive health care practice delivering cutting-edge primary and specialty care from the comfort of your home, at our practices in Palm Beach and New York, or wherever you are in the world.

We bring together a multidisciplinary team of renowned, in-house physicians to provide proactive, preventive, and precision-based care for Atria members and their families. We aim to optimize the lifespan and healthspan of all our members through meticulous screening and tailored interventions to prevent, reverse, or manage all major chronic diseases.

Each member’s care is led by a dedicated Chief Medical Officer who collaborates on the member’s behalf with specialists in cardiology, neurology, pediatrics, gynecology, endocrinology, performance and movement, and more. Our exceptional clinicians also work closely with the 60+ members of the Atria Academy of Science & Medicine, top experts in their respective fields who are available for rapid consults, support, and referrals.

We are seeking a proactive and experienced DevOps Engineer to join our dynamic team. The ideal candidate will have in-depth experience with infrastructure-as-code tools like Terraform, cloud infrastructure management on Google Cloud Platform (GCP), and expertise in observability, including integration with monitoring and alerting tools like Sentry and Slack. This role is essential for ensuring that our systems are performant, scalable, and reliable, supporting seamless deployments and robust infrastructure management.

Key Responsibilities

Infrastructure Management: Design, deploy, and maintain our infrastructure on Google Cloud Platform (GCP) using Terraform to build reliable, secure, and scalable cloud environments.
Environment Management: Oversee development and test environments, ensuring consistent setup, data population, and availability for engineering teams. Manage synthetic test data to support safe and accurate testing processes.
Observability & Monitoring: Implement and manage observability practices using Sentry and other monitoring tools. Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and respond when they are at risk or breached, to maintain optimal system performance.
Alert Integration: Configure and integrate alerting with Slack to provide real-time notifications for performance metrics, system errors, and other critical incidents.
Metrics & Dashboards: Build and maintain dashboards to monitor key system metrics, including CPU, memory, and network usage, providing insight into infrastructure and application performance.
Deployment Automation: Facilitate and optimize deployment processes using CI/CD tools, working closely with developers to support a smooth release pipeline.
Feature Flag Management: Administer feature flag systems to support controlled rollouts and testing in production, collaborating with developers to manage feature toggles effectively.
Security and Compliance: Ensure the security and compliance of systems, with a focus on HIPAA and other health information standards.
Data Management & Pipelines: Manage data pipelines and tools, including Snowflake, to support scalable data ingestion, transformation, and analytics, facilitating both operational and business intelligence needs.
Business Continuity & Disaster Recovery: Develop and maintain business continuity and disaster recovery plans to ensure service resilience, implementing backup strategies and recovery testing.
Self-Service & Tooling: Develop tools, practices, and platforms to enable self-service for engineering teams, allowing them to manage infrastructure needs independently where possible. Ensure infrastructure-as-code (IaC) principles are applied throughout.
Cross-functional Collaboration: Partner with engineering, product, and support teams to ensure the infrastructure aligns with system performance goals and application needs.
Documentation: Develop and maintain comprehensive documentation for infrastructure and processes.

Qualifications

Proven Experience: 5+ years of experience as a DevOps Engineer or in a similar role.
Cloud Infrastructure & IaC Skills: Proficient with Terraform and Google Cloud Platform (GCP) for infrastructure management, with a solid understanding of infrastructure-as-code best practices.
Environment Management: Proven experience managing development and test environments, including data setup and synthetic test data for safe testing practices.
Observability & Monitoring Expertise: Hands-on experience with Sentry for application performance monitoring and alert setup; strong understanding of metrics collection for system health and performance.
Alerting & Communication Integration: Demonstrated experience in integrating alerts with Slack for streamlined, real-time notifications of SLO and performance metrics.
Performance Metrics: Strong experience setting up dashboards to visualize system performance data and monitor metrics (CPU, memory usage, etc.).
Deployment Automation: Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins) to streamline deployment processes.
Feature Flag Management: Experience managing feature flags in production (e.g., LaunchDarkly or Flagsmith) to enable gradual rollouts and A/B testing.
Data Management: Familiarity with data tools like Snowflake and experience managing data pipelines in support of scalable, data-driven initiatives are a plus.
Self-Service Enablement: Ability to create tools and practices that enable engineering teams to be self-sufficient in their infrastructure needs and contribute to IaC practices.
Problem-Solving Skills: Analytical and proactive approach to troubleshooting, with a track record of resolving complex issues and optimizing systems.
Preferred Experience: Experience with additional observability tools like Prometheus, Grafana, or Datadog; familiarity with scripting languages like Python or Go.
Healthcare Knowledge: Experience in the healthcare industry is a plus, but not required.
Security and Compliance: Knowledge of best practices in security for cloud environments, including data encryption. Experience working within compliance frameworks (e.g., HIPAA, SOC 2) and a commitment to data privacy and security.
Business Continuity & Disaster Recovery: Experience developing business continuity and disaster recovery strategies to ensure system resilience.
Communication: Excellent verbal and written communication skills, with the ability to work effectively in a cross-functional team environment.
English Fluency: The majority of our business operations and communication are conducted in English (written and verbal).

Bonus Points

Cloud Platform Knowledge: Familiarity with AWS or Azure in addition to GCP.
Programming Skills: Proficiency in scripting or programming languages for automation tasks (e.g., Python, Go).
Observability Tools: Experience with additional observability tools like Prometheus, Grafana, or Datadog.