November 12
🇧🇷 Brazil – Remote
⏳ Contract/Temporary
🟡 Mid-level
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)
• Atria is a membership-based preventive health care practice delivering cutting-edge primary and specialty care from the comfort of your home, at our practices in Palm Beach and New York, or wherever you are in the world.
• We bring together a multidisciplinary team of renowned, in-house physicians to provide proactive, preventive, and precision-based care for Atria members and their families.
• We aim to optimize the lifespan and healthspan of all our members through meticulous screening and tailored interventions to prevent, reverse, or manage all major chronic diseases.
• Each member’s care is led by a dedicated Chief Medical Officer who collaborates on your behalf with specialists in cardiology, neurology, pediatrics, gynecology, endocrinology, performance and movement, and more.
• We are seeking a proactive and experienced DevOps Engineer to join our dynamic team. The ideal candidate will have in-depth experience with infrastructure-as-code tools like Terraform, cloud infrastructure management on Google Cloud Platform (GCP), and expertise in observability, including integration with monitoring and alerting tools like Sentry and Slack.
• This role is essential for ensuring that our systems are performant, scalable, and reliable, supporting seamless deployments and robust infrastructure management.
Key Responsibilities:
• Design, deploy, and maintain our infrastructure on Google Cloud Platform (GCP) using Terraform to build reliable, secure, and scalable cloud environments.
• Oversee development and test environments, ensuring consistent setup, data population, and availability for engineering teams. Manage synthetic test data to support safe and accurate testing processes.
• Implement and manage observability practices using Sentry and other monitoring tools. Set up, monitor, and respond to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for optimal system performance.
• Configure and integrate alerting with Slack to provide real-time notifications for performance metrics, system errors, and other critical incidents.
• Build and maintain dashboards to monitor key system metrics, including CPU, memory, and network usage, ensuring insights into infrastructure and application performance.
• Facilitate and optimize deployment processes using CI/CD tools, working closely with developers to support a smooth release pipeline.
• Administer feature flag systems to support controlled rollouts and testing in production, collaborating with developers to manage feature toggles effectively.
• Ensure the security and compliance of systems, with a focus on HIPAA and other health information standards.
• Manage data pipelines and tools, including Snowflake, to support scalable data ingestion, transformation, and analytics, facilitating both operational and business intelligence needs.
• Develop and maintain business continuity and disaster recovery plans to ensure service resilience, implementing backup strategies and recovery testing.
• Develop tools, practices, and platforms to enable self-service for engineering teams, allowing them to manage infrastructure needs independently where possible.
• Partner with engineering, product, and support teams to ensure the infrastructure aligns with system performance goals and application needs.
• Develop and maintain comprehensive documentation for infrastructure and processes.
• Proven Experience: 5+ years of experience as a DevOps Engineer or similar role.
• Cloud Infrastructure & IaC Skills: Proficient with Terraform and Google Cloud Platform (GCP) for infrastructure management, with a solid understanding of infrastructure-as-code best practices.
• Environment Management: Proven experience managing development and test environments, including data setup and synthetic test data for safe testing practices.
• Observability & Monitoring Expertise: Hands-on experience with Sentry for application performance monitoring and alert setup; strong understanding of metrics collection for system health and performance.
• Alerting & Communication Integration: Demonstrated experience in integrating alerts with Slack for streamlined, real-time notifications of SLO and performance metrics.
• Performance Metrics: Strong experience setting up dashboards to visualize system performance data and monitor metrics (CPU, memory usage, etc.).
• Deployment Automation: Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins) to streamline deployment processes.
• Feature Flag Management: Experience managing feature flags in production (e.g., LaunchDarkly or Flagsmith) to enable gradual rollouts and A/B testing.
• Data Management: Familiarity with data tools like Snowflake and experience managing data pipelines is a plus, supporting scalability in data-driven initiatives.
• Self-Service Enablement: Ability to create tools and practices that enable engineering teams to be self-sufficient in their infrastructure needs and contribute to IaC practices.
• Problem-Solving Skills: Analytical and proactive approach to troubleshooting, with a track record of resolving complex issues and optimizing systems.
• Preferred Experience: Experience with additional observability tools like Prometheus, Grafana, or Datadog; familiarity with scripting languages like Python or Go.
• Healthcare Knowledge: Experience in the healthcare industry is a plus, but not required.
• Security and Compliance: Knowledge of best practices in security for cloud environments, including data encryption. Experience working within compliance frameworks (e.g., HIPAA, SOC 2) and a commitment to data privacy and security.
• Business Continuity & Disaster Recovery: Experience developing business continuity and disaster recovery strategies to ensure system resilience.
• Communication: Excellent verbal and written communication skills, with the ability to work effectively in a cross-functional team environment.
• English Fluency: The majority of our business operations and communication are conducted in English (written and verbal).