Senior Site Reliability Engineer – GCP

🕒 6 days ago


Devsu

51 - 200 employees

🤝 B2B

🏢 Enterprise

☁️ SaaS

Devsu is a technology services company that provides a range of strategic solutions to help clients scale their operations, enhance efficiency, and drive innovation. The company specializes in staff augmentation, dedicated teams, and custom development, leveraging elite tech talent to meet the specific needs of each project. Devsu offers services in quality assurance, cloud engineering, AI prototyping, and data analytics, among others, ensuring high-quality and scalable solutions. By acting as an integrated partner, Devsu helps businesses achieve faster development cycles and better project outcomes with timezone-aligned teams. With a focus on cutting-edge technology and streamlined processes, Devsu empowers businesses to transform their vision into reality.

📋 Description

We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability

Site Reliability Engineering
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)
These responsibilities are activated as needed, not part of day-to-day operations.
- Provide L2/L3 application support coverage during:
  - Support team resource shortages
  - High-severity incidents (SEVs)
  - Peak support periods or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

🎯 Requirements

- Strong experience as a **Site Reliability Engineer or Reliability Engineer**
- Deep hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating **Kubernetes** environments
- Experience supporting systems in **GCP** and on-prem environments (mandatory)
- Strong **Linux** systems and troubleshooting skills
- Fluent **English** (written and spoken)
- Ability to work in the **PST time zone**
- Ability to participate in an **on-call rotation** that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.

Technology Stack:
- Observability: Grafana, Prometheus, logging platforms
- Containers: Kubernetes (GKE and on-prem)
- Cloud: Google Cloud Platform (GCP)
- Operations: Linux, networking, infrastructure monitoring
- Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have:
- Experience supporting application teams during SEV incidents
- Knowledge of capacity planning and performance tuning
- Scripting skills (Python, Bash, etc.)
- Experience with hybrid infrastructure environments

🏖️ Benefits

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you’ll enjoy:
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.
