Catchpoint

Digital Experience Monitoring • Observability • User Experience Observability • Network Observability • Application Observability

201 - 500

Site Reliability Engineer

November 1

🇮🇳 India – Remote

⏰ Full Time

🟢 Junior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Backbone

Bash

Cloud

DNS

ElasticSearch

Google Cloud Platform

Grafana

Jenkins

Oracle

Prometheus

Python

Splunk

Terraform

Description

• Who monitors the monitoring system? A Site Reliability Engineer at Catchpoint is responsible for supporting the systems that run Catchpoint’s global monitoring platform. In this role, you will work directly with operations and development teams to build and automate infrastructure-as-code (IaC) deployment at scale, then monitor it to ensure Catchpoint has a scalable and highly reliable system for our customers.
• What will success look like in this position? The role requires an operational mindset and a love of solving problems on a global scale with solutions that ensure high reliability and availability. You’ll explore and make sense of systems telemetry, logs, and passive monitoring, and use our own synthetic monitors to create automation that controls, rolls out, and maintains our platform.
• Responsibilities include:
  • Defining and refining the whole service lifecycle.
  • Measuring and monitoring availability, latency, and overall system health.
  • Designing logging and telemetry systems.
  • Automating manual operational work.
  • Troubleshooting priority incidents.
  • Identifying application patterns for better service objectives.
  • Supporting production systems on an on-call rotation.

Requirements

• Strong experience with, or knowledge of, administering application servers, web servers, and databases.
• Familiarity with infrastructure automation, configuration management, and CI/CD tools (preferably Terraform).
• Experience with multiple cloud platforms (AWS, GCP, Azure).
• Good networking knowledge and experience with Internet architecture (BGP, peering, DNS).
• 2+ years of incident resolution experience in a large-scale operations environment.
• Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
• 3+ years of programming experience with Python, Bash, PowerShell, C, etc.
• Virtualization experience required.
• BS degree in Computer Science or a related technical field involving coding, or equivalent practical experience.
• Appreciation of the value of diversity of opinions.