Digital Experience Monitoring • Observability • User Experience Observability • Network Observability • Application Observability
201 - 500
November 1
AWS
Azure
Backbone
Bash
Cloud
DNS
ElasticSearch
Google Cloud Platform
Grafana
Jenkins
Oracle
Prometheus
Python
Splunk
Terraform
Go
Who monitors the monitoring system? A Site Reliability Engineer at Catchpoint is responsible for supporting the systems that run Catchpoint’s global monitoring platform. In this role, you will work directly with operations and development teams to build and maintain automation and monitoring, ensuring that Catchpoint has a scalable and highly reliable system for our customers. The role requires an operational mindset and a love of solving problems on a global scale with solutions that maintain high reliability and availability. You’ll explore and make sense of systems telemetry, logs, passive monitoring, and our own synthetic monitors to create automation that controls, rolls out, and maintains our platform. This position reports to an SRE manager.

Responsibilities:
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health. Establish performance baselines, and define actions and automation that correlate data from multiple sources.
• Design, build, and maintain the logging and telemetry systems used to manage all services.
• Design, code, test, and deliver software to automate manual operational work.
• Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents.
• Identify application patterns and analytics in support of better service level objectives.
• Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and on physical systems around the world.
• Be part of an on-call rotation to support production systems.
• Strong experience with or knowledge of administering application servers, web servers, and databases.
• Familiarity with automation and configuration management tools (preferably Terraform).
• Good networking knowledge and experience with Internet architecture (BGP, peering, DNS).
• 2+ years of incident resolution experience in a large-scale operations environment.
• Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
• 3+ years of experience with Python, Bash, PowerShell, C, etc.
• Virtualization experience required.
• BS degree in Computer Science or a related technical field involving coding, or equivalent practical experience.
• Appreciation of the value of diversity of opinions.