Catchpoint

Digital Experience Monitoring • Observability • User Experience Observability • Network Observability • Application Observability

201-500 employees

Site Reliability Engineer

November 1

🇹🇷 Turkey – Remote

⏰ Full Time

🟢 Junior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Backbone

Bash

Cloud

DNS

ElasticSearch

Google Cloud Platform

Grafana

Jenkins

Oracle

Prometheus

Python

Splunk

Terraform

Description

Who monitors the monitoring system? A Site Reliability Engineer at Catchpoint is responsible for supporting the systems that run Catchpoint’s global monitoring platform.

In this role, you will work directly with operations and development teams to build and maintain automation and monitoring, ensuring that Catchpoint has a scalable and highly reliable system for our customers. The role requires an operational mindset and a love of solving problems on a global scale with solutions that maintain high reliability and availability. You’ll explore and make sense of systems telemetry, logs, passive monitoring, and our own synthetic monitors to create automation that controls, rolls out, and maintains our platform.

This position reports to an SRE manager.

Responsibilities:

• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health. Establish performance baselines, and define actions and automation that correlate data from multiple sources.
• Design, build, and maintain the logging and telemetry systems used to manage all services.
• Design, code, test, and deliver software to automate manual operational work.
• Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents.
• Identify application patterns and analytics in support of better service level objectives.
• Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and on physical systems around the world.
• Be part of an on-call rotation to support production systems.

Requirements

• Strong experience with, or knowledge of, administering application servers, web servers, and databases.
• Familiarity with automation and configuration management tools (preferably Terraform).
• Good networking knowledge and experience with Internet architecture (BGP, peering, DNS).
• 2+ years of incident resolution experience in a large-scale operations environment.
• Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
• 3+ years of experience with Python, Bash, PowerShell, C, etc.
• Virtualization experience required.
• BS degree in Computer Science or a related technical field involving coding, or equivalent practical experience.
• Appreciation of the value of diversity of opinions.
