December 3
• Monitor and troubleshoot infrastructure and application issues to ensure flawless reliability and availability of Zendesk products.
• Triage and resolve alerts, supporting service incidents by validating alerts, recommending resolutions, or implementing temporary fixes.
• Actively participate in the incident lifecycle, ensuring swift and effective resolution of high-severity service incidents.
• Improve and implement monitoring tools and alerting to boost observability.
• Write code to automate mitigations and improve tools.
• Train and level up junior engineers throughout Zendesk on reliability, observability, alerting, and incident response.
• Availability to work a mid-shift or night-shift schedule, 5 days a week, including 1 weekend.
• Proficiency with at least one of the following: Python, JavaScript, Ruby, or React.
• Extensive experience with AWS, infrastructure, cloud-native software design, backend systems, Kubernetes, and configuration as code.
• Extensive experience with monitoring tools such as Datadog or PagerDuty is a plus.
• 3+ years of experience in a Software Engineering or Site Reliability Engineering role.
• Experience drafting RFCs and similar documents for review by peers across engineering.
• The desire to lead, partner, and collaborate across our engineering organization.