Xero


Accounting • SaaS • Banking • Invoicing • Design

1001 - 5000

💰 $300M Post-IPO Debt on 2018-09

Senior Site Reliability Engineer - Reliability Enablement

September 17

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🗽 H1B Visa Sponsor

AWS

Azure

Cloud

Distributed Systems

Google Cloud Platform

Java

JavaScript

Python

Terraform



Description

• Investigating operational surprises and supporting teams in post-incident activities
• Conducting in-depth incident analysis and maximizing post-incident learning across the organization
• Completing short-term reliability consultancy and enablement engagements, such as SLO reviews and facilitating pre-mortems
• Improving on-call health, uplifting observability and addressing any operational hotspots
• Identifying, planning and leading implementation of reliability uplift work and initiatives
• Supporting delivery of strategic features and initiatives with reliability and distributed systems expertise
• Observing and improving rituals and practices relating to production operations, incident response and incident learning

Requirements

• Solid experience in logging, monitoring and observability of a highly distributed system
• Experience leading incident management, response and troubleshooting efforts, including critical, complex and high-severity incidents
• Experience with post-incident reviews, incident analysis and learning from incidents
• Experience working in a tech or product company with comparable scale and complexity
• Systems thinking: reasoning about how systems and components interact and how they respond to failure
• Proficiency in one or more object-oriented programming languages (C#, JavaScript, Java, Python, etc.) or experience with infrastructure-as-code (e.g. Terraform, CloudFormation)
• Experience working with cloud providers such as AWS, Azure or GCP
• Experience designing, developing and operating distributed systems and large-scale software systems
• Strong experience delivering technical initiatives in an operational, site reliability or platform engineering capacity
• The ability to solve engineering challenges outside of your own team, including using influence rather than authority to enact change
• Demonstrated experience in reliability concepts like capacity management, autoscaling, deployment and release safety, software strategies for reliability, fault tolerance and graceful failure
• Experience implementing customer-focused Service Level Objectives (SLOs)
• Experience using software engineering to solve operational and reliability challenges
• Understanding of human factors, safety science and resilience engineering
• Experience working in environments with advanced security and networks
