Senior Site Reliability Engineer - Reliability Enablement

September 17

Xero

Online accounting software. Connects to all things business: accountants, bookkeepers, banks, enterprise & apps.

Accounting • SaaS • Banking • Invoicing • Design

1001 - 5000

💰 $300M Post-IPO Debt on 2018-09

Description

• Investigating operational surprises and supporting teams in post-incident activities
• Conducting in-depth incident analysis and maximizing post-incident learning across the organization
• Completing short-term reliability consultancy and enablement engagements, such as SLO reviews and facilitating pre-mortems
• Improving on-call health, uplifting observability and addressing any operational hotspots
• Identifying, planning and leading implementation of reliability uplift work and initiatives
• Supporting delivery of strategic features and initiatives with reliability and distributed systems expertise
• Observing and improving rituals and practices relating to production operations, incident response and incident learning

Requirements

• Solid experience in logging, monitoring and observability of a highly distributed system
• Experience leading incident management, response and troubleshooting efforts, including critical, complex and high-severity incidents
• Experience with post-incident reviews, incident analysis and learning from incidents
• Experience working in a tech or product company with comparable scale and complexity
• Systems thinking: reasoning about how systems and components interact and how they respond to failure
• Proficiency in one or more object-oriented programming languages (C#, JavaScript, Java, Python, etc.) or experience with infrastructure-as-code (e.g. Terraform, CloudFormation)
• Experience working with cloud providers such as AWS, Azure or GCP
• Experience with designing, developing and operating distributed systems and large-scale software systems
• Strong experience delivering technical initiatives in an operational, site reliability or platform engineering capacity
• The ability to solve engineering challenges outside of your own team, including using influence rather than authority to enact change
• Demonstrated experience in reliability concepts like capacity management, autoscaling, deployment and release safety, software strategies for reliability, fault tolerance and graceful failure
• Experience implementing customer-focused Service Level Objectives (SLOs)
• Experience using software engineering to solve operational and reliability challenges
• Understanding of human factors, safety science and resilience engineering
• Experience working in environments with advanced security and networks
