Site Reliability Engineer

April 12

🇦🇷 Argentina – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

👨🏻‍🔧 Site Reliability Engineer (SRE)

Tyk

Open Source #API gateway & #APImanagement platform. We're on a mission to connect every system in the world.

API Management • API Gateways • Authentication Provider • API Consultancy • Open Source

51 - 200

Description

• Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution
• Alerting and Monitoring: Collaborate with Senior SRE to identify opportunities for building proactive alerting and monitoring systems; implement solutions to enhance system reliability
• Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement
• Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure
• Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution
• Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands
• Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands
• Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure
• Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes
• Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services
• Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents
• Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed
• On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis
• Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform

Requirements

• Strong collaboration skills
• Launching and operating production Kubernetes clusters
• Designing and operating infrastructure on AWS and other providers
• Operating MongoDB (or other document database) clusters
• Operating Redis (or other key-value storage) clusters
• Administering Linux servers
• Maintaining distributed software
• Operating Prometheus and Grafana
• Operating log collection and analysis systems
• Working hours within 16:00 - 04:00 UTC

Benefits

• Unlimited paid holidays
• Remote working from anywhere in the world
• Employee share scheme
• Generous maternity and paternity leave
• Volunteering Days
• Company retreats
• Employee Wellbeing platform
