Site Reliability Engineering Lead

December 14

Description

• Lead and mentor a team of SREs to ensure the reliability, availability, and performance of our large distributed web platform.
• Foster a collaborative and inclusive team environment, encouraging continuous learning and professional growth.
• Set clear goals and expectations for the SRE team, providing regular feedback and performance evaluations.
• Develop and implement automation strategies to streamline operations, reduce manual intervention, and improve overall system reliability.
• Identify opportunities for automation across the infrastructure and application lifecycle, from deployment to monitoring and incident response.
• Ensure that automation tools and scripts are well-documented, maintainable, and scalable.
• Design and implement preventive infrastructure monitoring solutions, including synthetic tests, to proactively identify and address potential issues.
• Develop and maintain monitoring dashboards and alerting systems to provide real-time visibility into system health and performance.
• Continuously improve monitoring and alerting processes to reduce false positives and ensure timely detection of critical issues.
• Collaborate with engineering teams to ensure that observability and resiliency requirements are met for all new and existing services.
• Provide guidance on best practices for logging, monitoring, and alerting to ensure comprehensive observability.
• Work closely with development teams to design and implement resilient architectures that can withstand failures and recover quickly.
• Coordinate the support of code release and go-live activities, ensuring smooth and reliable deployments.
• Conduct post-release reviews to identify areas for improvement and ensure that lessons learned are applied to future releases.
• Conduct regular performance tuning exercises to optimize system performance and ensure efficient resource utilization.
• Perform capacity planning to anticipate future growth and ensure that the infrastructure can scale to meet demand.
• Plan and execute disaster recovery exercises to validate the effectiveness of backup and recovery procedures.
• Stay up-to-date with industry trends and best practices in SRE, cloud computing, and automation.
• Continuously evaluate new tools and technologies to enhance the reliability, scalability, and efficiency of the platform.
• Share knowledge and insights with the team and the broader organization to promote a culture of continuous improvement and innovation.

Requirements

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
• Strong communication and leadership skills, with the ability to work effectively in a collaborative team environment.
• Proven experience as an SRE or in a similar role, with a focus on large distributed web platforms.
• Strong expertise in Azure cloud services and infrastructure management.
• Proficiency in Infrastructure as Code (IaC) tools such as AWS CloudFormation/CDK, Azure Bicep/ARM templates, Terraform, or similar.
• Experience with container platforms such as Azure Container Apps and Kubernetes.
• Familiarity with serverless computing frameworks such as Azure Functions or AWS Lambda.
• Knowledge of Content Delivery Networks (CDNs) and their configuration and management.
• Experience maintaining, monitoring, and tuning the performance of heavily loaded SQL Server databases.
• Experience with messaging and streaming platforms such as Azure Service Bus, Azure Event Hubs, and Kafka.
• Strong scripting and automation skills using languages such as Python, Bash, or PowerShell.
• Experience with monitoring and observability tools such as Azure Monitor, AWS CloudWatch, Prometheus, and Grafana.
• Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed environment.
