• The purpose of the Site Reliability Engineer (SRE) role is to ensure the stability, scalability, and performance of production systems while driving improvements in overall system reliability and operational efficiency.
• By bridging the gap between development and operations, the SRE role focuses on creating a resilient infrastructure through automation, monitoring, and proactive incident management.
• The SRE is responsible for designing and implementing tools and processes that enhance the reliability of applications, reduce downtime, and optimize system performance.
• They work to establish best practices for high availability, incident response, and continuous improvement, ensuring seamless user experiences and aligning system operations with business objectives.
• The SRE plays a critical role in both preventing and rapidly resolving issues, contributing to a stable, scalable, and reliable technology ecosystem.
• Design, implement, and maintain highly available infrastructure, focusing on failover strategies, redundancy, and scalability.
• Develop and maintain Infrastructure as Code (IaC) scripts using tools like Terraform, Ansible, or CloudFormation.
• Set up and manage monitoring and alerting systems to proactively detect issues (using tools like Prometheus, Grafana, or Datadog).
• Automate repetitive tasks, deployments, and infrastructure provisioning to improve efficiency and reduce human error.
• Conduct performance tuning and optimizations across infrastructure, applications, and databases to improve responsiveness and reduce latency.
• Work closely with security teams to ensure compliance with regulatory standards, address vulnerabilities promptly, and implement security best practices across infrastructure and applications to protect systems and data.
• Collaborate with development teams to optimize applications and integrate reliability into the software development lifecycle.
• Partner with DevOps to improve CI/CD pipelines, streamline releases, and enhance build and deployment automation.
• Advocate for Site Reliability Engineering principles and educate teams on reliability best practices, monitoring, and error handling.
• Implement and track SLAs, SLOs, and error budgets, continuously assessing and improving reliability.
• Infrastructure as Code (IaC): Proficiency with IaC tools such as Terraform, Ansible, CloudFormation, or similar for automating infrastructure provisioning.
• Cloud Platforms: Strong experience with cloud providers (Azure) and services such as Kubernetes (EKS/GKE/AKS).
• Monitoring and Alerting: Hands-on experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, New Relic, or similar).
• Scripting and Automation: Proficiency in scripting languages like Python, Bash, or PowerShell for automation and tooling.
• CI/CD and DevOps: Familiarity with CI/CD pipelines and tools (Azure DevOps, Bamboo, or Octopus), and experience implementing continuous delivery and deployment practices.
• Incident Management: Experience with troubleshooting, root cause analysis, and leading incident response efforts.
• Strong performance optimization skills
• Ability to analyze complex systems
• Understanding of security practices
• Competitive salary commensurate with skills and experience
• Performance bonus structure dependent on achievement of set targets and personal performance
• Consultancy contract (B2B) offering paid time off