Yesterday
🇬🇧 United Kingdom – Remote
⏳ Contract/Temporary
🟠 Senior
🔴 Lead
⛑ DevOps & Site Reliability Engineer (SRE)
Ansible
AWS
Cloud
Docker
Grafana
Java
Jenkins
Kotlin
Kubernetes
Postgres
Prometheus
Puppet
Scala
Splunk
SQL
Terraform
Go
• Build software that enhances Paymentology services' scalability and reliability.
• Ensure platform services meet required uptime and service quality levels.
• Contribute to the design of reliable cloud infrastructure and implement reusable cloud-uptime components as code.
• Regularly review and optimise SRE practices, tools, and methodologies to enhance overall system reliability and team efficiency.
• Contribute to the design, implementation, and maintenance of observability and monitoring solutions that track platform health.
• Develop and implement automation scripts and tools to streamline operations and reduce manual interventions.
• Enable product teams to self-serve by participating in the development of a developer platform.
• Play an active role with the incident response teams, diagnosing and resolving production issues quickly.
• Support product teams in building services that adhere to security and quality standards.
• Work closely with engineering, operations, and product teams to ensure reliability is considered throughout the software development lifecycle.
• Bachelor’s degree in Computer Science, Information Technology, or a related field.
• A minimum of 3 years in a dedicated SRE role, as well as 5+ years of prior software development experience.
• Comprehensive understanding of large-scale distributed platform architecture.
• Extensive hands-on cloud experience, particularly with AWS.
• Proven experience developing scalable, modular infrastructure-as-code projects using tools such as Terraform, CloudFormation, Puppet, and Ansible.
• Practical experience with Docker and container orchestrators, including Kubernetes, AWS ECS, and AWS EKS.
• Experience in administering or integrating identity management systems for SSO, including AWS IAM, Okta, and Active Directory.
• Experience with disaster recovery and redundancy strategies in both cloud and on-premises environments.
• Proficiency with leading monitoring tools, such as Datadog, Splunk, Prometheus, Grafana, ELK Stack, and New Relic.
• Programming expertise, especially in systems programming languages (e.g., Java, Kotlin, Scala) and databases (e.g., SQL Server, PostgreSQL).
• Familiarity with industry-leading CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline, CircleCI, and Argo CD.
• Track record of achieving platform-level and end-to-end SLIs, SLOs, and SLAs, and fostering accountability.
• Ability to navigate complex situations and lead effective post-incident reviews (PIRs).
• Knowledge of implementing solutions to reduce Mean Time to Identify (MTTI) and Mean Time to Resolve (MTTR).
• Expertise in implementing best practices for load balancing, fault tolerance, and resource allocation to maintain service quality and efficiency at scale.
• Understanding of security best practices within cloud environments.
• Exceptional communication skills in English.
• Full-time remote position with flexible hours.
• An inclusive and supportive work environment that values diversity.
• A chance to work on cutting-edge technology projects that make a difference.
• Opportunities for continuous learning and development.