Site Reliability Engineer - Remote

Yesterday


Paymentology

Banking • Payments • Cards • PCI • Multi Currency

201 - 500 employees

Founded 2015

💳 Fintech

💸 Finance

☁️ SaaS

💰 Seed Round on 2016-01

Description

• Build software that enhances Paymentology services' scalability and reliability.
• Ensure platform services meet required uptime and service quality levels.
• Contribute to the design of reliable cloud infrastructure and implement reusable cloud-uptime components as code.
• Regularly review and optimise SRE practices, tools, and methodologies to enhance overall system reliability and team efficiency.
• Contribute to the design, implementation, and maintenance of observability and monitoring solutions to track platform health.
• Develop and implement automation scripts and tools to streamline operations and reduce manual interventions.
• Enable product teams to self-serve by participating in the development of a developer platform.
• Play an active role with the incident response teams, diagnosing and resolving production issues quickly.
• Support product teams in building services that adhere to security and quality standards.
• Work closely with engineering, operations, and product teams to ensure reliability is considered throughout the software development lifecycle.

Requirements

• Bachelor’s Degree in Computer Science, Information Technology, or related field.
• A minimum of 3 years in a dedicated SRE role, as well as 5+ years of prior software development experience.
• Comprehensive understanding of large-scale distributed platform architecture.
• Extensive hands-on cloud experience, particularly with AWS.
• Proven experience developing scalable, modular infrastructure-as-code projects using tools such as Terraform, CloudFormation, Puppet, and Ansible.
• Practical experience with Docker and container orchestrators, including AWS ECS and EKS, and Kubernetes.
• Experience in administering or integrating identity management systems for SSO, including AWS IAM, Okta, and Active Directory.
• Experience with disaster recovery and redundancy strategies in both cloud and on-premises environments.
• Proficiency with leading monitoring tools, such as Datadog, Splunk, Prometheus, Grafana, ELK Stack, and New Relic.
• Programming expertise, especially in systems programming languages (e.g., Java, Kotlin, Scala) and databases (e.g., SQL Server, PostgreSQL).
• Familiarity with industry-leading CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline, CircleCI, and ArgoCD.
• Track record of achieving platform-level and end-to-end SLIs, SLOs, and SLAs, and fostering accountability.
• Ability to navigate complex situations and lead effective post-incident reviews (PIRs).
• Knowledge of implementing solutions to reduce Mean Time to Identify (MTTI) and Mean Time to Resolve (MTTR).
• Expertise in implementing best practices for load balancing, fault tolerance, and resource allocation to maintain service quality and efficiency at scale.
• Understanding of security best practices within cloud environments.
• Exceptional communication skills in English.

Benefits

• Full-time remote position with flexible hours.
• An inclusive and supportive work environment that values diversity.
• A chance to work on cutting-edge technology projects that make a difference.
• Opportunities for continuous learning and development.
