Site Reliability Engineer

September 18

Description

• Reporting to the Head of Cloud Enablement Engineering, the Site Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions business.
• This role will have the opportunity to make a lasting impact on the company's digital transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive digital banking landscape.
• Responsible for design and maintenance of monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics.
• Implement automation tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.
• Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime, optimizing response times, and maintaining high availability for users.
• Lead incident response efforts for incidents, including identification, triage, resolution, and post-incident analysis to prevent recurrence and improve system resilience.
• Conduct capacity planning, performance tuning, and resource optimization for environments, collaborating with development and operations teams to meet scalability and performance goals.
• Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.
• Manage deployment pipelines, release processes, and configuration management for app deployments, ensuring consistency, reliability, and version control across environments.
• Identify areas for improvement in reliability, performance, and efficiency through data analysis, root cause analysis, and trend analysis, and drive initiatives to enhance system reliability and operational efficiency.
• Create and maintain documentation, runbooks, and knowledge base articles for operational procedures, troubleshooting guides, and best practices, and promote knowledge sharing within the team.
• Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services, ensuring business continuity and data integrity in case of failures or disasters.
• Collaborate with development, QA, DevOps, and product teams to ensure alignment on reliability goals, performance metrics, release schedules, and incident response processes.
• Participate in on-call rotations and provide 24/7 support for critical incidents, troubleshoot issues, and coordinate with teams for resolution, escalation, and follow-up actions as per defined SLAs.

Requirements

• Proficient in development technologies, architectures, and platforms (web, API) to understand system complexities and performance considerations.
• Experience in cloud platforms (e.g., AWS, Azure, Google Cloud) and infrastructure as code (IaC) tools for managing app infrastructure and deployments.
• Knowledge of monitoring tools (e.g., Prometheus, Grafana, DataDog, New Relic) and logging frameworks (e.g., Splunk, SumoLogic, ELK Stack) for real-time visibility into system health, performance metrics, and user experience.
• Experience in incident management, including incident response, triage, root cause analysis (RCA), and post-mortem reviews to prevent recurring issues.
• Strong troubleshooting skills to diagnose complex technical issues in app environments, infrastructure, networking, and performance bottlenecks.
• Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform, Ansible) for automating routine tasks, deployments, and infrastructure management.
• Experience in implementing continuous integration/continuous deployment (CI/CD) pipelines for apps using tools like Jenkins, GitLab CI/CD, or Azure DevOps.
• Expertise in setting up monitoring solutions, configuring alerts, and creating dashboards to monitor system performance, application metrics, and user experience.
• Familiarity with APM (Application Performance Monitoring) tools to analyze app performance, identify bottlenecks, and optimize resource utilization.
• Familiarity with RUM (Real User Monitoring) for tracking and analyzing user interaction and system performance.
• Commitment to continuous learning, staying updated with industry trends, new technologies, and best practices in app reliability, performance, and operations.
• Adaptability to evolving requirements, technologies, and business needs, with a focus on driving continuous improvement and operational excellence.
