November 6
• Provide live operational support for multiple client software applications, monitoring services and alerts to detect critical failures, ensuring rapid restoration of services and minimal downtime.
• Develop and maintain code to resolve production issues quickly, leveraging strong development skills to ensure fast service recovery and long-term system stability.
• Own and resolve incidents reported by clients and internal stakeholders, adhering to client SLA and internal SLO timelines.
• Troubleshoot complex incidents, perform thorough root cause analyses, and implement solutions to prevent the recurrence of issues.
• Utilize a data-driven approach to prepare detailed analyses and reports, presenting findings through charts, layouts, and diagrams.
• Conduct deep technical analyses of product and feature deficiencies, addressing client pain points based on actual use cases.
• Develop and enhance monitoring systems to proactively detect issues, implementing robust alert mechanisms to ensure continuous system stability.
• Provide expert guidance on improving operational system stability and scalability.
• Lead and execute initiatives that automate processes, improving operational efficiency across LiveOps.
• Facilitate postmortem meetings following incidents, documenting findings, and assigning action items for future prevention.
• Collaborate with cross-functional teams to ensure rapid resolution of production issues, implementing long-term fixes.
• Lead and motivate project teams, ensuring tasks are completed on schedule and that high-quality standards are consistently met.
• Mentor and provide ongoing training to reliability engineers, tracking their progress and ensuring adherence to high standards.
• Actively contribute to maintaining the highest quality standards as the organization continues to scale.
• Participate in after-hours on-call support as part of the LiveOps rotation.
• Operationally focused, with expertise in incident management and resolving live production issues
• Strong debugging and troubleshooting skills, particularly in performance optimization of large-scale applications
• Proven experience in building and maintaining reliable monitoring and alerting systems in high-demand environments, with a focus on production support
• 7+ years of experience with .NET Framework (C#), ensuring production system stability
• Strong knowledge of Kubernetes, Docker, and cloud platforms (GCP preferred)
• Proficiency with monitoring tools like Prometheus, Grafana, and Kibana
• Experience with incident ticketing/documentation tools like FreshDesk and Confluence
• Critical thinker who can identify system weaknesses and find innovative solutions
• Strong project management skills with a focus on scalability and system stability
• ITIL Service Management certification (ITIL v3, ITIL v4, or equivalent) is highly desired
• Experience with PowerBI, web scraping, or Golang (nice to have)