September 24
• Ensure platform reliability: Lead efforts to enhance the reliability and availability of our digital business card platform, ensuring users have a seamless experience when sharing and managing their information.
• Monitor and optimize performance: Continuously improve platform performance, making sure that it scales efficiently and remains responsive as our user base grows.
• Incident detection and response: Implement robust monitoring and alerting systems to detect and resolve issues swiftly, minimizing downtime for users during critical networking moments.
• Collaborate with cross-functional teams: Work with product, development, and operations teams to integrate reliability engineering into the product lifecycle, ensuring that reliability is considered from design through deployment.
• Automation and scaling: Automate manual processes and optimize system scalability, reducing human intervention and ensuring the platform remains stable under increased user demand.
• Leadership and mentoring: Mentor junior engineers in reliability best practices, fostering a culture of reliability across engineering teams.
• Post-incident analysis: Perform root cause analysis for incidents and outages, driving initiatives to prevent future occurrences and improve system resiliency.
• 8+ years of experience in site reliability engineering within SaaS or digital products.
• Experience with cloud platforms (AWS, GCP, Azure), Kubernetes, Docker, Terraform, and infrastructure-as-code.
• Strong expertise in automating workflows with TypeScript, Node.js, or similar programming languages to improve efficiency and system resilience.
• Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) to implement effective observability and alerting systems.
• Demonstrated ability to lead incident response processes, manage critical outages, and implement long-term improvements.
• Excellent communication skills and a collaborative mindset for working with cross-functional teams.