Scicom Infrastructure Services

11 - 50 employees

🏢 Enterprise

☁️ SaaS

🤝 B2B

Site Reliability Engineering Lead

December 14

🇺🇸 United States – Remote

⏳ Contract/Temporary

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Azure

Bash

Cloud

Grafana

Jenkins

Kafka

Kubernetes

Microservices

Open Source

Prometheus

Python

SQL

Terraform

.NET

Description

• Lead and mentor a team of SREs to ensure the reliability, availability, and performance of our large distributed web platform.
• Foster a collaborative and inclusive team environment, encouraging continuous learning and professional growth.
• Set clear goals and expectations for the SRE team, providing regular feedback and performance evaluations.
• Develop and implement automation strategies to streamline operations, reduce manual intervention, and improve overall system reliability.
• Identify opportunities for automation across the infrastructure and application lifecycle, from deployment to monitoring and incident response.
• Ensure that automation tools and scripts are well-documented, maintainable, and scalable.
• Design and implement preventive infrastructure monitoring solutions, including synthetic tests, to proactively identify and address potential issues.
• Develop and maintain monitoring dashboards and alerting systems to provide real-time visibility into system health and performance.
• Continuously improve monitoring and alerting processes to reduce false positives and ensure timely detection of critical issues.
• Collaborate with engineering teams to ensure that observability and resiliency requirements are met for all new and existing services.
• Provide guidance on best practices for logging, monitoring, and alerting to ensure comprehensive observability.
• Work closely with development teams to design and implement resilient architectures that can withstand failures and recover quickly.
• Coordinate the support of code release and go-live activities, ensuring smooth and reliable deployments.
• Conduct post-release reviews to identify areas for improvement and ensure that lessons learned are applied to future releases.
• Conduct regular performance tuning exercises to optimize system performance and ensure efficient resource utilization.
• Perform capacity planning to anticipate future growth and ensure that the infrastructure can scale to meet demand.
• Plan and execute disaster recovery exercises to validate the effectiveness of backup and recovery procedures.
• Stay up-to-date with industry trends and best practices in SRE, cloud computing, and automation.
• Continuously evaluate new tools and technologies to enhance the reliability, scalability, and efficiency of the platform.
• Share knowledge and insights with the team and the broader organization to promote a culture of continuous improvement and innovation.

Requirements

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
• Strong communication and leadership skills, with the ability to work effectively in a collaborative team environment.
• Proven experience as an SRE or in a similar role, with a focus on large distributed web platforms.
• Strong expertise in Azure cloud services and infrastructure management.
• Proficiency in Infrastructure as Code (IaC) tools such as AWS CloudFormation/CDK, Azure Bicep/ARM templates, Terraform, or similar.
• Experience with container orchestration platforms like Azure Container Apps and Kubernetes.
• Familiarity with serverless computing frameworks such as Azure Functions or AWS Lambda.
• Knowledge of Content Delivery Networks (CDNs) and their configuration and management.
• Experience maintaining, monitoring, and tuning the performance of heavily loaded SQL Server instances.
• Experience with messaging and streaming platforms such as Azure Service Bus, Azure Event Hubs, and Kafka.
• Strong scripting and automation skills using languages such as Python, Bash, or PowerShell.
• Experience with monitoring and observability tools such as Azure Monitor, AWS CloudWatch, Prometheus, and Grafana.
• Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed environment.
