December 14
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
H1B Visa Sponsor
AWS
Azure
Bash
Cloud
Grafana
Jenkins
Kafka
Kubernetes
Microservices
Open Source
Prometheus
Python
SQL
Terraform
.NET
• Lead and mentor a team of SREs to ensure the reliability, availability, and performance of our large distributed web platform.
• Foster a collaborative and inclusive team environment, encouraging continuous learning and professional growth.
• Set clear goals and expectations for the SRE team, providing regular feedback and performance evaluations.
• Develop and implement automation strategies to streamline operations, reduce manual intervention, and improve overall system reliability.
• Identify opportunities for automation across the infrastructure and application lifecycle, from deployment to monitoring and incident response.
• Ensure that automation tools and scripts are well-documented, maintainable, and scalable.
• Design and implement preventive infrastructure monitoring solutions, including synthetic tests, to proactively identify and address potential issues.
• Develop and maintain monitoring dashboards and alerting systems to provide real-time visibility into system health and performance.
• Continuously improve monitoring and alerting processes to reduce false positives and ensure timely detection of critical issues.
• Collaborate with engineering teams to ensure that observability and resiliency requirements are met for all new and existing services.
• Provide guidance on best practices for logging, monitoring, and alerting to ensure comprehensive observability.
• Work closely with development teams to design and implement resilient architectures that can withstand failures and recover quickly.
• Coordinate the support of code release and go-live activities, ensuring smooth and reliable deployments.
• Conduct post-release reviews to identify areas for improvement and ensure that lessons learned are applied to future releases.
• Conduct regular performance tuning exercises to optimize system performance and ensure efficient resource utilization.
• Perform capacity planning to anticipate future growth and ensure that the infrastructure can scale to meet demand.
• Plan and execute disaster recovery exercises to validate the effectiveness of backup and recovery procedures.
• Stay up-to-date with industry trends and best practices in SRE, cloud computing, and automation.
• Continuously evaluate new tools and technologies to enhance the reliability, scalability, and efficiency of the platform.
• Share knowledge and insights with the team and the broader organization to promote a culture of continuous improvement and innovation.

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
• Strong communication and leadership skills, with the ability to work effectively in a collaborative team environment.
• Proven experience as an SRE or in a similar role, with a focus on large distributed web platforms.
• Strong expertise in Azure cloud services and infrastructure management.
• Proficiency in Infrastructure as Code (IaC) tools such as AWS CloudFormation/CDK, Azure Bicep/ARM templates, Terraform, or similar.
• Experience with container orchestration platforms like Azure Container Apps and Kubernetes.
• Familiarity with serverless computing frameworks such as Azure Functions or AWS Lambda.
• Knowledge of Content Delivery Networks (CDNs) and their configuration and management.
• Experience maintaining heavily loaded SQL Server instances, including performance monitoring and tuning.
• Experience with messaging and streaming platforms like Azure Service Bus, Azure Event Hubs, and Kafka.
• Strong scripting and automation skills using languages such as Python, Bash, or PowerShell.
• Experience with monitoring and observability tools such as Azure Monitor, AWS CloudWatch, Prometheus, and Grafana.
• Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed environment.
November 7
DevOps Engineer driving automation for Sky Solutions, transforming technology since 2008.
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
H1B Visa Sponsor
June 3
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
August 5, 2023
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
H1B Visa Sponsor
August 4, 2023
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
AWS
Bitbucket
Cloud
DevOps
Docker
Git
GitHub
GitLab
Groovy
Java
Jenkins
Jira
JSON
Kubernetes
Python
ServiceNow
Spring
Spring Boot
August 4, 2023
United States – Remote
Contract/Temporary
Senior
DevOps & Site Reliability Engineer (SRE)
H1B Visa Sponsor