2 days ago
Airflow
Apache
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Grafana
Jenkins
Kubernetes
Node.js
OpenShift
Prometheus
Python
Terraform
Go
• Your goal will be to enhance scalability, performance, and reliability while minimizing operational overhead by leveraging your deep understanding of container orchestration (Kubernetes) and cloud platforms (AWS, Azure, GCP, OpenShift, etc.).
• You will collaborate closely with cross-functional teams, including CRE, Platform, and QA, to drive continuous improvement initiatives.
• Upholding the highest standards of security and compliance, you will implement robust measures to protect our infrastructure and customer data.
• Using monitoring tools such as ELK and Prometheus, along with performance metrics, you will identify areas for optimization and implement strategies to improve system performance and resource utilization for a customer's on-premises installation.
• Serve as a primary point of responsibility for the overall health, performance, and capacity of our platform.
• Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and growth.
• Develop tools to improve our ability to rapidly deploy and effectively monitor applications in a large-scale environment.
• Work closely with development teams to ensure the platform is designed with operability in mind.
• Identify and lead efforts to improve automation.
• Perform root cause analysis and document results in the form of post-mortems.
• Write and maintain documentation around key systems and processes.
• Participate in an on-call rotation with some of our customers.
• Function well in a fast-paced, rapidly changing environment.
• 5 years of hands-on experience operating Kubernetes clusters in a production environment.
• Experience in managing and scaling distributed systems in one of the three major cloud providers (AWS, Azure, GCP).
• Strong experience with at least one Continuous Integration system, such as CircleCI or Jenkins.
• Understanding of the Linux operating system and of standard networking protocols and components.
• Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks.
• Automation/scripting experience with Shell, Python, or similar.
• Familiarity with Infrastructure as Code (IaC) tools (Terraform, CloudFormation, etc.).
• Strong troubleshooting and problem-solving skills.