GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming
10,000+
September 15
Ansible
Chef
Cloud
Distributed Systems
Docker
Java
Kubernetes
OpenStack
Perl
Prometheus
Puppet
Python
Ruby
Terraform
Go
• Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
• Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
• Work closely with peers on the team to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
• Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, including leveraging machine learning models.
• Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice sustainable incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems.
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• At least 5 years of practical experience.
• Background in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
• Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby, and AI/ML frameworks and methodologies.
• Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
• Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.
• Experience with Git, code review, pipelines, and CI/CD.
• Strong debugging skills with a systematic problem-solving approach to identifying complex problems.
September 14
501 - 1000
Join Solvd Inc. to enhance Kubernetes management platforms for global clients.
September 3
51 - 200
Build and maintain Beekeeper’s production infrastructure for seamless user experience.
🇵🇱 Poland – Remote
💰 $50M Series C on 2022-11
⏰ Full Time
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)
August 31
1001 - 5000
Infrastructure team solving complex system and network problems with automation.
🇵🇱 Poland – Remote
💵 PLN22.8k - PLN32.9k / year
⏰ Full Time
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)
August 28
11 - 50
Support management of a private cloud environment and enhance GenAI applications for municipalities.
August 14
1001 - 5000
Build and manage highly available, distributed systems for enterprise conversational applications.
🇵🇱 Poland – Remote
💰 $2.3M Post-IPO Equity on 2012-06
⏰ Full Time
🟡 Mid-level
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)