Site Reliability Engineer - Manager

March 28


RunPod

RunPod is a cloud-based platform designed to facilitate the training, fine-tuning, and deployment of AI models. It provides a globally distributed GPU cloud that enables users to seamlessly deploy their AI workloads while focusing on building machine learning applications. With features like fast pod spin-up, autoscaling, and support for multiple machine learning frameworks, RunPod caters to startups, academic institutions, and enterprises alike, offering a powerful and cost-effective solution for machine learning development.

Machine Learning • Artificial Intelligence • Deep Learning

51 - 200 employees

Founded 2022

🤖 Artificial Intelligence

☁️ SaaS

🔥 Funding within the last year

💰 Seed Round on 2024-05

📋 Description

• RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications.
• We are seeking an experienced and visionary Site Reliability Engineering (SRE) Manager to lead and mentor our team of highly skilled Site Reliability Engineers.
• As the SRE Manager, you will be responsible for overseeing the design, implementation, and maintenance of our large-scale, distributed systems across multiple data centers.
• You will lead a team that manages our critical infrastructure, including our GPU/AI technologies, and ensure the continuous improvement of our systems' reliability, performance, and security.
• Our SRE Philosophy prioritizes automation, systems thinking, continuous improvement, proactive problem solving, and scalability through code.
• If you are passionate about leading a team of top-tier SREs, driving technical excellence, and solving complex infrastructure challenges at scale, we want to hear from you.

🎯 Requirements

• 5+ years of experience in Site Reliability Engineering or a similar role
• 3+ years of experience in a technical leadership or management position
• Deep understanding of Linux systems, containerization, virtualization, and networking technologies
• Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
• Expertise in infrastructure-as-code and configuration management tools
• Proficiency in at least one programming language, preferably Python or Golang
• Experience with cloud platforms (AWS, GCP, Azure) and their respective services
• Strong knowledge of monitoring, observability, and alerting systems
• Excellent problem-solving skills and ability to manage complex, large-scale incidents
• Proven track record of implementing and managing SLIs, SLOs, and SLAs
• Strong communication skills with the ability to convey technical concepts to both technical and non-technical stakeholders
• Successful completion of a background check
• Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Preferred)

πŸ–οΈ Benefits

β€’ The competitive base pay for this position ranges from $180,000 - $210,000. Factors that may be used to determine your actual pay may include your specific job related knowledge, skills and experience. β€’ Stock options β€’ The flexibility of remote work with an inclusive, collaborative team. β€’ An opportunity to grow with a company that values innovation and user-centric design. β€’ Generous vacation policy to ensure work-life harmony and well-being. β€’ Contribute to a company with a global impact based in the US, Canada, and Europe.

