Senior Software Engineer, Site Reliability

October 25

Gretel.ai

Generative AI • Synthetic Data • Machine Learning • Privacy • AI

Description

• Ensure the safety, security, and reliability of our cloud infrastructure
• Build and maintain Gretel's observability stack
• Scale systems sustainably with automation
• Manage and lead incident response, recovery, and postmortems
• Partner with software engineers to troubleshoot production issues
• Build tools and frameworks for Gretel engineers
• Ship complex ML/AI models with applied science and engineering teams

Requirements

• Experience with at least one cloud platform (we use AWS heavily)
• Experience with Docker and Kubernetes
• Ability to write software and tools in Python or Go
• Experience with monitoring, alerting, and operations
• Experience operating highly available distributed systems in the cloud
• Experience identifying, diagnosing, and responding to operational outages
• Experience with infrastructure as code (Terraform, CloudFormation, etc.)
• Experience with build systems such as Bazel
• Experience shipping applications with complex dependencies (PyTorch, TensorFlow)
• Software engineering skills beyond script writing (TDD, design patterns, etc.)
• Experience with DevOps or CI/CD pipelines
