techruiter.

Tech Recruitment • Product Recruitment • Science Recruitment • Consulting • Talent Acquisition

11 - 50 employees

Founded 2019

🎯 Recruitment

🏢 Enterprise

🤝 B2B

Site Reliability Engineer - LLM and Machine Learning

December 20, 2023

🇬🇧 United Kingdom – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

AWS

Azure

Bash

Cloud

Docker

GCP

Grafana

Kubernetes

Prometheus

Python

Terraform

Description

• Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
• Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
• Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
• Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
• Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
• Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
• Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimization.
• Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

• Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
• Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
• Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
• Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
• Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
• Scripting and automation skills (e.g., Python, Bash).
• Excellent problem-solving and troubleshooting skills.
• Strong communication and collaboration skills.