Tech Recruitment • Product Recruitment • Science Recruitment • Consulting • Talent Acquisition
December 20, 2023
• Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
• Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
• Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
• Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
• Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
• Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
• Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimization.
• Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.
• Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
• Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
• Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
• Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
• Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
• Scripting and automation skills (e.g., Python, Bash).
• Excellent problem-solving and troubleshooting skills.
• Strong communication and collaboration skills.
• Excellent salary and benefits package
• Opportunity to work with cutting-edge technology
• Collaborative and innovative work environment
April 9, 2023
March 6, 2023
January 12, 2022
Lead infrastructure design for KX's cloud service deployment in analytics.
🇬🇧 United Kingdom – Remote
💵 £60 - £120 / year
⏰ Full Time
🟡 Mid-level
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)
🇬🇧 UK Skilled Worker Visa Sponsor