We're looking for a Senior DevOps Engineer to help design, build, and optimise the cloud infrastructure powering our machine learning and data platforms. You'll play a critical role in taking AI models from research to production, ensuring scalable deployments, real-time monitoring, and highly reliable infrastructure across our Google Cloud Platform (GCP) environment.
Working closely with Data Scientists, ML Engineers, Product, and Security teams, you'll drive automation, improve platform performance, and help build the infrastructure that powers AI experiences for millions of users worldwide. The role also involves technical leadership: mentoring engineers and helping shape engineering best practices across the organisation.
What You'll Do
- Design, build, and automate cloud infrastructure on Google Cloud Platform (GCP).
- Develop Infrastructure as Code using Terraform and Ansible.
- Build and maintain CI/CD pipelines for machine learning models and data workflows using Jenkins and Vertex AI.
- Design and manage scalable real-time and batch data pipelines using BigQuery, Bigtable, Dataflow, Cloud Composer, Pub/Sub, and Cloud Run.
- Implement monitoring and observability for AI models, covering model drift, bias, and performance.
- Optimise AI inference performance to improve latency, scalability, and cost efficiency.
- Ensure the reliability, availability, and security of production ML and data platforms.
- Establish best practices for infrastructure, deployment, monitoring, logging, and security.
- Troubleshoot complex production issues across cloud infrastructure and ML pipelines.
- Ensure compliance with data governance and regulatory requirements.
- Mentor DevOps engineers, lead technical initiatives, and support sprint planning and delivery.
- Conduct code reviews and promote engineering best practices across the team.
- Partner with ML, Data, Product, and Security teams to align infrastructure with business goals.
What We're Looking For
- 5+ years' experience in DevOps, Platform Engineering, or Cloud Infrastructure, ideally supporting ML and data platforms.
- Experience leading technical projects and mentoring engineering teams.
- Strong hands-on experience with Google Cloud Platform (GCP), including BigQuery, Dataflow, Vertex AI, Cloud Run, and Pub/Sub.
- Proven experience with Terraform (Ansible experience is a plus).
- Strong knowledge of Docker, Kubernetes, and GKE.
- Experience building and maintaining CI/CD pipelines using Jenkins.
- Solid understanding of monitoring, logging, observability, and production reliability.
- Experience scripting with Python, Shell, or Groovy.
- Excellent communication skills and experience working within cross-functional engineering teams.
Nice to Have
- Experience with Vertex AI Model Monitoring.
- Knowledge of LangGraph, LangChain, or AI orchestration frameworks.
- Experience building infrastructure for machine learning workloads.
- Background in gaming, AI-driven personalisation, fraud detection, or other real-time, high-scale environments.