Senior ML Systems Engineer

Advanced Technical Recruitment

We are looking for a Senior ML Systems Engineer to build and validate simulation infrastructure for large-scale machine learning systems. This role focuses on modelling the compute and communication behaviour of systems used for ML training and inference, and using simulation to guide architecture, performance optimisation, and capacity planning.

The ideal candidate combines strong systems experience with hands-on experience in measurement, benchmarking, and performance analysis of modern ML systems.

Experience:

The ideal candidate will have strong experience in ML systems, distributed systems, performance engineering, computer architecture, or simulation, along with hands-on experience in benchmarking, profiling, and measuring the performance of ML systems.

You should have an understanding of systems used for machine learning training and inference, coupled with experience analysing compute, communication, and memory behaviour in large-scale ML systems.

You should also have experience with distributed training concepts such as data parallelism, tensor/model parallelism, pipeline parallelism, collectives, and synchronization overheads.

Proficiency in one of the following is preferred: Python, C++, or Rust.

You should have strong analytical skills and the ability to connect simulation results to real system behaviour.

Qualifications:

We are looking for a Master’s or PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field.

Essential Requirements:

Candidates MUST be eligible to live and work in the UK, without ever requiring sponsorship. Copies of visa and passport will be requested.

Candidates MUST be able to work onsite / commute to London on a hybrid basis.

Candidates MUST have experience simulating distributed ML training/inference workloads.

Candidates MUST have profiled distributed GPU-based ML workloads (inference/training).

Candidates MUST have experience with packet-level/discrete-event simulation using ns-3 or similar.

Salary / Benefits:

In addition to a competitive salary, my client offers a range of benefits including hybrid and flexible working, stock options, 25 days holiday, and relocation assistance.

Skills: ML Systems, Python, C++, Rust, Simulation, Machine Learning, GPU, PyTorch, JAX, XLA, NVLink, PCIe.
