ML Infrastructure Engineer, Autonomous AI

Job Category: Autonomous AI
Location: Palo Alto, California
Req. ID: 105051
Job Type: Full-time
What to Expect

As a Software Engineer within the Autonomy group, you will work on reinforcing, optimizing, and scaling our neural network training and auto-labeling infrastructure, both for Autonomous AI and for the edge device products.

At the core of our autonomy capabilities are multiple neural networks that the Deep Learning team is designing to train on very large amounts of data, across large-scale GPU clusters and soon our large-scale cloud infrastructure. Training networks robustly at scale, whether for production models or quick experiments, and completing those training runs in the shortest time possible is critical to our mission.

What You’ll Do
  • Write robust Python code in our machine learning training repository, applying software best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs.
  • Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly/nightly neural network builds, and other unit and throughput tests.
  • Profile performance of training software in our training cluster, identify bottlenecks in and between CPU/GPU code execution, and work on optimizing its throughput and scalability within and across nodes to ultimately reduce convergence time.
  • Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning workloads.
What You’ll Bring
  • Practical experience programming in Python and/or C/C++.
  • Proficiency in system-level software, in particular hardware-software interactions and resource utilization.
  • Understanding of modern machine learning concepts and state-of-the-art deep learning.
  • Experience working with training frameworks, ideally PyTorch.
  • Demonstrated experience scaling neural network training jobs across clusters of GPUs.
  • Optional: Experience programming in CUDA.
  • Optional: Experience profiling and optimizing CPU-GPU interactions (pipelining compute/transfers, etc.).
  • Optional: DevOps experience, in particular dealing with clusters of training nodes and filesystems for very large amounts of training data.
Compensation and Benefits


Along with competitive pay, as a full-time employee, you are eligible for the following benefits from day 1 of hire:

  • Medical plan options with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both of which have options with a $0 paycheck contribution
  • Healthcare and Dependent Care
  • LGBTQ+ care concierge services
  • Contributory pension plans, employee property plans, and other benefits
  • Company-subsidized basic life, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and Vacation time (Flex time for salary positions), and Paid Holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits, including critical illness, hospital indemnity, accident insurance, and theft and legal services
  • Weight Loss and Tobacco Cessation Programs
  • Babies program
  • Commuter benefits
  • Employee discounts and perks program

Expected Compensation

$104,000 – $240,000/annual salary + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.