ML Infrastructure Engineer, Autonomous AI

Job Category	Autonomous AI
Location	PALO ALTO, California
Req. ID	105051
Job Type	Full-time

What to Expect

As a Software Engineer within the Autonomy group, you will work on reinforcing, optimizing, and scaling our neural network training and auto-labeling infrastructure for both Autonomous AI and the edge device products.

At the core of our autonomy capabilities are multiple neural networks that the Deep Learning team designs and trains on very large amounts of data, across large-scale GPU clusters and soon our large-scale cloud infrastructure. Training networks robustly at scale, whether for production models or quick experiments, and completing those runs in the shortest time possible is critical to our mission.

What You’ll Do

Write robust Python code in our machine learning training repository, applying software engineering best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs.
Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly/nightly neural network builds, and other unit / throughput tests.
Profile performance of training software in our training cluster, identify bottlenecks in and between CPU/GPU code execution, and work on optimizing its throughput and scalability within and across nodes to ultimately reduce convergence time.
Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for Machine Learning.

What You’ll Bring

Practical experience programming in Python and/or C/C++.
Proficiency in system-level software, in particular hardware-software interactions and resource utilization.
Understanding of modern machine learning concepts and state-of-the-art deep learning.
Experience working with training frameworks, ideally PyTorch.
Demonstrated experience scaling neural network training jobs across clusters of GPUs.
Optional: Experience programming in CUDA.
Optional: Experience profiling and optimizing CPU-GPU interactions (pipelining compute/transfers, etc.).
Optional: DevOps experience, in particular dealing with clusters of training nodes and filesystems for very large amounts of training data.

Compensation and Benefits

Benefits

Along with competitive pay, as a full-time employee, you are eligible for the following benefits from day 1 of hire:

Medical plan options with $0 payroll deduction
Family-building, fertility, adoption and surrogacy benefits
Dental (including orthodontic coverage) and vision plans, both with options for a $0 paycheck contribution
Healthcare and Dependent Care
LGBTQ+ care concierge services
Contributory pension plans, employee property plans, and other benefits
Company-subsidized basic life, short-term and long-term disability insurance
Employee Assistance Program
Sick and Vacation time (Flex time for salary positions), and Paid Holidays
Back-up childcare and parenting support resources
Voluntary benefits to include: critical illness, hospital indemnity, accident insurance, and theft & legal services
Weight Loss and Tobacco Cessation Programs
Babies program
Commuter benefits
Employee discounts and perks program

Expected Compensation

$104,000 – $240,000/annual salary + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
