Site Reliability Engineer, AI Infrastructure
Tesla
Software Engineering, Other Engineering, Data Science
Palo Alto, CA, USA
Posted 6+ months ago
What to Expect
Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Tesla Bot, Full Self-Driving (FSD) and Robotaxi efforts continue to scale, so does the value of this team and its work.
What You’ll Do
As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our Full Self-Driving (FSD), Tesla Bot and Dojo engineering teams have the necessary tools and resources to be productive. This includes managing and operating our AI infrastructure, monitoring compute, GPU and network metrics, Linux troubleshooting and performance tuning, and security. Your work will directly facilitate neural network training at scale, streamline FSD development, and enable Dojo to become the most powerful supercomputer to date.
- Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
- Improve our monitoring & self-healing pipelines, as well as security posture
- Optimize our server, storage and network performance
- Develop new tools in Python, Golang or Bash/Shell
- Use Infrastructure as Code best practices
- Participate in 24x7 on-call rotation
What You’ll Bring
- Proficiency in Python, Golang and/or Bash
- Proficiency with Linux fundamentals and performance optimizations
- Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.)
- Experience with containerization technologies such as Kubernetes
- Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus
- Experience with Slurm, LSF and storage management of parallel file systems is a plus
- Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in a related field
- 3+ years of additional equivalent experience or evidence of exceptional ability related to the position