Job Details

Software Engineer, Systems Reliability

at OpenAI

Posted: 9/4/2020
Job Status: Full Time
Job Reference #: 086d853c-eae2-4140-a709-3783bdfd1d31
Keywords: engineer

Job Description

San Francisco / Supercomputing

Join us in building some of the largest AI supercomputing clusters in the world! You will manage and scale the company's supercomputers (powered by Kubernetes), build our research platform, and work on cross-functional projects to accelerate progress at the cutting edge of AI research. We operate at the limits of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud and containerized environment.

We recently launched our newest cluster, “Owl,” with over 250K cores, 10K GPUs, and 400 Gbps of networking per node. This would place it in the top 5 of the TOP500 list of the world's supercomputers. See this blog post to get a sense of the kinds of challenges we solve in our day-to-day work.

In this role, you will work closely with machine learning researchers, but you don't need to be a machine learning expert yourself. We value people who can quickly gain a deep technical understanding of new domains and who enjoy being self-directed and identifying the most important problems to solve. Experience with high-performance computing or open-source contributions is a bonus.

We believe that increasing compute is a huge lever for AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and, in turn, produce unprecedented results.

We look for a blend of:

Experience designing, implementing, and running production services
Comfort managing and monitoring large-scale infrastructure deployments
Willingness to debug problems across the stack, such as networking issues, performance problems, or memory leaks
Ownership of problems end to end, and willingness to pick up whatever knowledge you're missing to get the job done

We estimate that someone with 3-5+ years of experience as a software or DevOps engineer will be able to contribute quickly to our challenges.

You might enjoy this work if you:

Know your way around Bash, Terraform, Python, and/or Chef
Have experience running large Kubernetes clusters with GPU workloads, in the range of 500-1000 nodes
Can design a highly-available distributed system
Have helped a team mature with standardized tools and processes around stability, observability, and scaling
Have worked with highly performant bare-metal systems
Have experience working with Azure

About OpenAI

We’re building safe Artificial General Intelligence (AGI), and ensuring it leads to a good outcome for humans. We believe that unreasonably great results are best delivered by a highly creative group working in concert. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

This position is subject to a background check for any convictions directly related to its duties and responsibilities. Only job-related convictions will be considered and will not automatically disqualify the candidate. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.


Benefits

- Health, dental, and vision insurance for you and your family
- Unlimited time off (we encourage 4+ weeks per year)
- Parental leave
- Flexible work hours
- Lunch and dinner each day
- 401(k) plan
