Associate Director, Sr Principal Systems Engineer - Princeton, NJ


ATTENTION MILITARY AFFILIATED JOB SEEKERS - Our organization works with partner companies to source qualified talent for their open roles. The following position is available to Veterans, Transitioning Military, National Guard and Reserve Members, Military Spouses, Wounded Warriors, and their Caregivers. If you have the required skill set, education requirements, and experience, please click the submit button and follow the next steps. Unless specifically stated otherwise, this role is "On-Site" at the location detailed in the job post.
Challenging. Meaningful. Life-changing. Those aren't words that are usually associated with a job. But working at Bristol Myers Squibb is anything but usual. Here, uniquely interesting work happens every day, in every department. From optimizing a production line to the latest breakthroughs in cell therapy, this is work that transforms the lives of patients, and the careers of those who do it. You'll get the chance to grow and thrive through opportunities uncommon in scale and scope, alongside high-achieving teams. Take your career farther than you thought possible.
Bristol Myers Squibb recognizes the importance of balance and flexibility in our work environment. We offer a wide variety of competitive benefits, services and programs that provide our employees with the resources to pursue their goals, both at work and in their personal lives.
Summary:
Bristol Myers Squibb is looking for an experienced Sr Principal Systems Engineer in HPC/AI infrastructure to work with our technology teams and various stakeholders to design, manage, and support cutting-edge HPC/AI infrastructure platforms to serve our community of researchers and scientists, who are using Machine Learning, Deep Learning, and High-Performance Computing every day to make groundbreaking discoveries.
Collaborating with cross-functional teams within BMS, the systems engineer will work with our teams to define and execute our HPC/AI roadmap for both on-premises datacenters and the cloud, provide guidance and technical expertise to senior research leaders and scientists, and build out standards and best-practice design principles to guide BMS' future roadmap.
Key areas of the role require strong knowledge and expertise in:
Software/Hardware Optimization, such as performance tuning for bespoke hardware, code refactoring, accelerated ML toolkits and libraries such as CUDA, and continuous integration of code and ML models.
Development Tools and Environment, such as Git, Linux and Python package management, PyTorch Lightning, containers, and Kubernetes.
Job/Scheduler Orchestration and Integration, including automating and integrating machine learning jobs with major resource schedulers such as Slurm, Grid Engine, AWS Batch, and AWS ParallelCluster to maximize throughput, performance, utilization, efficiency, and cost effectiveness for ML/AI training and prediction.
Datacenter/Colocation Operations, such as physical installation, networking or bespoke network fabrics, and an understanding of power and cooling (strongly preferred).
Vendor Outreach, including the ability to partner with leading vendors or partners to explore, experiment with, and pilot proof-of-concept studies that help bring in or deliver leading-edge, differentiating capabilities for BMS Research.
Additional Qualifications/Responsibilities
Requirements:
Strong experience working with and supporting HPC users, including scientists, data scientists, and/or developers
Strong working experience with container runtimes and container orchestration platforms, including Kubernetes, Docker, and/or Singularity
Strong operational, architecture, and troubleshooting experience with cluster managers and schedulers, ideally Slurm, though experience with other HPC schedulers is acceptable
Linux systems management and configuration management in an HPC environment
Expert troubleshooting skills with open source frameworks and libraries
Experience working with the NVIDIA software ecosystem and GPU-powered systems for Machine Learning and Deep Learning workloads (preferred)
Experience working with Deep Learning frameworks, libraries, and pipelines, either directly as a user or supporting researchers and/or data science users (preferred)
Experience working with parallel file systems for data storage strategies for large clusters (preferred)
Working knowledge of GPU profiling techniques (preferred)
The starting compensation for this job ranges from $170,000 to $195,000, plus incentive cash and stock opportunities (based on eligibility).
The starting pay rate takes into account characteristics of the job, such as required skills and where the job is performed.
Final, individual compensation will be decided based on demonstrated experience.
Location:
Princeton, NJ, United States
Category:
Computer And Mathematical Occupations
