Director, Network Engineering
About Fluidstack

We build and operate high-performance GPU clusters so the most ambitious teams can move fast, stay focused, and scale without friction. Our clusters power top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is highly motivated and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

As Head of Networking, you will lead the architecture, design, and operations of the network services that power our AI infrastructure platform. In this role, you will architect networks that move packets for frontier AI models while ensuring maximum reliability and performance through extensive automation.

You will build and lead a world-class networking team ranging from junior network engineers eager to learn high-performance computing, to senior architects who have scaled networks at hyperscalers, to specialized engineers with deep expertise in RDMA/InfiniBand for AI workloads. Your team will span network operations, architecture, automation engineering, and performance optimization roles.
You'll be responsible for hiring, mentoring, and developing this team while establishing a culture of technical excellence and continuous learning.

Focus

- Build networks that scale beyond hundreds of thousands of GPUs.
- Collaborate with compute, storage, security, and data center teams to deliver integrated infrastructure solutions.
- Build and lead a team of network engineers and architects focused on performance, reliability, and automation.
- Automate everything. Manual processes kill velocity. Build systems that configure themselves, heal themselves, and optimize themselves. Drive automation initiatives across service deployment, provisioning, and lifecycle management.
- Design scalable network architectures supporting clusters from 2,000 to 200,000 GPUs.
- Optimize traffic patterns for AI/ML training workloads and high-performance computing.
- Lead the design and implementation of scalable, high-performance network architectures supporting GPU clusters and AI workloads.
- Establish comprehensive monitoring, alerting, and incident response procedures. Create remediation systems that detect and resolve issues before customer impact.
- Lead root cause analysis and implement preventive measures for network incidents.
- Ensure network reliability, security, and performance meet the demanding requirements of AI supercomputing workloads.
- Ensure compliance with data sovereignty and regulatory requirements.

About You

- 10+ years of experience designing and operating large-scale network infrastructure.
- 5+ years in leadership roles at cloud providers, hyperscalers, or technology companies.
- Deep expertise in software-defined networking, routing protocols, and distributed network design.
- Proven track record scaling networks for high-throughput, low-latency workloads.
- Experience with AI/ML infrastructure and GPU cluster networking (RoCE / InfiniBand).
- Deep understanding of internet routing, switching, peering, and distributed network design.
- Expert knowledge of routing protocols (BGP, EVPN), TCP/IP, and network services (DHCP, DNS).
- Proven track record of designing and operating large-scale, high-performance networks in cloud or datacenter environments.
- Strong knowledge of automation frameworks (e.g., Ansible, Terraform) and infrastructure-as-code principles.
- Experience offloading services into smart NICs and working with hardware acceleration technologies.
- Excellent communication skills, with the ability to influence technical strategy across organizations.
- Experience with monitoring stacks (Prometheus, Grafana) and observability best practices.

Nice to haves

- Contributions to open-source networking projects.
- Experience with network source of truth platforms (e.g., NetBox, Nautobot) and integrating them with automation workflows.
- Familiarity with Kubernetes networking, overlay networks, and container networking solutions.

Benefits

- Competitive total compensation package (cash + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy, in line with local norms.

Fluidstack is an Equal Employment Opportunity Employer.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
Location: San Francisco, CA, United States