
Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
Senior AI Infrastructure Solutions Engineer-Data Center GPU
THE TEAM:
Join AMD's Datacenter GPU team, a dynamic group of engineers and innovators dedicated to building cutting-edge AI infrastructure for the largest-scale AI inference and training workloads. Our team collaborates closely with product management, hardware partners, and customers to design and deploy transformative AI datacenter solutions.
THE ROLE:
The AMD Datacenter GPU team is seeking an experienced Solutions Engineer focused on enabling very large clusters for AI inference and training workloads. The ideal candidate will be a technical expert in Kubernetes-based AI infrastructure with deep knowledge of datacenter-level solutions for AI inference and training. This role offers the opportunity to work at the cutting edge of AI infrastructure, solving complex technical challenges and helping customers implement transformative AI solutions at scale.
THE PERSON:
The ideal candidate is a hands-on technical expert with deep experience in Kubernetes and container orchestration for AI workloads. They will have a strong background in datacenter networking, storage, and GPU-accelerated environments, with a proven ability to translate complex technical concepts into practical, scalable solutions. They bring excellent communication and collaboration skills, are comfortable interfacing with customer engineering teams and internal partners, and have a passion for solving challenging problems in AI infrastructure and infrastructure automation.
KEY RESPONSIBILITIES:
- Design, test, and validate reference architectures for large-scale AI training and inference clusters.
- Develop comprehensive tools for AI training to enable efficient cluster management.
- Create detailed reference documentation and implementation guides for customers and internal teams.
- Serve as the primary technical interface with customer engineering teams during deployment planning.
- Conduct proof-of-concept implementations to validate designs in real-world scenarios.
- Evaluate and benchmark performance of various infrastructure configurations.
- Provide expert guidance on optimizing Kubernetes for AI workloads at scale.
- Collaborate with product management to influence roadmap based on customer requirements.
- Maintain deep technical expertise in emerging AI infrastructure technologies.
- Coordinate customer requirements gathering and work with the relevant Technical Program Management counterpart to arrive at a deployment plan.
- Create comprehensive, tested reference architectures that accelerate customer deployments.
- Drive interoperability testing and validation with our hardware and software partners, and lead implementation of reference datacenter solutions at our CSP partners.
- Develop automation tools that significantly reduce deployment complexity.
- Establish yourself as a trusted advisor to customer technical teams.
- Contribute to increased win rates through technical credibility and expertise.
- Provide regular feedback that improves our product roadmap and offerings.
PREFERRED EXPERIENCE:
- Several years of experience designing and implementing large-scale infrastructure solutions.
- Deep knowledge of Kubernetes and container orchestration technologies.
- Hands-on experience with AI/ML workloads in production environments.
- Strong understanding of datacenter networking and storage architectures.
- Experience with GPU-accelerated computing environments.
- Proven track record of creating technical documentation and reference architectures.
- Network design for high-throughput GPU clusters.
- Storage architectures optimized for AI data pipelines.
- Infrastructure automation and orchestration tools.
- Performance optimization for large-scale inference deployments.
- Excellent communication skills with the ability to explain complex technical concepts.
- Experience working directly with customer technical teams.
- Experience with Ray, PyTorch, and HPC-optimized schedulers for Kubernetes-based AI training.
- Hands-on experience with SLURM or similar HPC schedulers.
- Knowledge of infrastructure-as-code tools (Terraform, Ansible, etc.).
- Familiarity with cloud-native observability and monitoring solutions.
- Understanding of security considerations for AI infrastructure.
- Background in performance tuning for GPU-accelerated workloads.
- Experience creating automation tools for infrastructure deployment.
EDUCATION:
- Bachelor's degree in Computer Science, Engineering, or related field.
- Advanced degree preferred.
#LI-EV1
#LI-HYBRID
Benefits offered are described: AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.