Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
Lead DCGPU Performance Engineer
THE ROLE:
AMD is looking for an outstanding technical contributor to drive performance measurement and characterization of Data Center GPU (DCGPU) systems for AI workloads.
This role focuses on producing accurate, repeatable, and trustworthy performance data across a wide range of AI workloads, platforms, and configurations. The engineer will establish robust measurement methodologies and ensure consistency across environments, enabling reliable performance insights for engineering, product, and business decisions.
THE PERSON:
As a highly detail-oriented and data-driven DCGPU Performance Engineer, you will specialize in performance measurement at system and workload levels, ensuring that results are reproducible, comparable, and representative of real-world behavior.
You will define and enforce best practices for performance measurement across AI workloads, including training and inference, while accounting for system variability, configuration differences, and evolving software stacks. You are expected to become an expert user of internal performance tools and workflows, enabling efficient data collection and high-quality reporting.
The ideal candidate combines deep technical expertise with a strong sense of rigor and discipline in experimentation, validation, and reporting. You are expected to question results, validate assumptions, and continuously improve measurement infrastructure through close collaboration with tools teams.
KEY RESPONSIBILITIES:
Performance Measurement & Characterization
- Measure performance of DCGPU systems across AI workloads (training, inference, microbenchmarks)
- Ensure accurate capture of key metrics such as throughput, latency, efficiency, and scaling behavior
- Validate performance across different system configurations, software stacks, and runtime environments
Reproducibility & Methodology
- Define and enforce best practices for reproducible performance measurement
- Ensure experiments are repeatable across systems, teams, and time
- Establish controls for variables such as software versions, system configuration, and workload parameters
- Develop standardized methodologies for fair and consistent comparisons
Benchmarking & Workload Execution
- Execute AI workloads (LLMs, training, inference) with well-defined configurations
- Ensure consistency in workload setup, execution, and reporting across runs
- Maintain benchmark definitions and configuration baselines
- Support internal and competitive benchmarking efforts
Data Accuracy & Validation
- Cross-check results for anomalies, inconsistencies, and measurement errors
- Validate data using multiple methods (profiling tools, logs, counters, independent runs)
- Identify sources of measurement noise and variability and mitigate them
- Ensure published results are reliable, defensible, and aligned with methodology
Tooling & Measurement Infrastructure
- Develop and enhance tools for performance measurement, logging, and reporting
- Build automation for workload execution, result collection, and validation
- Enable standardized output formats for consistent analysis and reporting
- Improve measurement workflows to increase efficiency and reliability
Tools Expertise & Feedback Loop
- Become an expert user of internal performance tools to collect metrics, analyze results, and generate reports
- Work closely with tools and infrastructure teams to enable efficient and scalable measurement workflows
- Provide actionable feedback to tools teams to improve automation, usability, performance, and coverage
- Help drive adoption of standardized tools and workflows across the organization
Cross-Functional Collaboration
- Work with performance engineers, software teams, and system teams to align on measurement practices
- Provide trusted performance data to architecture, product, and business stakeholders
- Support performance deep dives and root-cause investigations when discrepancies arise
Reporting & Insights
- Generate clear, structured performance reports with documented methodology
- Ensure performance results are communicated with appropriate context, assumptions, and limitations
- Enable decision-making through reliable and high-quality data
PREFERRED EXPERIENCE:
- 8–12+ years of experience in performance measurement, benchmarking, or system characterization
- Strong understanding of performance measurement of complex SoCs (GPU experience is a plus) and AI workloads (training and inference)
- Experience with performance benchmarking methodologies and reproducibility practices
- Hands-on experience with profiling and measurement tools (rocprof, Nsight, perf, etc.)
- Experience with large-scale AI workloads (LLMs, distributed training, inference serving)
- Familiarity with system-level performance variability and benchmarking challenges
- Programming experience in Python, C/C++, or scripting for automation
- Strong analytical mindset with attention to detail and data validation
- Experience building or maintaining benchmarking frameworks or infrastructure is a plus
POSITION REQUIREMENTS:
- Proven experience in performance measurement and benchmarking for complex systems
- Strong focus on reproducibility, accuracy, and methodological rigor
- Experience running and analyzing AI workloads on GPU systems
- Familiarity with system configuration, software stack dependencies, and performance variability
- Ability to identify and resolve discrepancies in performance data
- Experience working with performance tools and influencing tooling improvements
- BS/MS in Computer/Electrical Engineering, Computer Science, or related field
- Excellent written and verbal communication skills, especially in documenting results and methodologies
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.