Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
About the Team
The Data Center GPU Power and Performance Attainment (PPA) Team is a hardware‑focused lab organization responsible for optimizing power, performance, and performance‑per‑watt across AMD's Data Center GPU products. The team works at the intersection of silicon, systems, firmware, and workloads, driving post‑silicon validation, power feature tuning, and product readiness for large‑scale AI and HPC deployments.
The Opportunity
Serve as the Platform PPA debug engineer for Instinct Datacenter GPUs, driving GPU-, system-, and board-level triage and resolution of power, performance, thermal, and VF issues. The work relies on lab reproduction and on HW/FW/SW telemetry and logs, in partnership with cross-functional teams.
The Person
An engineer with deep expertise in datacenter platform power/performance/thermal debug and optimization. Hands-on in the lab and effective across hardware, firmware/BIOS, and software teams, they use Linux logs and telemetry to drive issues to root-cause and closure.
What You'll Do
- Lead GPU/system/board-level debug of power, performance, VF, and thermal issues reported by internal teams and external customers.
- Analyze platform telemetry, Linux logs, and FW/BIOS signals to isolate failures that span hardware, firmware, and software.
- Coordinate across architecture, design, validation, software, and customer engineering to drive root-cause and closure.
- Develop and maintain debug methodologies and automation to accelerate root-cause analysis.
- Resolve systemic issues that impact power, perf/watt or performance targets, and validate improvements.
- Lead debug cadence (meetings, executive updates) to align stakeholders, communicate status/trends, and remove blockers.
- Partner with the extended team in Malaysia to ensure global debug coverage and continuity.
- Own customer escalations with the debug council/customer engineering, including test execution to confirm resolution and close issues.
- Use manufacturing screens/data and failure analysis (lab, manufacturing, field returns) to identify root cause and drive corrective actions.
- Mentor junior engineers on debug execution and best practices.
- Document debug findings, resolutions, and learnings to improve internal reuse and next-generation test plans.
Qualifications
- 8+ years of experience in silicon power, performance, and thermal characterization, debug, validation or customer engineering roles.
- Solid understanding of semiconductors, CPU/GPU architecture, and power management features, including power, thermal, VF, and performance aspects of design and validation.
- Experience with system/platform debug workflows and cross-functional issue triage across hardware, firmware/BIOS, and software.
- Hands-on experience with lab debug tools (e.g., logic analyzers, oscilloscopes, power monitors) and with server platform bring-up/triage involving high-speed I/O (e.g., PCIe/CXL), power delivery, and board-level sequencing.
- Proficiency in scripting (Python, Perl, shell) for automation, log parsing, and data analysis.
- Familiarity with firmware and low-level software interactions with hardware (including BIOS and BMC interfaces).
- Experience working with customer engineering and manufacturing teams.
- Excellent communication and documentation skills, including executive reporting and leading cross-domain meetings.
- Experience with HPC/AI workloads and GPU performance benchmarks in datacenter environments (boards, systems, racks, clusters); familiarity with AI-assisted analysis/debug tooling is a plus.
Why AMD?
- Work on industry‑leading AI and HPC platforms deployed at massive scale.
- Hands‑on ownership from silicon bring‑up through customer deployment.
- Highly collaborative, technically deep environment with strong career growth.
- Competitive compensation, benefits, and global career opportunities.
Academic Credentials
Bachelor's degree in Computer Engineering, Electrical Engineering, or Computer Science. MS preferred.
This role is not eligible for visa sponsorship.
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.