MedQA fine-tuned on AMD ROCm shows clinical AI can run without CUDA
At a glance:
- LoRA‑adapted Qwen3‑1.7B was fine‑tuned on the AMD Instinct MI300X in about five minutes, using only ROCm and no CUDA dependencies.
- Training used 2,000 MedMCQA samples in a full‑precision fp16 run on 192 GB of HBM3 memory; the un‑adapted base model scores roughly 45% on MedMCQA.
- The entire pipeline – data loading, adapter export, and inference – runs on ROCm, and the adapter is publicly available on HuggingFace Hub.
Why AMD ROCm matters
The AMD Instinct MI300X packs a staggering 192 GB of HBM3 memory in a single device. For large‑language‑model fine‑tuning, VRAM is often the bottleneck that forces developers to resort to aggressive quantisation (4‑bit or 8‑bit) or tiny batch sizes. With the MI300X’s memory headroom the MedQA team was able to train the 1.7 billion‑parameter Qwen3‑1.7B model in full fp16, avoiding any quantisation tricks altogether. This not only simplifies the code path but also eliminates the quantisation‑induced artefacts that can hurt clinical reasoning.
Beyond raw memory, the experiment was designed to prove that the HuggingFace ecosystem – Transformers, PEFT, TRL and Accelerate – works out‑of‑the‑box on ROCm. The only ROCm‑specific changes were three environment variables:
os.environ["ROCR_VISIBLE_DEVICES"] = "0"
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "9.4.2"
No code modifications, custom kernels, or CUDA compatibility layers were required: the same training script that runs on an NVIDIA GPU can be dropped onto an AMD system and run unchanged.
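On ROCm builds of PyTorch the familiar torch.cuda API is backed by HIP, so the usual device checks work unchanged. A minimal sanity‑check sketch (the device index and error message are illustrative):

```python
import torch

# On ROCm builds of PyTorch, the torch.cuda namespace is backed by HIP,
# so the usual device queries work without modification.
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))    # e.g. AMD Instinct MI300X
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory / 1e9:.0f} GB")  # ~192 GB on the MI300X
else:
    raise RuntimeError("No GPU visible; check the ROCR/HIP variables set above")
```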
Dataset and training setup
The fine‑tuning used the MedMCQA dataset, a collection of multiple‑choice questions drawn from Indian medical entrance exams that mimic USMLE‑style reasoning. Each entry contains a clinical question, four answer options (A–D), the correct answer index, and an optional free‑text explanation. For the hackathon demo the authors deliberately limited themselves to 2,000 training samples to keep the run short and reproducible.
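The article does not show the loading code, but a sketch of how such a 2,000‑sample slice can be pulled with the datasets library might look like the following; the Hub id openlifescienceai/medmcqa and the shuffle seed are assumptions, while the field names (question, opa–opd, cop, exp) follow MedMCQA's published schema:

```python
from datasets import load_dataset

# MedMCQA schema: question, four options (opa-opd), correct option index (cop),
# and an optional free-text explanation (exp).
dataset = load_dataset("openlifescienceai/medmcqa", split="train")

# Small, reproducible slice for the hackathon-style run.
subset = dataset.shuffle(seed=42).select(range(2000))

sample = subset[0]
print(sample["question"], sample["opa"], sample["cop"], sample["exp"])
```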
Training was orchestrated with the standard HuggingFace Trainer API. Key arguments included two epochs, a physical batch size of four, gradient accumulation of four steps (effective batch size = 16), a learning rate of 2e‑4, and fp16 precision. Gradient checkpointing was enabled to trade compute for memory, even though the MI300X could have handled the full graph without it. The cosine learning‑rate schedule with a 5 % warm‑up further smoothed convergence for the brief five‑minute run.
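As a sketch, those hyperparameters map onto the standard TrainingArguments roughly as follows; the output directory and logging cadence are illustrative, not taken from the article:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="medqa-qwen3-lora",      # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                  # 5% warm-up
    fp16=True,                          # full fp16, no quantisation
    gradient_checkpointing=True,        # trade compute for memory
    logging_steps=10,                   # illustrative
)
```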
Model and prompt design
The base model is Qwen/Qwen3‑1.7B, Alibaba’s compact yet capable 1.7 billion‑parameter language model. It loads cleanly with trust_remote_code=True and supports the full HuggingFace Transformers interface. LoRA (Low‑Rank Adaptation) was applied via the PEFT library, injecting trainable rank‑decomposition matrices into the attention layers while freezing the base weights. The LoRA configuration used r=8, lora_alpha=16, a dropout of 0.05, and targeted the q_proj and v_proj modules, resulting in roughly 2.2 million trainable parameters – just 0.15 % of the full model.
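A sketch of the corresponding PEFT setup under those hyperparameters; loading details beyond trust_remote_code (such as torch_dtype) are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B", trust_remote_code=True, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # ~2.2M trainable, ~0.15% of all weights
```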
Prompt consistency was enforced with a fixed template that includes the question, options, and a placeholder for the answer and explanation. During training the model sees the full sequence, including the correct answer and its rationale; during inference the prompt stops at the ### Answer: marker and the model completes the answer letter and a concise clinical explanation. This design yields outputs that are not just a letter but a reasoning paragraph, which is essential for safety‑critical medical use‑cases.
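The exact template is not reproduced in the article, but a hypothetical reconstruction of the described structure, with ### Answer: as the inference stopping point, could look like this:

```python
# Hypothetical reconstruction of the fixed prompt template described above.
TEMPLATE = """### Question:
{question}

### Options:
A. {opa}
B. {opb}
C. {opc}
D. {opd}

### Answer:"""

def build_training_text(sample: dict) -> str:
    """Training sequence: prompt, correct letter, then the clinical explanation."""
    letter = "ABCD"[sample["cop"]]
    explanation = sample.get("exp") or ""
    return f"{TEMPLATE.format(**sample)} {letter}\n\n### Explanation:\n{explanation}"

def build_inference_prompt(sample: dict) -> str:
    """Inference prompt stops at '### Answer:'; the model completes letter + rationale."""
    return TEMPLATE.format(**sample)
```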
Training results and performance
The experiment produced concrete metrics that illustrate the hardware advantage. Trainable parameters: ~2.2 M (0.15 % of total). Training time on the MI300X: ~5 minutes for the 2,000‑sample slice. Baseline MedMCQA accuracy on the un‑adapted Qwen3‑1.7B sits around 45 %; the fine‑tuned adapter pushes the model to produce correct answer letters with accompanying explanations, though the article does not quote a post‑fine‑tune accuracy figure. The framework stack was PyTorch + ROCm 6.1, confirming that the open‑source stack runs natively on AMD hardware without any NVIDIA‑specific patches.
Challenges and fixes
Even a smooth‑running ROCm pipeline encountered hiccups. The team documented four primary failure modes and their remedies:
- NaN loss – caused by mixed‑precision instability; resolved by switching from bfloat16 to fp16.
- GPU not detected – missing ROCm environment variables; fixed by setting ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, and HSA_OVERRIDE_GFX_VERSION.
- bitsandbytes unsupported – no ROCm build exists; the team dropped quantisation entirely, relying on the MI300X's ample VRAM.
- Garbage inference output – tokenizer padding misconfiguration; solved by setting pad_token = eos_token and correcting padding_side (a minimal sketch of this fix follows below).

Additional trainer evaluation errors were traced to a Transformers version mismatch and were solved by pinning transformers>=4.40.0.
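A minimal sketch of the tokenizer padding fix; the choice of left padding here is an assumption, since the article only states that padding_side had to be corrected:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B", trust_remote_code=True)

# Qwen tokenizers ship without a dedicated pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# Padding side matters for causal LMs during batched generation; left padding
# is the common choice, but the exact value used by the team is not stated.
tokenizer.padding_side = "left"
```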
Looking ahead
The authors see this proof‑of‑concept as a springboard for larger‑scale medical AI on AMD platforms. Planned next steps include training on the full MedMCQA corpus (~180 k questions) plus PubMedQA, adding calibrated confidence scores, integrating Retrieval‑Augmented Generation (RAG) to ground answers in up‑to‑date literature, and building a robust evaluation harness that benchmarks held‑out accuracy beyond the training split. The open‑source adapter is already hosted on HuggingFace (HK2184/medqa-qwen3-lora) and can be merged into the base model for downstream applications.
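Pulling the published adapter and merging it into the base model takes only a few lines with PEFT; in this sketch the generation settings and the sample question are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B", trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B", trust_remote_code=True)

# Attach the published LoRA adapter, then fold it into the base weights.
model = PeftModel.from_pretrained(base, "HK2184/medqa-qwen3-lora")
model = model.merge_and_unload()

prompt = (
    "### Question:\nWhich vitamin deficiency causes scurvy?\n\n"
    "### Options:\nA. Vitamin A\nB. Vitamin B12\nC. Vitamin C\nD. Vitamin D\n\n"
    "### Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```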
The project demonstrates that with a high‑memory AMD GPU, the entire fine‑tuning and inference workflow for a clinically useful LLM can be completed in minutes, without the CUDA‑centric assumptions that dominate most open‑source medical AI work.
FAQ
How long did the fine‑tuning process take on the AMD MI300X?
About five minutes for the 2,000‑sample MedMCQA slice, trained in full fp16 with no quantisation.
What environment variables are required to run HuggingFace training on ROCm?
ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, and HSA_OVERRIDE_GFX_VERSION; no other code changes are needed.
What were the main challenges encountered and how were they resolved?
NaN loss (switch from bfloat16 to fp16), an undetected GPU (set the ROCm environment variables), missing bitsandbytes ROCm support (drop quantisation), and garbage inference output (fix the tokenizer's pad token and padding side).