After: compact debugging brief
Debug CUDA OOM.
Type:
- CUDA / PyTorch OOM
Keep:
- cmd: python train.py --model llama-7b --batch-size 64 --precision fp16
- frame: /workspace/train.py:214 -> loss = trainer.step(batch)
- error: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB.
- detail: model=llama-7b; batch-size=64; precision=fp16
- memory: GPU0 23.65 GiB total; 1.08 GiB free; 21.40 GiB allocated; 22.49 GiB reserved
Folded:
- duplicate worker INFO, progress bar, wandb x2, NCCL x2, torch internals
Ask: root cause, smallest fix, verify command.
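
Note: the memory line above can be sanity-checked with a little arithmetic. The sketch below is illustrative only; the helper names `parse_memory` and `headroom` are hypothetical and not part of train.py or PyTorch. It parses the figures quoted in the brief and compares the 1.25 GiB request against free memory and against the reserved-but-unallocated cache. Whether the allocator could actually serve the request from that cache is an open question (it may be fragmented), so treat the result as a hint, not a diagnosis.

```python
import re

# Values copied verbatim from the brief; nothing here queries a real GPU.
MEMORY_LINE = "GPU0 23.65 GiB total; 1.08 GiB free; 21.40 GiB allocated; 22.49 GiB reserved"
REQUEST_GIB = 1.25  # from "Tried to allocate 1.25 GiB"


def parse_memory(line):
    """Return {label: GiB} for each '<number> GiB <label>' pair in the line."""
    pairs = re.findall(r"([\d.]+) GiB (\w+)", line)
    return {label: float(value) for value, label in pairs}


def headroom(mem, request_gib):
    """Return (reserved-but-unallocated GiB, shortfall of the request vs. free GiB)."""
    cached_unused = mem["reserved"] - mem["allocated"]
    shortfall = request_gib - mem["free"]
    return round(cached_unused, 2), round(shortfall, 2)


mem = parse_memory(MEMORY_LINE)
cached_unused, shortfall = headroom(mem, REQUEST_GIB)
print(f"reserved but unallocated: {cached_unused} GiB")
print(f"request exceeds free memory by: {shortfall} GiB")
```

With the numbers in this brief, the request is larger than both the free memory and the unused cache, which is consistent with the reported error.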