Always-on, live observability for Multi-GPU PyTorch training (DDP)
📋 User Survey (2 min): https://forms.gle/KwPSLaPmJnJjoVXSA
TraceML is a lightweight runtime observability tool for PyTorch DDP training (currently single-node, multi-GPU).
It surfaces step-level, rank-aware signals while your job runs, without turning on heavy profilers. It answers
“What’s happening inside my training step right now — and is a particular rank behaving worse than the rest?”
If your run is healthy, TraceML should say so.
The terminal view updates continuously during training and shows:
- Step time, dataloader fetch time, and GPU memory
- Median vs worst rank (to spot imbalance / stragglers)
- System signals (CPU, RAM, GPU) alongside training signals
This is the primary interface, designed to stay open next to your training logs.
The web dashboard mirrors the same signals in a browser:
- Interactive charts over recent steps
- Rank-aware comparisons
- Useful for exploration and longer-running jobs
The web UI is read-only and reflects exactly what TraceML computes during training.
Both views are driven by the same runtime signals and update live, step by step.
Deep learning training becomes a black box as you scale beyond a single GPU.
Typical pain:
- Steps get slow / unstable and it’s unclear if the cause is input, compute, sync/comm, or optimizer work
- “It’s slower on 8 GPUs than 1 GPU” and you don’t know which rank or which part is lagging
- OOMs and crashes with little context for “where did it happen?”
- Full profilers are powerful, but often too intrusive to keep enabled in real training
TraceML is designed to be live: show the minimum useful truth during real runs.
When you wrap your iteration with trace_step(), TraceML tracks step-scoped signals and summarizes them across ranks:
- Dataloader fetch time
- Step time (GPU-aware via CUDA events without sync)
- GPU memory (allocated + peak)
Across ranks, TraceML reports:
- Median (typical behavior)
- Worst (the slowest / highest-memory rank)
This helps you spot rank imbalance / straggler-like behavior early.
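To make these signals concrete, here is a minimal, hedged sketch of the underlying ideas: timing one step with CUDA events instead of a blocking device-wide synchronize, then gathering the per-rank timings to report median vs worst. This is an illustration only, not TraceML's implementation; it assumes the default process group is already initialized and that each rank owns one GPU, and all variable names are placeholders.

```python
import torch
import torch.distributed as dist

# Illustrative sketch only (not TraceML internals): time one training step
# with CUDA events rather than torch.cuda.synchronize(), then summarize the
# per-rank step times as median (typical) / worst (straggler).

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
# ... forward / backward / optimizer.step() for one training step ...
end_evt.record()

# elapsed_time() needs the end event to have completed. Synchronizing on the
# event alone is cheaper than a device-wide sync, and the read can be deferred
# to a later step once end_evt.query() returns True.
end_evt.synchronize()
step_ms = start_evt.elapsed_time(end_evt)

# Gather every rank's step time and compare the typical rank to the worst rank.
world_size = dist.get_world_size()
local = torch.tensor([step_ms], device="cuda")
gathered = [torch.empty_like(local) for _ in range(world_size)]
dist.all_gather(gathered, local)

times = torch.stack(gathered).flatten()
median_ms = times.median().item()  # typical behavior
worst_ms = times.max().item()      # slowest rank this step
```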
In Deep-Dive mode, TraceML installs model hooks to give more context around failures:
- Shows per-layer memory and timing (worst across all ranks)
- Helps identify where an OOM/crash happened (forward/backward region and the most suspicious layer signals)
- Experimental and evolving — meant to be a practical debugging aid, not a formal profiler
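For intuition, the sketch below shows the general hook-based idea: register forward hooks on leaf modules, record per-layer memory deltas, and remember the last layer entered so a crash or OOM mid-step has some context. This is an illustration of the technique, not TraceML's internals; `layer_stats`, `last_layer_seen`, and `attach_layer_hooks` are hypothetical names.

```python
import torch
import torch.nn as nn

# Illustration of the hook-based idea only — not TraceML's internals.
layer_stats = {}        # hypothetical container for per-layer signals
last_layer_seen = None  # last layer entered; useful if the step dies mid-forward


def attach_layer_hooks(model: nn.Module):
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # instrument leaf modules only

        def pre_hook(mod, inputs, name=name):
            global last_layer_seen
            last_layer_seen = name
            layer_stats.setdefault(name, {})["mem_before"] = torch.cuda.memory_allocated()

        def post_hook(mod, inputs, output, name=name):
            stats = layer_stats[name]
            stats["mem_delta"] = torch.cuda.memory_allocated() - stats["mem_before"]

        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)
```

Backward-side signals can be collected the same way with register_full_backward_hook; in TraceML, trace_model_instance() is the supported way to enable this kind of instrumentation rather than managing hooks yourself.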
TraceML is not a profiler replacement or an auto-tuner.
- It does not replace Nsight / PyTorch Profiler
- It does not automatically fix batch size or optimizer settings
- It will not always “find a problem”
TraceML currently supports:
- 🖥️ Terminal dashboard — live updates in your console (Rich UI)
- 🌐 Web dashboard — local browser UI at http://localhost:8765
Notebook view is temporarily disabled.
TraceML provides two profiles so you can trade insight against overhead.
ESSENTIAL is designed for continuous use during real training. It tracks:
- Dataloader fetch time
- Step time (GPU-aware)
- Step GPU memory (allocated + peak)
- System metrics (CPU/RAM/GPU)
DEEP-DIVE is designed for investigating slowdowns and failures. It includes everything in ESSENTIAL, plus:
- Per-layer memory signals
- Per-layer forward/backward timing signals
- Lightweight failure attribution via hooks (experimental)
For development:
git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'
pre-commit install
Requirements: Python 3.9–3.13, PyTorch 1.12+
Platform: macOS (Intel/ARM), Linux
Training support: Single GPU + **single-node DDP**
TraceML’s core signals are computed inside trace_step().
from traceml.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
Without trace_step():
- Step timing is not computed
- Step memory is not recorded
- Live dashboards won’t update meaningfully
from traceml.decorators import trace_model_instance
trace_model_instance(model)
Use this together with trace_step(model) to enable hook-based deep signals:
- layer-level memory/timing
- experimental failure attribution
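Putting the two together, a Deep-Dive setup looks like the ESSENTIAL loop above with one extra call before training starts (sketch; model, dataloader, criterion, and optimizer are your own objects):

```python
from traceml.decorators import trace_model_instance, trace_step

# Register Deep-Dive hooks once, before the training loop.
trace_model_instance(model)

for batch in dataloader:
    # Step-scoped signals plus per-layer memory/timing from the hooks above.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```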
The @trace_time decorator and region-level user timers have been removed for now.
TraceML is focusing on step-level semantics plus optional Deep-Dive hooks.
traceml run train.py --nproc-per-node=2
You’ll see a live terminal dashboard showing:
- System resources (CPU/RAM/GPU)
- Dataloader fetch time, step time, step GPU memory
- (Deep-Dive) per-layer signals + failure attribution hints
traceml run train.py --nproc-per-node=2 --mode=dashboard
Opens http://localhost:8765 with interactive charts and live updates.
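Both commands launch the same script. A minimal train.py might look like the hedged sketch below; it assumes the launcher spawns one process per GPU and provides torchrun-style environment variables (RANK, LOCAL_RANK, WORLD_SIZE), and the model, data, and hyperparameters are placeholders to swap for your own.

```python
# train.py — hedged sketch, not an official example.
# Assumes torchrun-style env vars (RANK, LOCAL_RANK, WORLD_SIZE) are provided.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

from traceml.decorators import trace_step


def main():
    dist.init_process_group(backend="nccl")  # reads rank/world size from the environment
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model, loss, and optimizer — replace with your training code.
    model = DDP(nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(100):
        # Synthetic batch; a real script would use a DataLoader + DistributedSampler.
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")

        with trace_step(model):  # step-scoped signals for this rank
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```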
Near-term:
- Single-node DDP hardening: reduce overhead, improve step alignment accuracy, improve collector/UI performance
- Run logging to disk: per-run artifacts + compact run summaries
- Compatibility & failure modes: validate behavior for common training patterns:
- gradient accumulation
torch.compile- cases that bypass typical hooks / patch points
- Documentation: clearer docs, examples, and “known limitations” page
- Accelerate / Lightning wrappers
Next:
- Multi-node DDP
- FSDP: shard-aware aggregation + imbalance signals (initial support)
Later:
- TP / PP: multi-process-group + mesh/stage-aware attribution
Contributions are welcome.
- ⭐ Star the repo
- 🐛 Report bugs via GitHub Issues
- 💡 Request features / workloads you want supported
- 🔧 Submit PRs (small focused PRs are ideal)
When opening an issue, please include:
- a minimal repro script
- hardware + CUDA + PyTorch versions
- the profile used (ESSENTIAL vs DEEP-DIVE)
- whether you ran single GPU or DDP
Stars help more teams find the project. 🌟
TraceML is released under the Apache 2.0 License.
See LICENSE for details.
If TraceML helps your research, please cite:
@software{traceml2024,
author = {TraceOpt AI},
title = {TraceML: Real-time Training Observability for PyTorch},
year = {2024},
url = {https://github.com/traceopt-ai/traceml}
}
Made with ❤️ by TraceOpt AI

