Streaming speech recognition running natively and in the browser. A pure Rust implementation of Mistral’s Voxtral Mini 4B Realtime model using the Burn ML framework.
The Q4 GGUF quantized path (2.5 GB) runs entirely client-side in a browser tab via WASM + WebGPU.
# Download model weights (~9 GB)
uv run --with huggingface_hub \
hf download mistralai/Voxtral-Mini-4B-Realtime-2602 --local-dir models/voxtral
# Transcribe an audio file (f32 SafeTensors path)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
--audio audio.wav --model models/voxtral
# Or use the Q4 quantized path (~2.5 GB)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
--audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json
# Build WASM package
wasm-pack build --target web --no-default-features --features wasm
# Generate self-signed cert (WebGPU requires secure context)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
-keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
-days 7 -nodes -subj "/CN=localhost"
# Start dev server
bun serve.mjs
Open https://localhost:8443, accept the certificate, and click Load from Server to download the model shards. Record from your microphone or upload a WAV file to transcribe.
Audio (16kHz mono)
→ Mel spectrogram [B, 128, T]
→ Causal encoder (32 layers, 1280 dim, sliding window 750)
→ Conv 4x downsample → Reshape [B, T/16, 5120]
→ Adapter [B, T/16, 3072]
→ Autoregressive decoder (26 layers, 3072 dim, GQA 32Q/8KV)
→ Token IDs → Text
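To make the shape bookkeeping concrete, here is an illustrative trace of the shapes above (dimensions are from the diagram; the intermediate stacking step is our reading of the 4x downsample + reshape, and the function is not crate code):

```rust
/// Trace tensor shapes through the pipeline for `b` clips of `t` mel frames.
/// Dimensions come from the diagram above; this helper is illustrative only.
fn shape_trace(b: usize, t: usize) -> Vec<(&'static str, Vec<usize>)> {
    vec![
        ("mel spectrogram", vec![b, 128, t]),
        ("causal encoder", vec![b, t, 1280]),
        ("conv 4x downsample", vec![b, t / 4, 1280]),
        ("stack 4 frames", vec![b, t / 16, 4 * 1280]), // 4 * 1280 = 5120
        ("adapter", vec![b, t / 16, 3072]),
    ]
}

fn main() {
    // e.g. a clip that produced 3000 mel frames
    for (stage, shape) in shape_trace(1, 3000) {
        println!("{stage:>20}: {shape:?}");
    }
}
```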
| | F32 (native) | Q4 GGUF (native + browser) |
|---|---|---|
| Weights | SafeTensors (~9 GB) | GGUF Q4_0 (~2.5 GB) |
| Linear ops | Burn tensor matmul | Custom WGSL shader (fused dequant + matmul) |
| Embeddings | f32 tensor (1.5 GiB) | Q4 on GPU (216 MB) + CPU bytes for lookups |
| Browser | No | Yes (WASM + WebGPU) |
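The CPU-side embedding lookups work on raw Q4_0 bytes. For reference, a minimal sketch of standard GGUF Q4_0 block dequantization (18 bytes per 32 weights); it uses the `half` crate for the f16 scale and is not the crate's actual code:

```rust
use half::f16; // half = "2" in Cargo.toml

/// Dequantize one GGUF Q4_0 block: a little-endian f16 scale `d` followed
/// by 16 bytes, where byte i packs weight i (low nibble) and weight i+16
/// (high nibble). Each weight is d * (nibble - 8).
fn dequant_q4_0_block(block: &[u8; 18]) -> [f32; 32] {
    let d = f16::from_le_bytes([block[0], block[1]]).to_f32();
    let mut out = [0.0f32; 32];
    for i in 0..16 {
        let byte = block[2 + i];
        out[i] = d * ((byte & 0x0F) as i32 - 8) as f32;
        out[i + 16] = d * ((byte >> 4) as i32 - 8) as f32;
    }
    out
}
```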
The upstream mistral-common library left-pads audio with 32 silence tokens (at 12.5 Hz). After the mel/conv/reshape pipeline, this covers only 16 of the 38 decoder prefix positions with silence — the remaining 22 contain actual audio. The f32 model handles this fine, but Q4_0 quantization makes the decoder sensitive to speech content in the prefix: audio that starts immediately with speech (mic recordings, clips with no leading silence) produces all-pad tokens instead of text.
We increase the left padding to 76 tokens, which maps to exactly 38 decoder tokens of silence and covers the full streaming prefix. See src/audio/pad.rs for details.
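The arithmetic: at 12.5 tokens per second and 16 kHz input, one token spans 1,280 samples, so 76 tokens of silence is 97,280 samples. A minimal sketch of the padding step (the real implementation lives in src/audio/pad.rs; the names here are illustrative):

```rust
const SAMPLE_RATE: usize = 16_000; // input audio rate (Hz)
const SAMPLES_PER_TOKEN: usize = SAMPLE_RATE * 2 / 25; // 16_000 / 12.5 = 1280
const LEFT_PAD_TOKENS: usize = 76; // upstream mistral-common uses 32

/// Prepend 76 tokens' worth of silence: 76 * 1280 = 97_280 samples.
fn pad_left(audio: &[f32]) -> Vec<f32> {
    let mut padded = vec![0.0f32; LEFT_PAD_TOKENS * SAMPLES_PER_TOKEN];
    padded.extend_from_slice(audio);
    padded
}
```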
Running a 4B model in a browser tab required solving five hard constraints:
- 2 GB allocation limit — `ShardedCursor` reads across multiple `Vec` buffers (sketch below)
- 4 GB address space — Two-phase loading: parse weights, drop reader, then finalize
- 1.5 GiB embedding table — Q4 embeddings on GPU + CPU-side row lookups
- No sync GPU readback — All tensor reads use `into_data_async().await`
- 256 workgroup invocation limit — Patched cubecl-wgpu to cap reduce kernel workgroups
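A sketch of the first idea: a cursor that presents several sub-2 GB buffers as one contiguous, seekable byte stream. The real `ShardedCursor` may differ in detail:

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Illustrative ShardedCursor: several buffers, one logical stream.
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    pos: u64,
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Walk the shards to find the one containing `pos`, then copy from it.
        let mut offset = self.pos;
        for shard in &self.shards {
            let len = shard.len() as u64;
            if offset < len {
                let start = offset as usize;
                let n = buf.len().min(shard.len() - start);
                buf[..n].copy_from_slice(&shard[start..start + n]);
                self.pos += n as u64;
                return Ok(n);
            }
            offset -= len;
        }
        Ok(0) // read past the end
    }
}

impl Seek for ShardedCursor {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        let total: u64 = self.shards.iter().map(|s| s.len() as u64).sum();
        let new = match from {
            SeekFrom::Start(p) => p as i64,
            SeekFrom::End(d) => total as i64 + d,
            SeekFrom::Current(d) => self.pos as i64 + d,
        };
        if new < 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "seek before start"));
        }
        self.pos = new as u64;
        Ok(self.pos)
    }
}
```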
# Native (default features: wgpu + native-tokenizer)
cargo build --release
# With all features
cargo build --release --features "wgpu,cli,hub"
# WASM
wasm-pack build --target web --no-default-features --features wasm
| Feature | Description |
|---|---|
| `wgpu` (default) | GPU backend via Burn/CubeCL (WebGPU, Vulkan, Metal) |
| `native-tokenizer` (default) | Tekken tokenizer (C deps, not WASM-compatible) |
| `wasm` | Browser support: wasm-bindgen, WebGPU device init, JS bindings |
| `cli` | CLI binary with clap + indicatif |
| `hub` | HuggingFace Hub model downloads |
# Unit + integration tests (107 tests)
cargo test --features "wgpu,cli,hub"
# Lint
cargo clippy --features "wgpu,cli,hub" -- -D warnings
cargo clippy --no-default-features --features wasm --target wasm32-unknown-unknown -- -D warnings
# E2E browser test (requires Playwright + model shards)
bunx playwright test tests/e2e_browser.spec.ts
The GGUF file must be split into shards of 512 MB or less to stay under the browser’s ArrayBuffer limit:
split -b 512m models/voxtral-q4.gguf models/voxtral-q4-shards/shard-
The dev server and E2E test discover shards automatically from models/voxtral-q4-shards/.
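For native tooling or tests, the shards can be reassembled in name order, since split's suffixes (shard-aa, shard-ab, ...) sort lexicographically. A sketch:

```rust
use std::{fs, io, path::PathBuf};

/// Read all shards from the directory in name order; the resulting buffers
/// can feed something like the ShardedCursor sketched earlier.
fn load_shards(dir: &str) -> io::Result<Vec<Vec<u8>>> {
    let mut paths: Vec<PathBuf> = fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.path()))
        .collect::<io::Result<_>>()?;
    paths.sort();
    paths.into_iter().map(fs::read).collect()
}
```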
src/
audio/ # Mel spectrogram, chunking, resampling, padding
models/ # F32 model: encoder, decoder, adapter, attention, RoPE, KV cache
gguf/ # Q4 GGUF: reader, loader, model, tensor, WGSL shader, tests
web/ # WASM bindings: VoxtralQ4, initWgpuDevice, async decode loop
tokenizer/ # Tekken tokenizer wrapper (native only)
bin/transcribe # CLI binary
web/ # Browser demo: index.html, worker.js, voxtral-client.js
tests/ # Integration tests + Playwright E2E spec
patches/ # cubecl-wgpu workgroup size fix for WebGPU
Apache-2.0