# Transformers.js Local Inference
Transformers.js runs ONNX models directly inside Wilson’s Bun process via the @huggingface/transformers v3 WASM backend. No API key. No external server. No Ollama dependency.
## How It Works

Unlike Ollama (which runs a separate server process), Transformers.js loads models in-process. The first run downloads model weights from HuggingFace Hub and caches them locally. Subsequent runs load from cache in under 2 seconds.
- **Runtime:** `@huggingface/transformers` v3, WASM backend
- **Cache location:** `~/.openaccountant/models/`
- **API key:** None required
- **External server:** None required
## Supported Models

### WebGPU Models

These models require `OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu` and Bun 1.2+.
| Model | Size | Description |
|---|---|---|
| onnx-community/granite-4.0-micro-ONNX-web | ~3B | IBM Granite Micro, tool-calling support |
| onnx-community/LFM2-1.2B-Tool-ONNX | ~1.2B | Purpose-built for tool use |
| onnx-community/granite-4.0-350m-ONNX-web | ~350M | IBM Granite, fast inference |
| onnx-community/Qwen3-0.6B-ONNX | ~0.6B | Qwen3 architecture |
### CPU/WASM Models

These models work on any hardware. No GPU required.
| Model | Size | Description |
|---|---|---|
| HuggingFaceTB/SmolLM3-3B-ONNX | ~2 GB | 92.3% BFCL score, best CPU option |
| onnx-community/Qwen2.5-1.5B-Instruct | ~900 MB | Solid instruction following |
## Configuration

Set the compute device via environment variable:
```sh
# Default — runs on CPU via WASM (broadest compatibility)
OPENACCOUNTANT_TRANSFORMERS_DEVICE=cpu

# GPU-accelerated — requires Bun 1.2+ and a WebGPU-capable system
OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu
```
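As a sketch, this setting might be resolved at startup like so. The environment variable name and the `cpu` default come from this page; the helper itself is illustrative, not Wilson's actual startup code:

```typescript
type Device = "cpu" | "webgpu";

// Read OPENACCOUNTANT_TRANSFORMERS_DEVICE, defaulting to CPU/WASM.
// The validation logic is an illustrative assumption.
function resolveDevice(env: Record<string, string | undefined> = process.env): Device {
  const value = env.OPENACCOUNTANT_TRANSFORMERS_DEVICE ?? "cpu";
  if (value !== "cpu" && value !== "webgpu") {
    throw new Error(`Unsupported device "${value}" (expected "cpu" or "webgpu")`);
  }
  return value;
}
```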
## Selecting a Model

Switch interactively with `/model`:
```sh
/model transformers:HuggingFaceTB/SmolLM3-3B-ONNX
```

Or set it as the default in `.env`:
```sh
DEFAULT_MODEL=transformers:HuggingFaceTB/SmolLM3-3B-ONNX
```
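The `transformers:` prefix in these ids selects the provider. Splitting such an id can be sketched as follows; the `provider:model` format appears on this page, but the function is a hypothetical illustration, not Wilson's actual parser:

```typescript
// Split "transformers:HuggingFaceTB/SmolLM3-3B-ONNX" into provider and model.
// Only the first ":" separates the two, since model ids may contain "/"
// but the provider prefix never does.
function parseModelId(id: string): { provider: string; model: string } {
  const sep = id.indexOf(":");
  if (sep < 1 || sep === id.length - 1) {
    throw new Error(`Expected "provider:model", got "${id}"`);
  }
  return { provider: id.slice(0, sep), model: id.slice(sep + 1) };
}
```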
## Pre-Downloading Models

Download a model before first use to avoid the wait during inference:
```sh
/pull transformers:HuggingFaceTB/SmolLM3-3B-ONNX
```

Wilson downloads all model files from HuggingFace Hub and caches them to `~/.openaccountant/models/`. After this, the model loads from cache in under 2 seconds.
## Tool Calling

Transformers.js models use prompt injection for tool calling. Wilson injects the available tool schemas into the system prompt and parses the model’s response for tool call blocks:
```xml
<tool_call>{"name": "search_transactions", "arguments": {"query": "groceries"}}</tool_call>
```

This works reliably for single-tool calls — asking the model to categorize a transaction or look up spending. It becomes unreliable for multi-tool chains where the model needs to call several tools in sequence.
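Extracting such blocks can be sketched with a single regex pass. The `<tool_call>` format comes from this page; the function name and the skip-on-malformed-JSON behavior are illustrative assumptions:

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Pull every <tool_call>…</tool_call> block out of a model response and
// parse its JSON payload. Malformed blocks are skipped rather than thrown,
// since small models sometimes emit broken JSON.
function parseToolCalls(response: string): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const match of response.matchAll(/<tool_call>([\s\S]*?)<\/tool_call>/g)) {
    try {
      calls.push(JSON.parse(match[1]) as ToolCall);
    } catch {
      // skip malformed block
    }
  }
  return calls;
}
```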
## Limitations

- **Model quality.** Small models (0.5B–3B) are far less capable than Claude or GPT for complex reasoning.
- **First-run download.** Models download from HuggingFace Hub on first use (~270 MB–2 GB depending on the model). Use `/pull` to pre-download.
- **Tool calling reliability.** Prompt-injected tool calling works for simple single-tool calls. Complex multi-tool agent tasks often fail.
- **No streaming.** Responses are returned in full after generation completes.
## When to Use Transformers.js

| Use Case | Transformers.js | Better Alternative |
|---|---|---|
| Offline, no internet | Yes | — |
| Simple Q&A about your data | Yes | — |
| Transaction categorization | Yes | — |
| Multi-tool agent tasks | No | Ollama or cloud provider |
| Complex financial analysis | No | Cloud provider |
| Fine-tuned model deployment | No | Ollama |
## Transformers.js vs Ollama

| | Transformers.js | Ollama |
|---|---|---|
| Setup | Zero config, auto-downloads | Install Ollama, start server, pull model |
| Server process | None (in-process) | Separate server on port 11434 |
| API key | None | None |
| Model format | ONNX | GGUF |
| GPU support | WebGPU (Bun 1.2+) | Metal, CUDA, ROCm |
| Model selection | ~6 ONNX models | Thousands of GGUF models |
| Tool calling | Prompt injection (unreliable for chains) | Native tool calling (model-dependent) |
| Fine-tuned models | Limited (ONNX format only) | Full support (GGUF + Modelfile) |
| Best for | Offline, simple tasks, no setup | Local inference with full model ecosystem |
## See Also

- LLM Providers — All supported providers
- Small Language Models — Ollama-based local models
- Model Management — `/model` and `/pull` commands