
Transformers.js Local Inference

Transformers.js runs ONNX models directly inside Wilson’s Bun process via the @huggingface/transformers v3 runtime (WASM backend by default). No API key. No external server. No Ollama dependency.

Unlike Ollama (which runs a separate server process), Transformers.js loads models in-process. The first run downloads model weights from HuggingFace Hub and caches them locally. Subsequent runs load from cache in under 2 seconds.

  • Runtime: @huggingface/transformers v3, WASM backend
  • Cache location: ~/.openaccountant/models/
  • API key: None required
  • External server: None required
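
For illustration, here is a minimal sketch of what in-process loading looks like with the v3 API. The env.cacheDir assignment mirrors Wilson’s cache location, but how Wilson wires this internally is an assumption:

// Hedged sketch: load a supported ONNX model in-process and generate text.
import { pipeline, env } from "@huggingface/transformers";

// Mirror Wilson's cache location (assumption: Wilson configures this the same way).
env.cacheDir = `${process.env.HOME}/.openaccountant/models`;

// The first call downloads weights from HuggingFace Hub; later runs hit the cache.
const generator = await pipeline("text-generation", "onnx-community/Qwen3-0.6B-ONNX");

const output = await generator(
  [{ role: "user", content: "Categorize: WHOLE FOODS $82.17" }],
  { max_new_tokens: 64 },
);
console.log(output);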

The models below require OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu and Bun 1.2+.

| Model | Size | Description |
| --- | --- | --- |
| onnx-community/granite-4.0-micro-ONNX-web | ~3B | IBM Granite Micro, tool-calling support |
| onnx-community/LFM2-1.2B-Tool-ONNX | ~1.2B | Purpose-built for tool use |
| onnx-community/granite-4.0-350m-ONNX-web | ~350M | IBM Granite, fast inference |
| onnx-community/Qwen3-0.6B-ONNX | ~0.6B | Qwen3 architecture |

The models below work on any hardware. No GPU required.

| Model | Size | Description |
| --- | --- | --- |
| HuggingFaceTB/SmolLM3-3B-ONNX | ~2 GB | 92.3% BFCL score, best CPU option |
| onnx-community/Qwen2.5-1.5B-Instruct | ~900 MB | Solid instruction following |

Set the compute device via environment variable:

# Default — runs on CPU via WASM (broadest compatibility)
OPENACCOUNTANT_TRANSFORMERS_DEVICE=cpu
# GPU-accelerated — requires Bun 1.2+ and a WebGPU-capable system
OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu
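
Under the hood, this env var plausibly maps onto the pipeline’s device option. The mapping below is an assumption, though device itself is a real v3 option:

// Sketch: translate the env var into the v3 `device` option (the mapping is assumed).
import { pipeline } from "@huggingface/transformers";

const device =
  process.env.OPENACCOUNTANT_TRANSFORMERS_DEVICE === "webgpu"
    ? "webgpu" // GPU path: Bun 1.2+ with WebGPU support
    : "wasm"; // default CPU path, broadest compatibility

const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM3-3B-ONNX",
  { device },
);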

Switch interactively with /model:

/model transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Or set as default in .env:

DEFAULT_MODEL=transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Download a model before first use to avoid the wait during inference:

/pull transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Wilson downloads all model files from HuggingFace Hub and caches them to ~/.openaccountant/models/. After this, the model loads from cache in under 2 seconds.
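
Warming the cache can also be done programmatically: instantiating a pipeline fetches and caches every model file, which is plausibly what /pull does under the hood (that equivalence is an assumption; progress_callback is part of the v3 API):

// Sketch: warm the model cache without generating anything.
import { pipeline } from "@huggingface/transformers";

const warm = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM3-3B-ONNX",
  // Log each file's download status as it streams in.
  { progress_callback: (p: any) => console.log(p.status, p.file ?? "") },
);
await warm.dispose(); // free the session; the cached files stay on disk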

Transformers.js models use prompt injection for tool calling. Wilson injects available tool schemas into the system prompt and parses the model’s response for tool call blocks:

<tool_call>{"name": "search_transactions", "arguments": {"query": "groceries"}}</tool_call>
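
The parsing step can be as simple as a regex plus JSON.parse. This sketch assumes exactly the <tool_call> format shown above; Wilson’s actual parser may be more forgiving:

// Sketch: extract <tool_call> blocks from model output (format assumed from above).
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCalls(response: string): ToolCall[] {
  const calls: ToolCall[] = [];
  // Match every <tool_call>...</tool_call> block in the model's response.
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  for (const match of response.matchAll(re)) {
    try {
      calls.push(JSON.parse(match[1]) as ToolCall);
    } catch {
      // Small models often emit malformed JSON; skip unparseable blocks.
    }
  }
  return calls;
}

// parseToolCalls('<tool_call>{"name":"search_transactions","arguments":{"query":"groceries"}}</tool_call>')
// → [{ name: "search_transactions", arguments: { query: "groceries" } }]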

This works reliably for single-tool calls — asking the model to categorize a transaction or look up spending. It becomes unreliable for multi-tool chains where the model needs to call several tools in sequence.

Keep these limitations in mind:

  • Model quality. Small models (0.5B–3B) are far less capable than Claude or GPT for complex reasoning.
  • First-run download. Models download from HuggingFace Hub on first use (~270 MB–2 GB depending on model). Use /pull to pre-download.
  • Tool calling reliability. Prompt-injected tool calling works for simple single-tool calls. Complex multi-tool agent tasks often fail.
  • No streaming. Responses are returned in full after generation completes.

| Use Case | Transformers.js | Better Alternative |
| --- | --- | --- |
| Offline, no internet | Yes | |
| Simple Q&A about your data | Yes | |
| Transaction categorization | Yes | |
| Multi-tool agent tasks | No | Ollama or cloud provider |
| Complex financial analysis | No | Cloud provider |
| Fine-tuned model deployment | No | Ollama |

| | Transformers.js | Ollama |
| --- | --- | --- |
| Setup | Zero config, auto-downloads | Install Ollama, start server, pull model |
| Server process | None (in-process) | Separate server on port 11434 |
| API key | None | None |
| Model format | ONNX | GGUF |
| GPU support | WebGPU (Bun 1.2+) | Metal, CUDA, ROCm |
| Model selection | ~6 ONNX models | Thousands of GGUF models |
| Tool calling | Prompt injection (unreliable for chains) | Native tool calling (model-dependent) |
| Fine-tuned models | Limited (ONNX format only) | Full support (GGUF + Modelfile) |
| Best for | Offline, simple tasks, no setup | Local inference with full model ecosystem |