
Transformers.js Local Inference

Transformers.js runs ONNX models directly inside Wilson’s Bun process via the @huggingface/transformers v3 runtime (WASM backend by default). No API key. No external server. No Ollama dependency.

Unlike Ollama (which runs a separate server process), Transformers.js loads models in-process. The first run downloads model weights from HuggingFace Hub and caches them locally. Subsequent runs load from cache in under 2 seconds.

  • Runtime: @huggingface/transformers v3, WASM backend
  • Cache location: ~/.openaccountant/models/
  • API key: None required
  • External server: None required
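
For illustration, here is a minimal sketch of what in-process loading looks like with the v3 API. The env.cacheDir assignment mirrors Wilson’s cache location, but how Wilson wires this internally is an assumption:

// Hedged sketch: load a supported ONNX model in-process and generate text.
import { pipeline, env } from "@huggingface/transformers";

// Mirror Wilson's cache location (assumption: Wilson configures this the same way).
env.cacheDir = `${process.env.HOME}/.openaccountant/models`;

// The first call downloads weights from HuggingFace Hub; later runs hit the cache.
const generator = await pipeline("text-generation", "onnx-community/Qwen3-0.6B-ONNX");

const output = await generator(
  [{ role: "user", content: "Categorize: WHOLE FOODS $82.17" }],
  { max_new_tokens: 64 },
);
console.log(output);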

The models below require OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu and Bun 1.2+.

| Model | Size | Description |
| --- | --- | --- |
| onnx-community/granite-4.0-micro-ONNX-web | ~3B | IBM Granite Micro, tool-calling support |
| onnx-community/LFM2-1.2B-Tool-ONNX | ~1.2B | Purpose-built for tool use |
| onnx-community/granite-4.0-350m-ONNX-web | ~350M | IBM Granite, fast inference |
| onnx-community/Qwen3-0.6B-ONNX | ~0.6B | Qwen3 architecture |

The models below work on any hardware. No GPU required.

| Model | Size | Description |
| --- | --- | --- |
| HuggingFaceTB/SmolLM3-3B-ONNX | ~2 GB | 92.3% BFCL score, best CPU option |
| onnx-community/Qwen2.5-1.5B-Instruct | ~900 MB | Solid instruction following |

Set the compute device via environment variable:

# Default — runs on CPU via WASM (broadest compatibility)
OPENACCOUNTANT_TRANSFORMERS_DEVICE=cpu
# GPU-accelerated — requires Bun 1.2+ and a WebGPU-capable system
OPENACCOUNTANT_TRANSFORMERS_DEVICE=webgpu
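
Under the hood, this env var plausibly maps onto the pipeline’s device option. The mapping below is an assumption, though device itself is a real v3 option:

// Sketch: translate the env var into the v3 `device` option (the mapping is assumed).
import { pipeline } from "@huggingface/transformers";

const device =
  process.env.OPENACCOUNTANT_TRANSFORMERS_DEVICE === "webgpu"
    ? "webgpu" // GPU path: Bun 1.2+ with WebGPU support
    : "wasm"; // default CPU path, broadest compatibility

const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM3-3B-ONNX",
  { device },
);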

Switch interactively with /model:

/model transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Or set as default in .env:

DEFAULT_MODEL=transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Download a model before first use to avoid the wait during inference:

/pull transformers:HuggingFaceTB/SmolLM3-3B-ONNX

Wilson downloads all model files from HuggingFace Hub and caches them to ~/.openaccountant/models/. After this, the model loads from cache in under 2 seconds.
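
Warming the cache can also be done programmatically: instantiating a pipeline fetches and caches every model file, which is plausibly what /pull does under the hood (that equivalence is an assumption; progress_callback is part of the v3 API):

// Sketch: warm the model cache without generating anything.
import { pipeline } from "@huggingface/transformers";

const warm = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM3-3B-ONNX",
  // Log each file's download status as it streams in.
  { progress_callback: (p: any) => console.log(p.status, p.file ?? "") },
);
await warm.dispose(); // free the session; the cached files stay on disk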

Transformers.js models use prompt injection for tool calling. Wilson injects available tool schemas into the system prompt and parses the model’s response for tool call blocks:

<tool_call>{"name": "search_transactions", "arguments": {"query": "groceries"}}</tool_call>
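
The parsing step can be as simple as a regex plus JSON.parse. This sketch assumes exactly the <tool_call> format shown above; Wilson’s actual parser may be more forgiving:

// Sketch: extract <tool_call> blocks from model output (format assumed from above).
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCalls(response: string): ToolCall[] {
  const calls: ToolCall[] = [];
  // Match every <tool_call>...</tool_call> block in the model's response.
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  for (const match of response.matchAll(re)) {
    try {
      calls.push(JSON.parse(match[1]) as ToolCall);
    } catch {
      // Small models often emit malformed JSON; skip unparseable blocks.
    }
  }
  return calls;
}

// parseToolCalls('<tool_call>{"name":"search_transactions","arguments":{"query":"groceries"}}</tool_call>')
// → [{ name: "search_transactions", arguments: { query: "groceries" } }]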

This works reliably for single-tool calls — asking the model to categorize a transaction or look up spending. It becomes unreliable for multi-tool chains where the model needs to call several tools in sequence.

Keep these limitations in mind:

  • Model quality. Small models (0.5B–3B) are far less capable than Claude or GPT for complex reasoning.
  • First-run download. Models download from HuggingFace Hub on first use (~270 MB–2 GB depending on model). Use /pull to pre-download.
  • Tool calling reliability. Prompt-injected tool calling works for simple single-tool calls. Complex multi-tool agent tasks often fail.
  • No streaming. Responses are returned in full after generation completes.

| Use Case | Transformers.js | Better Alternative |
| --- | --- | --- |
| Offline, no internet | Yes | |
| Simple Q&A about your data | Yes | |
| Transaction categorization | Yes | |
| Multi-tool agent tasks | No | Ollama or cloud provider |
| Complex financial analysis | No | Cloud provider |
| Fine-tuned model deployment | No | Ollama |

| | Transformers.js | Ollama |
| --- | --- | --- |
| Setup | Zero config, auto-downloads | Install Ollama, start server, pull model |
| Server process | None (in-process) | Separate server on port 11434 |
| API key | None | None |
| Model format | ONNX | GGUF |
| GPU support | WebGPU (Bun 1.2+) | Metal, CUDA, ROCm |
| Model selection | ~6 ONNX models | Thousands of GGUF models |
| Tool calling | Prompt injection (unreliable for chains) | Native tool calling (model-dependent) |
| Fine-tuned models | Limited (ONNX format only) | Full support (GGUF + Modelfile) |
| Best for | Offline, simple tasks, no setup | Local inference with full model ecosystem |