Deploying Trained Models

After fine-tuning a LoRA adapter (see Fine-Tuning with LoRA), you need to deploy it so Wilson can use it. Two paths are available: Ollama (recommended) and Transformers.js (experimental).

Ollama is the primary deployment target. It supports GGUF models with native tool calling, GPU acceleration, and the full model ecosystem.
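
If Ollama isn't installed yet, the official install script covers Linux (macOS and Windows installers are at ollama.com); verify the CLI responds before continuing:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version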

If you haven’t already converted your adapter, use llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# In current llama.cpp the adapter path is a positional argument
python convert_lora_to_gguf.py \
  --base Qwen/Qwen2.5-7B-Instruct \
  --outfile wilson-adapter.gguf \
  ../wilson-sft-adapter

See Fine-Tuning with LoRA — Step 4 for full conversion details including standalone model merging.
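
Before building on the adapter, it's worth sanity-checking the conversion output. The gguf-dump utility from the gguf Python package (pip install gguf) prints the file's metadata:

ls -lh wilson-adapter.gguf
gguf-dump wilson-adapter.gguf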

Wilson’s generateOllamaModelfile() function produces a Modelfile with the correct system prompt and parameters. The format:

# Base must match the model the adapter was trained on
FROM qwen2.5:7b-instruct
ADAPTER ./wilson-adapter.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
SYSTEM """You are Wilson, an AI bookkeeper from Open Accountant. You help users manage their personal finances with precision and care. You have access to tools for importing transactions, categorizing spending, analyzing budgets, and generating reports."""

Save this as Modelfile in the same directory as your GGUF adapter, then build the model:

ollama create wilson-7b -f Modelfile

Verify it works:

ollama run wilson-7b 'Categorize: Whole Foods $45.23'
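
You can also confirm that Ollama registered the adapter, system prompt, and parameters:

ollama show wilson-7b --modelfile
ollama list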

Switch Wilson to your fine-tuned model:

# Set as default for all sessions
echo 'DEFAULT_MODEL=ollama:wilson-7b' >> .env
# Or switch interactively
/model ollama:wilson-7b

Transformers.js supports ONNX-format models running in-process. This path is experimental and comes with limitations:

  • Your adapter and base model must be in ONNX format (an export sketch follows this list)
  • The model must be compatible with @huggingface/transformers v3
  • Tool calling uses prompt injection (less reliable than Ollama’s native support)
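
One common route to ONNX is Hugging Face Optimum's exporter, run against a merged model (merge the adapter into the base first; see Step 4 of Fine-Tuning with LoRA). This is a sketch that assumes your merged checkpoint lives at ./wilson-merged; whether the exported model runs under @huggingface/transformers v3 depends on the architecture:

pip install "optimum[exporters]"
optimum-cli export onnx --model ./wilson-merged wilson-onnx/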

Use the transformers: prefix to select your model:

/model transformers:your-username/wilson-onnx

See Transformers.js Local Inference for supported models and configuration.

Run a few representative queries to confirm your model works:

# Test basic categorization
wilson --run "Categorize my last 5 transactions"
# Test tool calling
wilson --run "What did I spend on groceries this month?"
# Test financial reasoning
wilson --run "Am I on track with my budget?"
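
While these run, you can confirm the model actually loaded and whether it landed on GPU or CPU:

ollama ps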

Run the same queries with a cloud model and your fine-tuned model side by side:

# Cloud model baseline
DEFAULT_MODEL=claude-sonnet-4-6 wilson --run "Summarize my spending this week"
# Your fine-tuned model
DEFAULT_MODEL=ollama:wilson-7b wilson --run "Summarize my spending this week"

Look for correct tool usage, accurate categorization, and well-structured responses.
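
To make the comparison repeatable, a small shell loop can run the same prompt through both models using the same DEFAULT_MODEL override:

for model in claude-sonnet-4-6 ollama:wilson-7b; do
  echo "=== $model ==="
  DEFAULT_MODEL="$model" wilson --run "Summarize my spending this week"
done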

Deployment isn’t the end — it’s the start of a feedback loop.

1. Use your trained model (wilson-7b)
2. Wilson captures all interactions automatically
3. Review interactions in the Dashboard Training tab
4. Create DPO pairs: trained model = rejected, cloud model = chosen
5. Export new training data (SFT + DPO)
6. Fine-tune v2
7. Deploy v2, repeat

Each iteration improves your model. The DPO pairs teach it specifically where it falls short compared to cloud models, while SFT on good interactions reinforces correct behavior.
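
For reference, DPO pairs are conventionally stored one JSON object per line with prompt, chosen, and rejected fields. Wilson's exact export schema may differ; this is the common layout used by DPO trainers such as TRL:

{"prompt": "Categorize: Whole Foods $45.23", "chosen": "<cloud model response>", "rejected": "<wilson-7b response>"}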

  • Rate aggressively. Give 1-2 stars to responses that miss the mark and 4-5 stars to good ones.
  • Focus DPO pairs on tool calling. The biggest gap between small and large models is tool usage.
  • Re-train after every 100+ new annotations. Smaller batches risk overfitting.
  • Version your models. Name them wilson-7b-v1, wilson-7b-v2, and so on, and keep old versions until the new one proves better (see the sketch after this list).
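
One lightweight way to handle versioning with Ollama is a flow like the following; the names and the promote step are conventions, not requirements:

# Preserve the current model under a versioned name before replacing it
ollama cp wilson-7b wilson-7b-v1
# Build the new iteration from the updated adapter and Modelfile
ollama create wilson-7b-v2 -f Modelfile
# Promote v2 to the default name once it proves better
ollama cp wilson-7b-v2 wilson-7b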