Deploying Trained Models

After fine-tuning a LoRA adapter (see Fine-Tuning with LoRA), you need to deploy it so Wilson can use it. Two paths are available: Ollama (recommended) and Transformers.js (experimental).

Ollama is the primary deployment target. It supports GGUF models with native tool calling, GPU acceleration, and the full model ecosystem.
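
If Ollama isn't installed yet, the official install script covers Linux (macOS and Windows installers are at ollama.com); verify the CLI responds before continuing:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version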

If you haven’t already converted your adapter, use llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# In current llama.cpp the adapter path is a positional argument
python convert_lora_to_gguf.py \
  --base Qwen/Qwen2.5-7B-Instruct \
  --outfile wilson-adapter.gguf \
  ../wilson-sft-adapter

See Fine-Tuning with LoRA — Step 4 for full conversion details including standalone model merging.
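
Before building on the adapter, it's worth sanity-checking the conversion output. The gguf-dump utility from the gguf Python package (pip install gguf) prints the file's metadata:

ls -lh wilson-adapter.gguf
gguf-dump wilson-adapter.gguf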

Wilson’s generateOllamaModelfile() function produces a Modelfile with the correct system prompt and parameters. The format:

# Base must match the model the adapter was trained on
FROM qwen2.5:7b-instruct
ADAPTER ./wilson-adapter.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
SYSTEM """You are Wilson, an AI bookkeeper from Open Accountant. You help users manage their personal finances with precision and care. You have access to tools for importing transactions, categorizing spending, analyzing budgets, and generating reports."""

Save this as Modelfile in the same directory as your GGUF adapter, then build the model:

ollama create wilson-7b -f Modelfile

Verify it works:

ollama run wilson-7b 'Categorize: Whole Foods $45.23'
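
You can also confirm that Ollama registered the adapter, system prompt, and parameters:

ollama show wilson-7b --modelfile
ollama list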

Switch Wilson to your fine-tuned model:

# Set as default for all sessions
echo 'DEFAULT_MODEL=ollama:wilson-7b' >> .env
# Or switch interactively
/model ollama:wilson-7b

Transformers.js supports ONNX-format models running in-process. This path is experimental and comes with limitations:

  • Your adapter and base model must be in ONNX format (an export sketch follows this list)
  • The model must be compatible with @huggingface/transformers v3
  • Tool calling uses prompt injection (less reliable than Ollama’s native support)
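
One common route to ONNX is Hugging Face Optimum's exporter, run against a merged model (merge the adapter into the base first; see Step 4 of Fine-Tuning with LoRA). This is a sketch that assumes your merged checkpoint lives at ./wilson-merged; whether the exported model runs under @huggingface/transformers v3 depends on the architecture:

pip install "optimum[exporters]"
optimum-cli export onnx --model ./wilson-merged wilson-onnx/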

Use the transformers: prefix to select your model:

/model transformers:your-username/wilson-onnx

See Transformers.js Local Inference for supported models and configuration.

Run a few representative queries to confirm your model works:

# Test basic categorization
wilson --run "Categorize my last 5 transactions"
# Test tool calling
wilson --run "What did I spend on groceries this month?"
# Test financial reasoning
wilson --run "Am I on track with my budget?"
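
While these run, you can confirm the model actually loaded and whether it landed on GPU or CPU:

ollama ps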

Run the same queries with a cloud model and your fine-tuned model side by side:

# Cloud model baseline
DEFAULT_MODEL=claude-sonnet-4-6 wilson --run "Summarize my spending this week"
# Your fine-tuned model
DEFAULT_MODEL=ollama:wilson-7b wilson --run "Summarize my spending this week"

Look for correct tool usage, accurate categorization, and well-structured responses.
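
To make the comparison repeatable, a small shell loop can run the same prompt through both models using the same DEFAULT_MODEL override:

for model in claude-sonnet-4-6 ollama:wilson-7b; do
  echo "=== $model ==="
  DEFAULT_MODEL="$model" wilson --run "Summarize my spending this week"
done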

Deployment isn’t the end — it’s the start of a feedback loop.

1. Use your trained model (wilson-7b)
2. Wilson captures all interactions automatically
3. Review interactions in the Dashboard Training tab
4. Create DPO pairs: trained model = rejected, cloud model = chosen
5. Export new training data (SFT + DPO)
6. Fine-tune v2
7. Deploy v2, repeat

Each iteration improves your model. The DPO pairs teach it specifically where it falls short compared to cloud models, while SFT on good interactions reinforces correct behavior.
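
For reference, DPO pairs are conventionally stored one JSON object per line with prompt, chosen, and rejected fields. Wilson's exact export schema may differ; this is the common layout used by DPO trainers such as TRL:

{"prompt": "Categorize: Whole Foods $45.23", "chosen": "<cloud model response>", "rejected": "<wilson-7b response>"}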

  • Rate aggressively. Give 1-2 stars to responses that miss the mark and 4-5 stars to good ones.
  • Focus DPO pairs on tool calling. The biggest gap between small and large models is tool usage.
  • Re-train after every 100+ new annotations. Smaller batches risk overfitting.
  • Version your models. Name them wilson-7b-v1, wilson-7b-v2, and so on, and keep old versions until the new one proves better (see the sketch after this list).
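
One lightweight way to handle versioning with Ollama is a flow like the following; the names and the promote step are conventions, not requirements:

# Preserve the current model under a versioned name before replacing it
ollama cp wilson-7b wilson-7b-v1
# Build the new iteration from the updated adapter and Modelfile
ollama create wilson-7b-v2 -f Modelfile
# Promote v2 to the default name once it proves better
ollama cp wilson-7b-v2 wilson-7b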