
Annotating Interactions

The Dashboard Training tab lets you browse every LLM interaction Wilson has made, inspect the full prompts and responses, rate quality, and create preference pairs for DPO training.

Terminal window
wilson # Dashboard starts at http://localhost:3141
/dashboard # Opens in your browser

Click the Training tab in the navigation bar.

The stats panel at the top shows:

| Stat | Description |
| --- | --- |
| Total | Total LLM interactions recorded |
| Annotated | How many have been rated or labeled |
| SFT Ready | Interactions rated 4+ stars (eligible for SFT export) |
| DPO Pairs | Number of chosen/rejected preference pairs |

The main table lists all recorded interactions with:

  • ID — Database row ID
  • Run — First 8 characters of the run UUID
  • Type — agent, summarize, relevance, etc.
  • Model — Which LLM was used
  • Tokens — Total token count
  • Rating — Star rating if annotated
  • Time — When the call was made

Use the dropdowns above the table to filter by:

  • Type — Show only agent calls, summarize calls, etc.
  • Annotated — Show only annotated or unannotated interactions
  • Rating — Filter by minimum star rating

Click any row to expand its full detail panel:

  • System Prompt — The complete system prompt sent to the model
  • User Prompt — The user’s query or iteration prompt with tool results
  • Response — The model’s full text response
  • Tool Calls — JSON of any tool calls the model requested
  • Tool Results — Output of each tool execution

This gives you complete visibility into what the model saw and what it produced.

Rate each interaction from 1 to 5 stars:

| Rating | Meaning | Training Use |
| --- | --- | --- |
| 1 star | Bad — wrong answer, hallucination, off-topic | Excluded from SFT, candidate for DPO “rejected” |
| 2 stars | Poor — partially correct but significant issues | Excluded from SFT |
| 3 stars | Acceptable — correct but could be better | Excluded from SFT by default |
| 4 stars | Good — correct and well-structured | Included in SFT export |
| 5 stars | Excellent — ideal response to learn from | Included in SFT export |

Click the star icons in the annotation panel to set the rating. The default SFT export threshold is 4 stars — only high-quality interactions become training data.
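The threshold rule above can be sketched as a small filter. This is only an illustration of the rating cutoff, not Wilson's export code; the `rating` and `id` keys and the `sft_ready` helper are hypothetical names chosen for the example.

```python
from typing import Iterable

DEFAULT_SFT_MIN_RATING = 4  # default threshold described above


def sft_ready(interactions: Iterable[dict],
              min_rating: int = DEFAULT_SFT_MIN_RATING) -> list:
    """Keep interactions rated at or above the threshold; unrated ones are dropped."""
    return [
        item for item in interactions
        if item.get("rating") is not None and item["rating"] >= min_rating
    ]


# Hypothetical records; real interactions carry many more fields.
sample = [
    {"id": 1, "rating": 5},
    {"id": 2, "rating": 3},
    {"id": 3, "rating": None},  # not yet annotated
    {"id": 4, "rating": 4},
]
print([item["id"] for item in sft_ready(sample)])  # [1, 4]
```

Passing `min_rating=3` would also admit the 3-star interaction, which mirrors the "excluded by default" wording in the table.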

Direct Preference Optimization (DPO) training requires pairs: a chosen response and a rejected response for the same prompt.

  1. Find two interactions with the same or similar user prompt but different responses
  2. Open the better response, set Preference to Chosen
  3. Enter a Pair ID (any string, e.g., groceries-1)
  4. Open the worse response, set Preference to Rejected
  5. Enter the same Pair ID

The pair is now linked. When you export DPO training data, both interactions are combined into a single training example with the prompt, chosen response, and rejected response.

  • Use descriptive pair IDs: categorize-groceries-1, spending-review-2
  • Each pair needs exactly one chosen and one rejected interaction with the same pair ID
  • You can create pairs across different models — useful for comparing a cloud model’s response against a local model’s response
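As a rough sketch of the pairing rule, the function below groups annotated interactions by pair ID and emits one example only when a pair has exactly one chosen and one rejected member. The `pairId` and `preference` keys follow the annotation payload shown later on this page; `userPrompt`, `response`, and the output layout are assumptions, and the actual DPO export may differ (for instance in which prompt it keeps when the two prompts are only similar).

```python
from collections import defaultdict


def build_dpo_pairs(annotated):
    """Return (examples, problems); problems maps a pair ID to why it was skipped."""
    groups = defaultdict(list)
    for item in annotated:
        if item.get("pairId"):
            groups[item["pairId"]].append(item)

    examples, problems = [], {}
    for pair_id, members in groups.items():
        chosen = [m for m in members if m.get("preference") == "chosen"]
        rejected = [m for m in members if m.get("preference") == "rejected"]
        if len(chosen) != 1 or len(rejected) != 1:
            problems[pair_id] = f"{len(chosen)} chosen, {len(rejected)} rejected"
            continue
        examples.append({
            "pairId": pair_id,
            "prompt": chosen[0]["userPrompt"],  # assumption: use the chosen side's prompt
            "chosen": chosen[0]["response"],
            "rejected": rejected[0]["response"],
        })
    return examples, problems


# Hypothetical annotated interactions.
rows = [
    {"pairId": "groceries-1", "preference": "chosen",
     "userPrompt": "Categorize: WHOLEFDS 1234", "response": "Groceries"},
    {"pairId": "groceries-1", "preference": "rejected",
     "userPrompt": "Categorize: WHOLEFDS 1234", "response": "Restaurants"},
    {"pairId": "spending-review-2", "preference": "chosen",
     "userPrompt": "Review my spending", "response": "Summary A"},
]
examples, problems = build_dpo_pairs(rows)
print(len(examples), problems)
```

Running a check like this before export makes incomplete pairs (such as `spending-review-2` above, which has no rejected side) easy to spot.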

Click Save to persist the annotation. Annotations are stored in the interaction_annotations table and survive across sessions.

Saving is an upsert — if an annotation already exists for the interaction, it’s replaced with the new values.
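To illustrate what "upsert" means here, the sketch below uses an in-memory SQLite database. Only the table name `interaction_annotations` comes from this page; the column layout, the use of SQLite, and the `save_annotation` helper are assumptions made for the example and may not match Wilson's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE interaction_annotations (
        interaction_id INTEGER PRIMARY KEY,  -- assumed: one annotation per interaction
        rating INTEGER,
        preference TEXT,
        pair_id TEXT,
        notes TEXT
    )
    """
)


def save_annotation(interaction_id, rating, preference=None, pair_id=None, notes=None):
    # INSERT OR REPLACE removes any existing row with the same key, then inserts the new one.
    conn.execute(
        "INSERT OR REPLACE INTO interaction_annotations VALUES (?, ?, ?, ?, ?)",
        (interaction_id, rating, preference, pair_id, notes),
    )
    conn.commit()


save_annotation(42, 3, notes="first pass")
save_annotation(42, 5, preference="chosen", pair_id="pair-1")  # replaces the first row
rows = conn.execute("SELECT * FROM interaction_annotations").fetchall()
print(rows)  # [(42, 5, 'chosen', 'pair-1', None)]
```

Note that in this sketch the earlier note is gone after the second save, because the whole row is replaced rather than merged field by field.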

You can also annotate programmatically via the dashboard API:

Terminal window
# All interactions (paginated); quote the URL so the shell does not interpret & and ?
curl "http://localhost:3141/api/interactions?limit=50&offset=0"
# Filter by call type
curl "http://localhost:3141/api/interactions?callType=agent"
# Only unannotated
curl "http://localhost:3141/api/interactions?annotated=false"

Terminal window
# Single interaction by ID
curl http://localhost:3141/api/interactions/42

Returns the interaction with its tool results and annotations.
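For scripted access, the read endpoints above could be wrapped along these lines. This is a minimal sketch using only the Python standard library: the helper names are hypothetical, combining several filters in one request and the `annotated=true` value are assumptions (the examples above show each filter on its own), and the response is assumed to be JSON.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:3141"


def interactions_url(limit: int = 50, offset: int = 0,
                     call_type: Optional[str] = None,
                     annotated: Optional[bool] = None) -> str:
    """Build the list URL with the query parameters shown in the curl examples."""
    params = {"limit": limit, "offset": offset}
    if call_type is not None:
        params["callType"] = call_type
    if annotated is not None:
        params["annotated"] = "true" if annotated else "false"
    return f"{BASE_URL}/api/interactions?{urlencode(params)}"


def interaction_url(interaction_id: int) -> str:
    """Build the URL for a single interaction."""
    return f"{BASE_URL}/api/interactions/{interaction_id}"


def get_json(url: str):
    """Fetch a URL and decode the body as JSON (requires the dashboard to be running)."""
    with urlopen(url) as resp:
        return json.load(resp)


print(interactions_url(annotated=False))
```

With the dashboard running, `get_json(interaction_url(42))` would fetch the same record as the curl command above.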

Terminal window
# Look up a run by its UUID
curl http://localhost:3141/api/runs/abc-123-uuid

Terminal window
# Save an annotation for interaction 42
curl -X POST http://localhost:3141/api/interactions/42/annotate \
-H "Content-Type: application/json" \
-d '{"rating": 5, "preference": "chosen", "pairId": "pair-1", "notes": "Great categorization"}'
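The same request can be sent from a script. The sketch below builds the JSON body with the field names from the curl example (`rating`, `preference`, `pairId`, `notes`) and posts it with the standard library. The client-side checks, the lowercase `rejected` value, the idea that optional fields may be omitted, and the JSON response are all assumptions; the server may enforce different rules.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:3141"
PREFERENCES = {"chosen", "rejected"}  # "rejected" assumed by symmetry with "chosen"


def build_annotation(rating: int, preference: Optional[str] = None,
                     pair_id: Optional[str] = None,
                     notes: Optional[str] = None) -> dict:
    """Validate the fields locally and return the JSON body for the annotate call."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    if preference is not None and preference not in PREFERENCES:
        raise ValueError("preference must be 'chosen' or 'rejected'")
    if preference is not None and not pair_id:
        raise ValueError("a preference needs a pair ID to link it to its counterpart")
    body = {"rating": rating}
    if preference is not None:
        body["preference"] = preference
    if pair_id is not None:
        body["pairId"] = pair_id
    if notes is not None:
        body["notes"] = notes
    return body


def annotate(interaction_id: int, body: dict):
    """POST the annotation (requires the dashboard to be running)."""
    req = Request(
        f"{BASE_URL}/api/interactions/{interaction_id}/annotate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)


print(build_annotation(5, "chosen", "pair-1", "Great categorization"))
```

Validating before sending keeps malformed pairs (a preference without a pair ID, for example) from being saved in the first place.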

Terminal window
# Annotation statistics
curl http://localhost:3141/api/annotations/stats

Returns total interactions, annotated count, rating distribution, DPO pair count, and SFT-ready count.