Workshop · Advanced

Improving Agents Over Time — Data → Fine-Tune → Redeploy

Capture Runs. Curate Data. Fine-Tune Models. Ship Better Agents.

Learn the full agent improvement lifecycle: capture good and bad agent runs, curate a training dataset, fine-tune or distill a model on Nebius, and redeploy behind the same workflow. All running on Nebius Serverless + Token Factory — no infrastructure changes needed to go from prototype to fine-tuned production model.

Jump to Step-by-Step Guide

Who This Is For

AI engineers, ML-adjacent developers, teams with agents already in production

Key Value

Continuous agent improvement loop from data capture to fine-tuned deployment

You'll Say

"We fine-tuned an 8B model that outperforms GPT-4 on our specific workflow — and it runs 10x cheaper"

What You'll Build

  1. A data capture pipeline that logs agent runs with quality labels
  2. A curated fine-tuning dataset from real agent interactions
  3. A fine-tuned model deployed on Token Factory behind your existing workflow

What We'll Cover

  • Logging agent runs: what to capture, how to label good vs. bad outputs
  • Dataset curation: filtering, formatting, and quality control for fine-tuning
  • Fine-tuning on Nebius: model selection, training configuration, evaluation
  • Distillation: using a large model's outputs to train a smaller, cheaper model
  • Redeployment: swap the fine-tuned model into your workflow without changing code
  • Monitoring: tracking whether the fine-tuned model actually improves outcomes

Schedule

12:00 PM – 12:30 PM

The Agent Improvement Lifecycle

Why fine-tuning matters and how the data → train → deploy loop works

  • Why base models plateau on specific tasks — and how fine-tuning breaks through
  • The lifecycle: capture runs → label quality → curate dataset → fine-tune → redeploy → monitor
  • When to fine-tune vs. when to improve prompts (and how to tell the difference)
12:30 PM – 1:15 PM

Hands-On: Data Capture & Dataset Curation

Set up run logging, label outputs, and build a training dataset

  • Add structured logging to your agent to capture inputs, outputs, and tool calls
  • Label runs as good/bad/needs-improvement using a simple scoring system
  • Filter and format data into fine-tuning format (chat completion format)
  • Quality control: remove PII, deduplicate, balance positive/negative examples
1:15 PM – 2:00 PM

Fine-Tune & Deploy

Fine-tune a model on Nebius and deploy it behind your existing workflow

  • Choose a base model for fine-tuning (Llama 3.1 8B or 70B)
  • Configure and launch a fine-tuning job on Nebius
  • Evaluate the fine-tuned model against the base model
  • Deploy the fine-tuned model on Token Factory
2:00 PM – 2:30 PM

Distillation, Monitoring & Next Steps

Advanced patterns: model distillation, A/B testing, and continuous improvement

  • Distillation: use GPT-4 or Llama 70B outputs to train a fast 7B model
  • A/B testing: route traffic between base and fine-tuned models
  • Monitoring: track accuracy, latency, and cost over time
  • Q&A and production deployment patterns

Prerequisites

  • Laptop with a browser and terminal access
  • A Nebius AI Cloud account with Token Factory access
  • An existing agent workflow (or complete Workshop 1 first)

You'll Leave With

A data capture pipeline for agent runs
A curated fine-tuning dataset
A fine-tuned model running on Token Factory
A redeployment workflow that swaps models without downtime
Monitoring queries to track improvement over time

Step-by-Step Guide

Follow these steps during the workshop. Each step includes commands you can copy, tips from our mentors, and a checkpoint to verify before moving on.

Step 1 · ~10 min

Set Up Your Agent Stack

Deploy OpenClaw on Nebius Serverless with Token Factory — or use your existing setup from Workshop 1.

Instructions

  1. If you completed Workshop 1, verify your agent is still running
  2. If starting fresh, deploy OpenClaw on Serverless and connect to Token Factory
  3. Ensure you have a working agent workflow to capture data from

Commands

# Verify existing endpoint (if you have one)
nebius msp serverless v1alpha1 endpoint get-by-name --name openclaw-agent
# Or deploy fresh (see Workshop 1 for full setup)
nebius msp serverless v1alpha1 endpoint create \
  --name openclaw-lifecycle \
  --container-image openclaw:latest \
  --container-template-resources-platform cpu-d3 \
  --container-template-resources-preset 4vcpu-16gb \
  --port 8080 \
  --username admin \
  --password "$(openssl rand -hex 32)" \
  --network-id <your-network-id> \
  --parent-id <your-project-id>

Checkpoint

Your OpenClaw agent is running and processing requests on Nebius Serverless.

Step 2 · ~10 min

Add Run Logging

Instrument your agent to log every run — inputs, outputs, tool calls, and timing — in a structured format.

Instructions

  1. Add a logging middleware to your agent that captures each interaction
  2. Log the full message history, tool calls, and final output
  3. Include timing data (latency per step) and token counts
  4. Store logs in JSONL format for easy processing later

Tips

Log everything at first — you can filter later. It's much harder to add logging retroactively.
Use JSONL (one JSON object per line) — it's the easiest format to process and filter
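A minimal sketch of what this logging middleware might look like. The `log_run` helper and its field set are illustrative assumptions, not part of OpenClaw; adapt the fields to whatever your agent framework exposes:

```python
import json
import time
import uuid


def log_run(log_path, messages, tool_calls, output, usage, started_at):
    """Append one agent run as a single JSON line (JSONL)."""
    record = {
        "run_id": str(uuid.uuid4()),
        "timestamp": started_at,
        "latency_s": round(time.time() - started_at, 3),
        "messages": messages,      # full message history (system/user/assistant)
        "tool_calls": tool_calls,  # e.g. [{"name": ..., "arguments": ...}]
        "output": output,          # final assistant answer
        "usage": usage,            # e.g. {"prompt_tokens": ..., "completion_tokens": ...}
        "score": None,             # filled in later during labeling (Step 3)
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending one object per line means you can label, filter, and split the file later with plain line-by-line tools.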

Checkpoint

Running your agent produces a JSONL log file with full interaction details for each run.

Step 3 · ~10 min

Label and Score Runs

Go through captured runs and label them as good, bad, or needs-improvement. This is your training signal.

Instructions

  1. Review 20-30 agent runs from your logs
  2. Score each run: 1 (bad), 2 (acceptable), 3 (good)
  3. For bad runs, note what went wrong (wrong tool call, bad extraction, hallucination)
  4. For good runs, note what made them good (correct output, efficient tool use)

Tips

You don't need thousands of examples. 50-100 high-quality labeled runs can dramatically improve a model.
Focus on runs where the model almost got it right — those are the most valuable for fine-tuning.
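One way to attach the scores programmatically. The `label_run` and `label_file` helpers are hypothetical, assuming the JSONL run records from Step 2:

```python
import json

SCORES = {1: "bad", 2: "acceptable", 3: "good"}


def label_run(record, score, note=""):
    """Attach a quality score (1-3) and a free-text note to a logged run."""
    if score not in SCORES:
        raise ValueError(f"score must be one of {sorted(SCORES)}")
    record = dict(record)  # don't mutate the caller's copy
    record["score"] = score
    record["score_label"] = SCORES[score]
    record["note"] = note  # e.g. "wrong tool call", "clean extraction"
    return record


def label_file(in_path, out_path, labels):
    """labels maps run_id -> (score, note); unlabeled runs pass through unchanged."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            rec = json.loads(line)
            if rec["run_id"] in labels:
                rec = label_run(rec, *labels[rec["run_id"]])
            dst.write(json.dumps(rec) + "\n")
```

Writing labels to a new file keeps the raw capture log untouched, so you can re-label with different criteria later.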

Checkpoint

You have 30+ labeled runs with quality scores and notes on what went right/wrong.

Step 4 · ~15 min

Curate the Fine-Tuning Dataset

Transform your labeled runs into a fine-tuning dataset in chat completion format.

Instructions

  1. Filter to only good runs (score 3) and corrected versions of bad runs
  2. Format each example as a chat completion (system + user + assistant messages)
  3. Remove any PII or sensitive data from the training examples
  4. Split into training (90%) and evaluation (10%) sets

Tips

Quality over quantity — 50 perfect examples beat 500 noisy ones
Include the tool call format in your training data so the model learns your specific tool schema
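A sketch of the curation pass, assuming the labeled records from Step 3. The PII scrub shown here is deliberately minimal (emails only); real curation needs a fuller pass for your data:

```python
import json
import random
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # minimal PII scrub; extend as needed


def to_chat_example(record, system_prompt):
    """Convert one logged run into chat-completion fine-tuning format."""
    text = EMAIL.sub("[EMAIL]", record["output"])
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            *record["messages"],  # original user/tool turns from the log
            {"role": "assistant", "content": text},
        ]
    }


def curate(records, system_prompt, eval_frac=0.1, seed=0):
    """Keep only score-3 runs, format them, and split 90/10 into train/eval."""
    good = [to_chat_example(r, system_prompt) for r in records if r.get("score") == 3]
    random.Random(seed).shuffle(good)  # fixed seed makes the split reproducible
    n_eval = max(1, int(len(good) * eval_frac)) if good else 0
    return good[n_eval:], good[:n_eval]  # (train, eval)
```

Write each split out with one `json.dumps(example)` per line to get the two JSONL files the checkpoint asks for.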

Checkpoint

You have a training JSONL file and an eval JSONL file, both in chat completion format.

Step 5 · ~15 min

Fine-Tune on Nebius

Launch a fine-tuning job on Nebius using your curated dataset. We'll fine-tune a Llama model to specialize on your workflow.

Instructions

  1. Upload your training and eval datasets to Nebius
  2. Configure the fine-tuning job: base model, learning rate, epochs
  3. Launch the job and monitor progress
  4. Evaluate the fine-tuned model against the base model on your eval set

Tips

Start with Llama 3.1 8B for fast iteration. Move to 70B once you've validated the approach.
2-3 epochs is usually enough. More epochs can lead to overfitting on small datasets.
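Before launching, it helps to keep the hyperparameters in one reviewable place. This is a sketch only: the actual Nebius job-submission API and its accepted parameter names may differ, so treat every field below as an assumption to check against the Nebius docs:

```python
def build_finetune_config(train_file, eval_file,
                          base_model="meta-llama/Llama-3.1-8B-Instruct",
                          epochs=3, learning_rate=1e-5, lora=True):
    """Assemble fine-tuning hyperparameters as a plain dict.

    Defaults mirror the workshop tips: an 8B base model for fast
    iteration and a small epoch count to avoid overfitting on a
    small dataset.
    """
    if not 1 <= epochs <= 5:
        raise ValueError("2-3 epochs is usually enough; more risks overfitting")
    return {
        "base_model": base_model,
        "training_file": train_file,
        "validation_file": eval_file,
        "hyperparameters": {
            "n_epochs": epochs,
            "learning_rate": learning_rate,
            "lora": lora,  # parameter-efficient tuning keeps cost and time down
        },
    }
```

Logging this dict alongside the resulting job ID makes later runs comparable: you can see exactly which settings produced which eval numbers.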

Checkpoint

Your fine-tuning job completes and the eval shows improvement over the base model.

Step 6 · ~10 min

Deploy the Fine-Tuned Model

Deploy your fine-tuned model on Token Factory and swap it into your existing agent workflow — no code changes needed.

Instructions

  1. Deploy the fine-tuned model to a Token Factory endpoint
  2. Update your OpenClaw configuration to point at the new model
  3. Run the same test cases through both models to compare
  4. Switch traffic to the fine-tuned model

Tips

The beauty of this architecture: swap model names in your config, everything else stays the same
Keep the base model endpoint active for A/B testing and fallback
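The "no code changes" swap works because the model name lives in config rather than code. A sketch of that pattern plus a fallback wrapper (all model names are placeholders):

```python
import os


def resolve_model(config=None):
    """Pick the model name from config or environment so a swap needs no code change.

    Order: explicit config > AGENT_MODEL env var > base-model default.
    """
    config = config or {}
    return (
        config.get("model")
        or os.environ.get("AGENT_MODEL")
        or "meta-llama/Llama-3.1-8B-Instruct"  # fallback: the base model
    )


def call_with_fallback(call, primary, fallback):
    """Try the fine-tuned endpoint first; fall back to the base model on failure."""
    try:
        return call(primary)
    except Exception:
        return call(fallback)
```

Keeping the base endpoint alive (per the tip above) is what makes `call_with_fallback` safe: a bad fine-tune degrades to base-model quality instead of an outage.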

Checkpoint

Your agent is running on the fine-tuned model and producing better results than the base model.

Step 7 · ~10 min

Monitor and Iterate

Set up monitoring to track whether your fine-tuned model actually improves outcomes over time, and plan the next iteration.

Instructions

  1. Set up automated quality checks on agent outputs
  2. Track key metrics: accuracy, latency, cost per run, error rate
  3. Plan a schedule for the next data capture → fine-tune cycle (weekly or monthly)
  4. Document the process so your team can run it independently

Commands

# Check endpoint metrics
nebius msp serverless v1alpha1 endpoint get $ENDPOINT_ID
# View recent logs
nebius msp serverless v1alpha1 endpoint logs $ENDPOINT_ID

Tips

The first fine-tune is the biggest improvement. Each subsequent cycle gives smaller but compounding gains.
Distillation shortcut: use a large model (70B or GPT-4) to generate 'perfect' outputs, then fine-tune a small model (7B or 8B) on those outputs. Same quality, 10x cheaper inference.
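The A/B split and metric tracking can be sketched as deterministic hash-based routing plus a simple per-variant aggregator (the run-record field names are assumptions to match against your logs):

```python
import hashlib


def route_variant(user_id, finetuned_share=0.2):
    """Deterministically route a fraction of traffic to the fine-tuned model.

    Hashing the user id keeps each user on one variant across requests,
    which avoids confusing mid-session model switches.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "finetuned" if bucket < finetuned_share * 100 else "base"


def summarize(runs):
    """Aggregate per-variant accuracy, latency, and cost from logged runs."""
    totals = {}
    for r in runs:
        s = totals.setdefault(r["variant"],
                              {"n": 0, "correct": 0, "latency": 0.0, "cost": 0.0})
        s["n"] += 1
        s["correct"] += int(r["correct"])
        s["latency"] += r["latency_s"]
        s["cost"] += r["cost_usd"]
    return {
        v: {"accuracy": s["correct"] / s["n"],
            "avg_latency_s": s["latency"] / s["n"],
            "total_cost_usd": s["cost"]}
        for v, s in totals.items()
    }
```

Start with a small `finetuned_share`, compare the two summaries, and ramp up only when the fine-tuned variant wins on accuracy without regressing latency or cost.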

Checkpoint

You have monitoring in place, a documented process, and a plan for the next fine-tuning cycle.

Ready to Build?

RSVP required. Spots are limited since we provide hands-on support for every attendee.

Register Now