TuneSalon AI

Benchmark Results

Last updated April 8, 2026

How fine-tuned open models compare to the world's best AI, tested across 5 real-world tasks and judged by Perplexity as an independent evaluator.

Overview

We compared fine-tuned models against four frontier AI systems and their base (unmodified) counterparts across five real-world tasks. Each task represents a practical use case where businesses commonly use AI: classifying customer messages, extracting data from documents, writing product descriptions, reading medical notes, and simplifying legal language.

The goal was simple: find out how close a custom-trained open model can get to the most powerful AI models available today, and where it still falls short.

5 Task Categories
6 Models Compared (4 frontier, 1 base, 1 fine-tuned)
250 Evaluation Prompts

Methodology

For each task, we used Perplexity to generate an evaluation dataset of 50 prompts spanning easy, medium, and hard difficulty levels. Every model received the same prompts with the same system instructions. Responses were then scored by Perplexity as an independent judge on a 0 to 5 scale (or exact-match accuracy for classification tasks).
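The scoring loop described above can be sketched as a minimal harness. Here, `judge` stands in for the Perplexity judging call, and `models` maps names to generation callables; these names and signatures are illustrative, not the actual evaluation code.

```python
def evaluate(models, prompts, judge):
    """Score every model on the same prompt set.

    models:  dict mapping model name -> callable(prompt) -> response
    judge:   callable(prompt, response) -> score in [0, 5]
    Returns: dict mapping model name -> mean score over all prompts
    """
    results = {}
    for name, generate in models.items():
        scores = [judge(p, generate(p)) for p in prompts]
        results[name] = sum(scores) / len(scores)
    return results
```

Because every model sees identical prompts and system instructions, differences in mean score are attributable to the model rather than the harness.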

Models Tested

Model                              | Type                 | Provider
Claude Opus 4.6                    | Frontier API         | Anthropic
Claude Sonnet 4.6                  | Frontier API         | Anthropic
GPT-5                              | Frontier API         | OpenAI
GPT-5.4                            | Frontier API         | OpenAI
Qwen3 (Base)                       | Open-source baseline | Alibaba
★ Qwen3 (Fine-tuned via TuneSalon) | Fine-tuned           | TuneSalon AI

Training Setup

Each fine-tuned model was trained with approximately 500 examples, 3 epochs, and LoRA rank 16. The same training configuration was used across all tasks to ensure fair comparison. No additional prompt engineering or few-shot examples were used for any model.
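As a rough sketch, the setup above maps onto a standard LoRA configuration in the Hugging Face peft library. The rank (16) and epoch count (3) come from this article; the alpha, dropout, target modules, batch size, and learning rate are typical defaults, not TuneSalon's actual settings.

```python
# Sketch of the training configuration described above, using the
# Hugging Face peft + transformers APIs. Values marked "assumed" are
# common defaults, not TuneSalon's published settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # LoRA rank, as stated above
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed
    target_modules=["q_proj", "v_proj"],  # common choice, assumed
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen3-lora",
    num_train_epochs=3,                   # as stated above
    per_device_train_batch_size=4,        # assumed
    learning_rate=2e-4,                   # typical LoRA rate, assumed
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = get_peft_model(model, lora_config)  # only adapter weights train
```

With LoRA, only the small adapter matrices are updated during training, which is why the resulting adapter can later be downloaded and run separately from the base model.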

Customer Support Classification

Task: Classify customer messages into one of 5 categories (Billing, Technical Issue, Feature Request, Account Access, General Inquiry).

Base model: Qwen3 (4B) on A100 GPU, the smallest model in our lineup.

Classification Accuracy (%)

50 customer messages, exact-match scoring

Claude Sonnet 4.6: 100%
Claude Opus 4.6: 96%
GPT-5.4: 96%
GPT-5: 94%
★ Fine-tuned Qwen3-4B: 94%
Base Qwen3-4B: 92%

A 4-billion-parameter model, fine-tuned with 500 examples, matches GPT-5's accuracy on customer support classification and comes within 6 percentage points of a perfect score.

Fine-tuning improved accuracy most on edge cases: ambiguous messages where the base model confused "Account Access" with "Technical Issue", and feature requests that the base model sometimes mislabelled. On easy messages, both base and fine-tuned models were nearly perfect.

Honest take: Classification is a relatively straightforward task, and all models performed well here. The 2-percentage-point improvement from fine-tuning (92% to 94%) is modest. For simple categorisation tasks, even a small base model can already do a solid job. Where fine-tuning adds the most value is in teaching the model your specific categories and edge cases that generic models might handle differently.
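Exact-match scoring for this task amounts to the following check. This is a minimal sketch; the whitespace and case normalisation rules are an assumption about how labels were compared.

```python
def exact_match_accuracy(predictions, labels):
    """Percentage of predictions that match the gold label exactly,
    after trimming whitespace and ignoring case."""
    correct = sum(
        p.strip().lower() == l.strip().lower()
        for p, l in zip(predictions, labels)
    )
    return 100.0 * correct / len(labels)
```

On this metric, 94% on a 50-message set corresponds to 47 messages classified correctly.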

Structured Output (Invoice Extraction)

Task: Extract structured JSON from invoice text: vendor name, invoice number, date, total amount, and line items. Prompts ranged from clean invoices to messy OCR-like text with typos and missing fields.

Base model: Qwen3 (8B) on A100 GPU.

Extraction Quality (0 to 5 scale)

50 invoices, judged on schema compliance + field accuracy

Claude Opus 4.6: 4.8/5
GPT-5: 4.7/5
Claude Sonnet 4.6: 4.6/5
GPT-5.4: 4.5/5
★ Fine-tuned Qwen3-8B: 4.1/5
Base Qwen3-8B: 3.8/5

Fine-tuning eliminated the base model's biggest problem: outputting reasoning text mixed into the JSON. The fine-tuned model reliably produces clean, parseable JSON and hallucinates fewer invoice numbers on ambiguous inputs.

On easy invoices, all models performed nearly identically. The gap appeared on medium and hard prompts with discounts, deposits, GST calculations, and OCR noise. Frontier models handled ambiguous totals (net vs gross) more consistently.

Honest take: Frontier models still hold a meaningful lead here (4.5 to 4.8 vs 4.1). On messy, real-world invoices with OCR noise and ambiguous totals, the fine-tuned model still struggles compared to larger frontier systems. However, the fine-tuned model's strength is producing reliably parseable output. It never dumps reasoning text into the JSON and is more conservative about hallucinating missing data, which matters in production pipelines.
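The "reliably parseable output" property is easy to check mechanically. A minimal validity check might look like the following; the snake_case field names are illustrative, derived from the task description above rather than from the actual evaluation schema.

```python
import json

REQUIRED_FIELDS = {"vendor_name", "invoice_number", "date",
                   "total_amount", "line_items"}

def check_extraction(raw_output):
    """Return (is_valid_json, missing_fields) for one model response.

    A response that mixes reasoning text into the JSON fails to parse
    and is reported with all fields missing."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, set(REQUIRED_FIELDS)
    if not isinstance(data, dict):
        return False, set(REQUIRED_FIELDS)
    return True, REQUIRED_FIELDS - data.keys()
```

In a production pipeline, a check like this runs on every response, which is why schema compliance matters as much as raw field accuracy.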

E-Commerce Product Copywriting

Task: Write a concise product description from structured specs (name, materials, dimensions, features). Judged on factual accuracy, feature coverage, and style quality.

Base model: Qwen3 (14B) on A100 GPU.

Description Quality (0 to 5 scale)

48 products, judged on factual accuracy + feature coverage + style

Claude Opus 4.6: 4.7/5
Claude Sonnet 4.6: 4.6/5
GPT-5: 4.6/5
GPT-5.4: 4.4/5
★ Fine-tuned Qwen3-14B: 4.1/5
Base Qwen3-14B: 3.8/5

Fine-tuning fixed the base model's repetition problem (product names appearing multiple times in a single description) and lifted feature coverage from roughly 75-80% to 85-90% of listed product features mentioned correctly.

The fine-tuned model was notably the most factually conservative among all models tested. It had the lowest rate of hallucinated marketing claims, sticking closely to the product specs without inventing features like "military-grade protection" that were not in the brief.

Honest take: Frontier models are clearly stronger on stylistic polish and "copywriter quality" writing. The fine-tuned model scores 4.1 vs 4.6 to 4.7 for top frontier models, a noticeable gap. Where the fine-tuned model wins is on factual discipline: it preserves all dimensions, capacities, and warranty details without embellishment. For brands that prioritise accuracy over flair, that trade-off may be worthwhile.
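Feature coverage of the kind measured here can be approximated with a simple substring check. This is a rough sketch: the actual judging was done by Perplexity, not by string matching, so treat this as an illustration of the metric rather than the evaluation method.

```python
def feature_coverage(description, features):
    """Fraction of listed spec features that appear verbatim
    (case-insensitive) in the generated description."""
    text = description.lower()
    mentioned = sum(f.lower() in text for f in features)
    return mentioned / len(features)
```

A coverage metric like this rewards the fine-tuned model's factual discipline: restating the spec verbatim scores well, while an invented claim like "military-grade protection" adds nothing.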

Medical Record Extraction

Task: Extract patient concern, symptoms, recommended treatment, and follow-up status from clinical notes. Output as structured JSON.

Base model: Qwen3 (32B) on H200 GPU.

Extraction Quality (0 to 5 scale)

50 clinical notes, judged on field accuracy + completeness

Claude Opus 4.6: 4.4/5
Claude Sonnet 4.6: 4.3/5
GPT-5: 4.3/5
GPT-5.4: 4.15/5
★ Fine-tuned Qwen3-32B: 4.05/5
Base Qwen3-32B: 3.65/5

Fine-tuning closed more than half the gap to Claude Opus (a +0.4 improvement against a 0.75-point gap). The biggest gain was in the recommended treatment field, where the base model frequently left it blank, while the fine-tuned model learned to provide specific treatment plans matching clinical practice patterns.

An interesting finding: frontier models sometimes scored lower because they were "too smart", adding guideline-level workup recommendations instead of extracting what the note actually said. The fine-tuned model better matched the expected extraction style, especially for follow-up flags (correctly distinguishing "yes" for chronic conditions vs "no" for routine cosmetic procedures).

Honest take: This was the closest race. The fine-tuned model (4.05) nearly matches GPT-5.4 (4.15) and is not far behind Opus (4.4). The gap to frontier is real but narrow, and part of the frontier models' advantage comes from medical knowledge the base model simply does not have. With more domain-specific training data, this gap could potentially narrow further.

Legal Language Simplification

Task: Rewrite complex legal contract clauses into plain English that a non-lawyer can understand, while preserving every obligation, condition, deadline, and party reference.

Base model: Qwen3 (32B) on H200 GPU.

Rewrite Quality (0 to 5 scale)

50 legal clauses, judged on meaning preservation + simplification

Claude Opus 4.6: 4.7/5
Claude Sonnet 4.6: 4.6/5
GPT-5: 4.6/5
GPT-5.4: 4.45/5
★ Fine-tuned Qwen3-32B: 4.45/5
Base Qwen3-32B: 4.2/5

The fine-tuned model matched GPT-5.4's score and came within 0.25 points of Claude Opus. It learned to explicitly restate each legal qualifier ("non-exclusive, non-transferable, non-sublicensable") in simple terms rather than glossing over them.

The fine-tuned model preserved temporal details (like "2-year post-employment period") that the base model sometimes dropped, and hit the target tone most consistently. Frontier models added helpful extras like mini-glossaries, which is useful but goes beyond the rewrite brief.

Honest take: All models performed near-ceiling on this task. The differences are mainly in style and completeness of nuance, not in legal accuracy. Fine-tuning's value here is in learning a specific rewriting style (short sentences, explicit restatement of each clause) that the base model does not naturally produce. Frontier models still edge ahead on overall polish.

Summary

Customer Support

Base Model: 92%
★ Fine-tuned: 94%
Best (Sonnet): 100%

Invoice Extraction

Base Model: 3.8 / 5
★ Fine-tuned: 4.1 / 5
Best (Opus): 4.8 / 5

E-Commerce Copy

Base Model: 3.8 / 5
★ Fine-tuned: 4.1 / 5
Best (Opus): 4.7 / 5

Medical Extraction

Base Model: 3.65 / 5
★ Fine-tuned: 4.05 / 5
Best (Opus): 4.4 / 5

Legal Rewrite

Base Model: 4.2 / 5
★ Fine-tuned: 4.45 / 5
Best (Opus): 4.7 / 5

Average Improvement

+0.3
points over base model
across all tasks (0 to 5 scale)

Across all five tasks, fine-tuning consistently improved the base model by an average of 0.3 points on a 0 to 5 scale. The improvement was most dramatic on tasks where the base model had clear weaknesses: empty treatment fields in medical extraction, broken JSON in invoice parsing, repetitive text in e-commerce copy.
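The headline average can be reproduced from the summary scores above. Only the four tasks scored on the 0 to 5 scale are included here; the classification task is measured in percent and is not part of this average.

```python
# Scores taken from the summary section above.
base  = {"invoice": 3.8, "ecommerce": 3.8, "medical": 3.65, "legal": 4.2}
tuned = {"invoice": 4.1, "ecommerce": 4.1, "medical": 4.05, "legal": 4.45}

deltas = {task: tuned[task] - base[task] for task in base}
average_improvement = sum(deltas.values()) / len(deltas)
# 0.3125 exactly, reported above as roughly +0.3
```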

Frontier models still lead on every task, typically by 0.3 to 0.7 points over the fine-tuned model. The gap is smallest on tasks where domain-specific patterns matter most (medical extraction, legal rewriting) and largest on tasks requiring broad general intelligence (invoice parsing, creative copywriting).

What This Means for You

These results tell an honest story: frontier models like Claude Opus and GPT-5 are still the best general-purpose AI available. Fine-tuning does not beat them on raw quality. But it gets you remarkably close, and it comes with practical advantages that API-only approaches cannot offer.

Your model learns your style

Frontier models need detailed system prompts and examples every time. A fine-tuned model just knows how you want it to respond: your format, your tone, your terminology. No prompt engineering required.

Consistent output, every time

API models can drift between calls with different formatting, varying levels of detail, and occasional surprises. Fine-tuned models produce the same structure and style reliably, which matters when you are processing hundreds or thousands of documents.

Full privacy with the desktop app

Every API call sends your data to a third party. With our upcoming desktop app, you can download your fine-tuned adapter and run it entirely on your own hardware. Sensitive medical records, legal documents, and customer data never leave your machine. Cloud training on our website still processes data through our servers, but the adapter you create is yours to take offline.

A model that is truly yours

No API rate limits. No surprise pricing changes. No model deprecation notices. Once you have trained your adapter, it is yours to keep and use however you want, on our cloud GPUs or on your own hardware.

The benchmark shows what fine-tuning can do with just 500 training examples. With more domain-specific data (your company's actual documents, your preferred format, your industry's terminology), the results only get better.