
After You Export

Last updated April 11, 2026

You trained a model and exported it. Now what? This guide covers your options for using it inside TuneSalon, on your local machine, or deployed to a live website or service.

Your Two Exports

When training finishes on TuneSalon, you have two ways to export your work. Each serves a different purpose.

.adapter: Lightweight add-on
  • Small file (typically under 200 MB)
  • Sits on top of the base model
  • Works in TuneSalon Chat and Library
  • Easy to swap, stack, and share
.gguf: Standalone model
  • Larger file (1 GB and up, depending on model)
  • Base model + adapter merged into one file
  • Runs independently, no base model needed
  • Works outside TuneSalon with any compatible tool

Think of the .adapter as a lens that clips onto a camera. It only works with the right camera body (the base model). The .gguf is more like a fully assembled camera, ready to use on its own.

Using Within TuneSalon

If you just want to use your model for chatting or testing, you don't need to set up anything external.

Chat Tab (Web)

Go to the Chat tab, load the base model, and attach your .adapter file. You can load adapters from your Library or upload from your computer. You can even stack up to 5 adapters at once to combine different skills.

TuneSalon Desktop (Local)

The free desktop app lets you run your .gguf file entirely on your own hardware. Nothing leaves your computer. This is the best option if privacy is a priority and you have a GPU with enough memory to run the model.

For most people using their model personally or testing it before deployment, these two options are all you need. The rest of this guide is for when you want your model running on a live website or service.

Deploying to Your Website or Service

Let's say you've fine-tuned a customer support model and you want it answering questions on your website, 24 hours a day. Or you've built a legal writing assistant and want it available as an API that your team's tools can call.

To make this work, your model needs to run on a server that's always on and fast enough to generate responses in real time. This requires a cloud GPU, because AI models need specialised hardware to run at a reasonable speed.

The good news is that you don't need to build this from scratch. There are platforms designed specifically for hosting AI models. You upload your .gguf file, and they give you an API endpoint: a URL that your website or app can send questions to and receive answers from.

What is an API endpoint?

An API endpoint is simply a URL that accepts requests and returns responses. Your website sends a message to the URL, and the model's response comes back. It works like a postal address for your AI: send a question to this address, get an answer back.
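To make this concrete, here is a minimal sketch of how a website's backend might package a question for such an endpoint, using only Python's standard library. The URL, model name, and API key are placeholders (your hosting platform supplies the real ones), and the exact request shape varies by platform:

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the URL your hosting platform gives you.
ENDPOINT = "https://api.example.com/v1/chat/completions"

def build_request(message: str) -> urllib.request.Request:
    """Package a user's message as an HTTP request the endpoint understands."""
    payload = {
        "model": "my-fine-tuned-model",  # placeholder model name
        "messages": [{"role": "user", "content": message}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
        },
    )

# Actually sending it and reading the reply would look like:
#   with urllib.request.urlopen(build_request("What are your hours?")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
req = build_request("What are your opening hours?")
print(req.full_url)
```

That round trip, build a request, send it to the address, read the answer, is the whole interaction, which is why the postal-address analogy holds up.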

Option A: Hosted Platforms

This is the easiest path. These platforms handle the GPU hardware, server management, and scaling for you. You upload your model file and get an API endpoint.

Here are some of the platforms that support custom GGUF model hosting. Each has its own pricing, features, and setup process. We'd encourage you to explore them and pick what fits your needs.


Replicate

Upload your model through their web interface or CLI, and get a working API endpoint. Known for being beginner-friendly with clear documentation.

Replicate: Push a custom model →

Hugging Face Inference Endpoints

Upload your model to a Hugging Face repository and deploy it to a dedicated GPU with a few clicks. You get an API endpoint that stays running as long as you need it.

Hugging Face: Inference Endpoints docs →

These are some of the well-known options, but the space is growing quickly. Search for "GGUF model hosting" or "custom LLM inference hosting" to find more.

Option B: Self-Hosted (Advanced)

If you have technical experience and want full control over your infrastructure, you can rent a cloud GPU and run the inference server yourself. This gives you more flexibility over configuration, cost optimisation, and scaling, but requires hands-on setup.

The general approach is:

  1. Rent a cloud GPU from a provider like Modal, RunPod, Lambda Labs, Vast.ai, or a major cloud (AWS, GCP, Azure).
  2. Set up an inference server using software like llama.cpp (which natively supports GGUF files) or vLLM.
  3. Expose an API endpoint that your website or service can call.
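The last step, exposing an endpoint, is a smaller surface than it sounds. This toy sketch stands up an endpoint with Python's standard library; the placeholder `generate()` stands in for the real inference call, which software like llama.cpp or vLLM would handle for you:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def generate(prompt: str) -> str:
    """Placeholder: a real server would call into llama.cpp or vLLM here."""
    return f"Echo: {prompt}"

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, run "inference", and reply with JSON.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"response": generate(body["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Port 0 asks the OS for any free port; run the server in a background thread.
server = HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as the client: POST a prompt and print the model's "answer".
req = Request(
    f"http://127.0.0.1:{server.server_port}/",
    data=json.dumps({"prompt": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.load(resp)["response"])  # prints: Echo: hello
server.shutdown()
```

A production setup adds authentication, batching, and streaming on top, but the request/response loop is the same shape.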

TuneSalon itself uses Modal for cloud GPU workloads. Modal lets you define GPU functions in Python and deploy them as serverless endpoints, meaning you only pay when your model is actually processing requests.

Who is this for?

Option B is best suited for developers or teams with experience managing servers and cloud infrastructure. If terms like "Docker", "SSH", and "API server" are unfamiliar, Option A (hosted platforms) will save you a lot of time and frustration.

A Note on Model Size

Larger models produce higher-quality responses but require more powerful (and more expensive) GPU hardware to run. When choosing where to deploy, you'll need a GPU with enough memory to hold your model.

The Model Guide lists the memory requirements for each model at full precision. For deployment, you can use quantization to reduce the memory needed. Quantization compresses the model into a smaller format, which lowers the GPU requirement. The trade-off is that model quality may decrease slightly, especially at higher compression levels.
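A rough back-of-envelope calculation shows why quantization matters. Assuming memory is dominated by the weights, with some headroom for the KV cache and activations (the 20% overhead here is an illustrative assumption, not a TuneSalon figure):

```python
def estimated_vram_gb(n_params_billion: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory plus ~20% for cache and activations."""
    weight_gb = n_params_billion * bits_per_weight / 8  # bits -> bytes
    return weight_gb * overhead

# A 7B-parameter model at 16-bit full precision vs 4-bit quantized:
print(round(estimated_vram_gb(7, 16), 1))  # 16.8 (GB)
print(round(estimated_vram_gb(7, 4), 1))   # 4.2 (GB)
```

Quantizing from 16-bit to 4-bit cuts the estimate roughly fourfold, which is often the difference between needing a high-end GPU and fitting on a modest one.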

Most hosting platforms let you choose a GPU tier when you deploy. Pick a GPU that comfortably fits your model. The platform's documentation will help you match your model size to the right hardware.

Connecting to Your Website

Once your model is deployed on a hosting platform (either Option A or B), you'll have an API endpoint: a URL that accepts messages and returns your model's responses.

The pattern is straightforward: your website sends a request to the endpoint with the user's message, waits for the model to generate a response, and displays it. Most hosting platforms provide an OpenAI-compatible API, which means the way you send and receive messages follows a widely used standard.
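For illustration, here is what that standard looks like on the wire. The model name and message contents below are made up; the key point is that the request and response shapes stay the same across OpenAI-compatible platforms, so your integration code barely changes if you switch hosts:

```python
import json

# Request body in the widely used OpenAI chat-completions shape.
request_body = {
    "model": "my-fine-tuned-model",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    "temperature": 0.7,
}

# A typical response from an OpenAI-compatible endpoint looks like this
# (contents invented for the example):
response_body = json.loads("""
{"choices": [{"message": {"role": "assistant",
  "content": "Click 'Forgot password' on the sign-in page."}}]}
""")

# The model's reply sits at the same path regardless of platform:
print(response_body["choices"][0]["message"]["content"])
```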

Each platform provides its own integration guide with code examples for connecting from a website, mobile app, or backend service; see the documentation linked under Option A for platform-specific details.

Next Steps

You've trained a custom AI, exported it, and now know how to take it from TuneSalon to the rest of the world. Here are some related resources:

  • Model Guide — Find the right model for your use case and check memory requirements.
  • GPU Guide — Understand the difference between A100 and H200 cloud GPUs on TuneSalon.
  • Our Method — How LoRA adapters work and why they're efficient.
  • Train a model — Ready to build something new? Start training.