Create Your Own Locally-Run LLM Conversational Virtual Agent in Unity – Part 1: Backend

As someone who has worked extensively on conversational agents in XR healthcare applications, I’ve encountered the limitations of relying on cloud APIs for real-time interaction—especially in latency-sensitive, privacy-critical, or cost-conscious scenarios.

In two of my recent XR projects, I explored how locally-run virtual agents powered by LLMs and multimodal input/output pipelines can improve both patient trust and physician control. This work was also featured in the German documentary “KI: Die Medizin von morgen” (AI: The Medicine of Tomorrow), which aired on April 15, 2025, as part of BR’s Gesundheit! program.

Now, I want to help you build your own fully local, real-time conversational AI agent in Unity—perfect for game characters, immersive XR assistants, or educational avatars.

What You’ll Build

In this two-part tutorial, I’ll walk through a complete setup for creating a locally-hosted, offline-capable AI assistant inside Unity.

This post (Part 1) will focus on the backend pipeline:

  • Speech-to-Text (STT): Captures microphone input and transcribes it to text
  • Large Language Model (LLM): Generates intelligent, context-aware responses
  • Text-to-Speech (TTS): Converts the response into natural-sounding speech

In Part 2, I’ll cover how to create the avatar front-end, including:

  • Humanoid Character Creation
  • Lip Sync
  • Animation

Why Unity?

Unity isn’t just for games. It’s a powerful tool for:

  • Creating expressive avatars with gesture and animation support.
  • Deploying immersive XR experiences that work with most AR/VR devices out of the box.
  • Integrating multimodal input/output (speech, gaze, hand tracking, etc.) for a more natural AI interaction.

But maybe I will also switch to Godot one day 🙂


Step 1: Speech-to-Text with Whisper

Option A: Run Whisper Inside Unity

whisper.unity provides C# bindings for whisper.cpp, which runs OpenAI’s Whisper STT model, and brings transcription directly into your project. Just follow its README and the sample projects to set it up in your Unity project.

Pros:

  • Everything runs inside Unity.
  • Great for PC-based agents.

Cons:

  • Not suitable for standalone AR/VR headsets (e.g., Meta Quest) due to high compute demands.

📝 I used whisper.unity in my mixed reality study where transcription speed and local privacy were essential. The tiny model balanced accuracy and latency well.
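
To give you a feel for the Unity side, here is a minimal sketch of a record-then-transcribe loop. It assumes the WhisperManager component and GetTextAsync(AudioClip) call that the package’s sample scenes use, so treat the names as assumptions and double-check them against the version you install:

```csharp
using UnityEngine;
using Whisper; // namespace used by the whisper.unity package (check its README)

// Minimal sketch: record a few seconds from the microphone and transcribe it locally.
// Assumes WhisperManager.GetTextAsync(AudioClip) as used in the package's samples.
public class LocalWhisperSTT : MonoBehaviour
{
    public WhisperManager whisper;   // assign in the Inspector
    public int recordSeconds = 5;

    private AudioClip _clip;

    public void StartRecording()
    {
        // null device = default microphone; 16 kHz is plenty for Whisper
        _clip = Microphone.Start(null, false, recordSeconds, 16000);
    }

    public async void StopAndTranscribe()
    {
        Microphone.End(null);
        var result = await whisper.GetTextAsync(_clip); // see the package samples for the exact API
        Debug.Log($"Transcription: {result.Result}");
        // Hand result.Result to the LLM step (Step 2).
    }
}
```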


Option B: Stream Audio to a Whisper Server

For deployment on lightweight clients or XR headsets:

  1. Set up a Whisper server on a local machine with GPU/CPU acceleration (e.g., using whisper.cpp or OpenAI’s Whisper in Python).
  2. In Unity, stream microphone input to the server using WebSocket or HTTP.
  3. Receive and parse the transcription result in Unity.

This offloads STT compute while keeping everything on your LAN. I have been using RealtimeSTT for this. The repo only provides Python implementations of the server and client, but I have made a C# version of the client based on the Python one, so you can drop it into your Unity project and run the Python server somewhere else.
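
If you prefer to roll your own client, here is a rough sketch of the simplest possible variant: record an AudioClip, pack it as 16-bit PCM WAV, and POST it to a local STT endpoint over HTTP. The server URL and JSON response shape are placeholders (RealtimeSTT itself talks over WebSockets), so adapt both to whatever server you actually run:

```csharp
using System;
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// Minimal sketch: send a recorded AudioClip to a local STT server over HTTP.
// The endpoint and the response format are placeholders; adapt them to your server
// (whisper.cpp server, RealtimeSTT, your own FastAPI wrapper, ...).
public class RemoteWhisperClient : MonoBehaviour
{
    [SerializeField] private string serverUrl = "http://192.168.1.10:9090/transcribe";

    public IEnumerator Transcribe(AudioClip clip, Action<string> onResult)
    {
        byte[] wav = ToWav16BitPcm(clip);

        using var req = new UnityWebRequest(serverUrl, UnityWebRequest.kHttpVerbPOST);
        req.uploadHandler = new UploadHandlerRaw(wav);
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("Content-Type", "audio/wav");

        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success)
            onResult?.Invoke(req.downloadHandler.text); // e.g. parse {"text": "..."} with your JSON library
        else
            Debug.LogError($"STT request failed: {req.error}");
    }

    // Converts an AudioClip to a 16-bit PCM WAV byte array.
    private static byte[] ToWav16BitPcm(AudioClip clip)
    {
        float[] samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        using var stream = new System.IO.MemoryStream();
        using var writer = new System.IO.BinaryWriter(stream);

        int byteCount = samples.Length * 2;
        writer.Write(System.Text.Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(36 + byteCount);
        writer.Write(System.Text.Encoding.ASCII.GetBytes("WAVEfmt "));
        writer.Write(16);                                   // PCM chunk size
        writer.Write((short)1);                             // PCM format
        writer.Write((short)clip.channels);
        writer.Write(clip.frequency);
        writer.Write(clip.frequency * clip.channels * 2);   // byte rate
        writer.Write((short)(clip.channels * 2));           // block align
        writer.Write((short)16);                            // bits per sample
        writer.Write(System.Text.Encoding.ASCII.GetBytes("data"));
        writer.Write(byteCount);
        foreach (float s in samples)
            writer.Write((short)(Mathf.Clamp(s, -1f, 1f) * short.MaxValue));

        return stream.ToArray();
    }
}
```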

I would suggest creating a Python virtual environment to install all the packages 🙂


Step 2: LLM for Response Generation

Option A: Run LLM for Unity (Local Inference)

LLM for Unity is a fantastic Unity package that allows you to run LLMs like Gemma 3 or DeepSeek inside Unity. Setup is very easy to follow using its README, and it supports local GGUF models.

  • I recommend it for prototyping on desktop.
  • Models like Phi-3, Gemma 3, or Qwen 3 can work; depending on your machine specs, you can run a small variant (quicker but dumber responses) or a larger one (slower but smarter responses).
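
Here is a minimal usage sketch, following the pattern shown in the LLM for Unity README: an LLM component holds the model, and an LLMCharacter component handles the conversation. The exact method signatures can change between releases, so verify them against the README of the version you install:

```csharp
using UnityEngine;
using LLMUnity; // namespace of the "LLM for Unity" package

// Minimal sketch of local inference with LLM for Unity, based on the usage pattern
// in its README. Double-check the exact API against the version you install.
public class LocalLLMChat : MonoBehaviour
{
    public LLMCharacter llmCharacter; // assign in the Inspector, linked to an LLM component

    public void Ask(string userText)
    {
        // The first callback receives the reply as it streams in.
        _ = llmCharacter.Chat(userText, HandleReply, ReplyCompleted);
    }

    private void HandleReply(string reply)
    {
        Debug.Log($"LLM: {reply}");
    }

    private void ReplyCompleted()
    {
        // Hand the final reply to the TTS step (Step 3).
        Debug.Log("Reply finished.");
    }
}
```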

Option B: Run the LLM as an HTTP Server

In this setup, you query the LLM from Unity using HTTP requests.

LLM for Unity also supports a remote server mode; just follow the README to set it up.

In addition, if you would like to, you can also use LM Studio as a server. Here is a more detailed tutorial:
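
Whichever server you choose, the Unity side looks roughly the same: POST an OpenAI-compatible chat completion request and parse the reply. Below is a hedged sketch assuming LM Studio’s default endpoint at http://localhost:1234/v1/chat/completions and a placeholder model name; you will also need a JSON library of your choice to pull choices[0].message.content out of the response:

```csharp
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

// Minimal sketch: query an OpenAI-compatible chat endpoint (e.g. LM Studio's local
// server). The endpoint and model name are placeholders; use whatever your server exposes.
public class LLMHttpClient : MonoBehaviour
{
    [SerializeField] private string endpoint = "http://localhost:1234/v1/chat/completions";
    [SerializeField] private string model = "your-model-name";

    public IEnumerator Ask(string userText, Action<string> onReply)
    {
        // Build the request body by hand to keep the sketch dependency-free.
        string body = "{\"model\":\"" + model + "\",\"messages\":[{\"role\":\"user\",\"content\":\""
                      + userText.Replace("\"", "\\\"") + "\"}]}";

        using var req = new UnityWebRequest(endpoint, UnityWebRequest.kHttpVerbPOST);
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("Content-Type", "application/json");

        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success)
            onReply?.Invoke(req.downloadHandler.text); // parse choices[0].message.content with your JSON library
        else
            Debug.LogError($"LLM request failed: {req.error}");
    }
}
```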

Step 3: Text-to-Speech with KoKoro

For TTS, I’ve evaluated several open models and settled on KoKoro 1.0, an open-source voice model with natural tone and low latency.

Why KoKoro?

  • Ranked highly as a free model on TTS Arena.
  • Community-built FastAPI wrappers make it easy to run as a server.

Setup:

  1. Run KoKoro as a FastAPI server using Kokoro-FastAPI.
  2. In Unity, send a POST request with your text.
  3. Receive the audio file or stream and play it through Unity’s AudioSource using this C# script.

This pipeline keeps voice output consistent with your avatar’s tone and response timing.
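
For reference, here is a hedged sketch of that request/playback loop: it POSTs text to the OpenAI-style speech endpoint that Kokoro-FastAPI exposes and plays the returned WAV through an AudioSource. The port, voice id, and JSON fields follow the project’s documented defaults, but treat them as placeholders and confirm them against its README:

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

// Minimal sketch: request speech from a local Kokoro-FastAPI server and play it.
// Endpoint, voice, and JSON fields are placeholders based on its OpenAI-style API;
// check the Kokoro-FastAPI README for the exact port, voices, and parameters.
public class KokoroTTSClient : MonoBehaviour
{
    [SerializeField] private string endpoint = "http://localhost:8880/v1/audio/speech";
    [SerializeField] private string voice = "af_bella";   // placeholder voice id
    [SerializeField] private AudioSource audioSource;     // assign in the Inspector

    public IEnumerator Speak(string text)
    {
        string body = "{\"model\":\"kokoro\",\"input\":\"" + text.Replace("\"", "\\\"")
                      + "\",\"voice\":\"" + voice + "\",\"response_format\":\"wav\"}";

        using var req = new UnityWebRequest(endpoint, UnityWebRequest.kHttpVerbPOST);
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        req.downloadHandler = new DownloadHandlerAudioClip(endpoint, AudioType.WAV);
        req.SetRequestHeader("Content-Type", "application/json");

        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success)
        {
            AudioClip clip = DownloadHandlerAudioClip.GetContent(req);
            audioSource.clip = clip;
            audioSource.Play();
        }
        else
        {
            Debug.LogError($"TTS request failed: {req.error}");
        }
    }
}
```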


Backend Pipeline Recap

🎤 User speaks into microphone
⬇️
📝 Whisper transcribes speech
⬇️
🧠 LLM generates a response
⬇️
🔊 KoKoro TTS converts response to audio
⬇️
🎧 Unity plays the audio to the user

Coming Next: Avatar, Animation, and Lip Sync

In Part 2, I’ll walk you through:

  • Creating and rigging a humanoid avatar
  • Driving gestures and gaze using LLM outputs
  • Lip-syncing with your TTS audio
  • Building immersive, emotionally intelligent agents

If you’d like demo scripts or starter projects for any of these modules, just let me know in the comments or message me directly. I have also open-sourced my virtual agent implementation on GitHub:

https://github.com/stytim/agent-pipeline-minimal

Stay tuned for Part 2: Bringing Your Virtual Agent to Life in Unity.