Running Local LLMs with AI Controller
This guide explains how to run local Large Language Models (LLMs) using llama.cpp and Ollama, and how to connect them to AI Controller for centralized management, monitoring, and governance.
Overview
Running LLMs locally provides several advantages:
- Privacy and Security: Keep sensitive data within your infrastructure
- Reduced Latency: Eliminate network delays when using models
- Cost Efficiency: Avoid per-token or subscription fees from commercial providers
- Customization: Fine-tune models for your specific use cases
AI Controller can integrate with these local LLM endpoints just like it does with commercial API providers, allowing you to centralize management and apply consistent governance policies.
Setting Up Local LLMs
There are two popular frameworks for running LLMs locally:
- llama.cpp: A lightweight C/C++ implementation designed for efficient CPU and GPU inference
- Ollama: A user-friendly tool that simplifies model management and provides API compatibility with OpenAI
Option 1: Setting Up llama.cpp Server
llama.cpp is an efficient C/C++ implementation that can run LLMs on consumer hardware. Here's how to set it up:
- Clone and build llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
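This produces a CPU-only build. If you have a supported NVIDIA GPU, you can enable acceleration at configure time; the exact CMake flag depends on your llama.cpp version (recent checkouts use GGML_CUDA, older ones used LLAMA_CUBLAS), so check the build documentation for your version. A sketch for a recent checkout:
# Optional: GPU-accelerated build (flag name varies by llama.cpp version)
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release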
- Download a GGUF model:
- Download models in GGUF format from Hugging Face (see the example download command below)
- Examples of popular models in GGUF format:
- Llama 3: https://huggingface.co/Meta-Llama/Llama-3-8B-Instruct-GGUF
- Mistral: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
- Phi-3: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-GGUF
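One convenient way to fetch a GGUF file is the huggingface-cli tool from the huggingface_hub package; the quantization file name below is illustrative, so pick whichever variant in the repository fits your hardware:
# Example: download one quantized GGUF file (file name is illustrative)
pip install -U huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models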
- Start the llama.cpp server:
# Basic server on port 8080
./llama-server -m /path/to/model.gguf --port 8080
# With more configuration options
./llama-server -m /path/to/model.gguf --port 8080 --ctx-size 4096 --threads 8
- Access the server:
- Web UI is available at http://localhost:8080
- API endpoint is at http://localhost:8080/v1/chat/completions (OpenAI-compatible)
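To verify the OpenAI-compatible endpoint, you can send a minimal chat completion request with curl (llama-server answers with whichever model it was started with; depending on your version you may also need to include a "model" field in the body):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'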
For more detailed instructions and configuration options, refer to the llama.cpp server documentation.
Option 2: Setting Up Ollama
Ollama provides a more user-friendly approach to running local LLMs with a focus on simplicity and compatibility with OpenAI's API.
- Install Ollama:
- Linux: Install with the official script: curl -fsSL https://ollama.com/install.sh | sh (or download from https://ollama.com)
- macOS: Download from https://ollama.com
- Windows: Download from https://ollama.com
- Pull a model:
# Pull the latest Llama 3 model
ollama pull llama3
# Or pull other models like Mistral or Phi
ollama pull mistral
ollama pull phi3
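You can confirm which models are available locally with:
ollama list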
- Start the Ollama server:
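On Linux the installer typically sets up Ollama as a background service; if the server is not already running on your system, start it manually:
ollama serve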
- Make the server accessible from your network (optional):
By default, Ollama only listens on localhost. To make it accessible from your network:
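Ollama reads the OLLAMA_HOST environment variable to choose its bind address; setting it to 0.0.0.0 makes the server listen on all interfaces (if Ollama runs as a systemd service, set the variable in the service's environment instead and restart it):
# Listen on all network interfaces instead of only localhost
export OLLAMA_HOST=0.0.0.0
ollama serve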
- Test the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello, how are you?",
  "stream": false
}'
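Since recent Ollama versions also expose an OpenAI-compatible API (as noted in the overview), you can alternatively test the chat endpoint in that format:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'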
For more information, visit the Ollama GitHub repository and API documentation.
Integrating with AI Controller
Once you have your local LLM server running (either llama.cpp or Ollama), you can integrate it with AI Controller:
- Access the AI Controller admin interface
- Navigate to Provider Management
- Click "CREATE" to add a new provider
- Configure the provider with these settings:
For llama.cpp server:
- Name: local-llama-cpp (or another descriptive name)
- URL: http://[server-ip]:8080/v1/chat/completions
- Auth Type: None (or appropriate auth if configured)
- API Key: None (or appropriate key if configured)
For Ollama:
- Name: local-ollama (or another descriptive name)
- URL: http://[server-ip]:11434/api/generate
- Auth Type: None (no authentication required by default)
- API Key: None (no API key required by default)
- Click "CREATE" to save the provider configuration
Now you can create AI Controller API keys that use these local LLM providers, applying the same governance, monitoring, and security features that you would with commercial providers.
Usage Considerations
When using local LLMs with AI Controller:
- Hardware Requirements: Ensure adequate CPU/GPU resources based on model size
- Network Configuration: If running on separate machines, ensure proper network connectivity
- Model Selection: Choose models based on:
- Size (smaller models require less RAM/VRAM)
- Capability (instruction-tuned models work best for chat applications)
- License (ensure compliance with model licenses)
- Performance Expectations: Local models may be slower than commercial APIs depending on hardware
- Security: Even for local endpoints, apply appropriate security policies in AI Controller
Example: Configuring a Local Endpoint for Different Models
You can configure multiple providers for different models running on the same local server. For instance, with Ollama:
- Set up a provider for Llama 3:
- Name: local-llama3
- URL: http://[server-ip]:11434/api/generate
- Include the model name in the request payload (see the example below)
- Set up a provider for Mistral:
- Name: local-mistral
- URL: http://[server-ip]:11434/api/generate
- Include the model name in the request payload (see the example below)
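For reference, a request routed to the local-mistral provider would carry the model name in the body, using the same /api/generate format as the test request shown earlier:
curl http://[server-ip]:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Hello, how are you?",
  "stream": false
}'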
This approach allows you to use different local models for different use cases while maintaining centralized management through AI Controller.
Troubleshooting
Common issues when integrating local LLMs with AI Controller:
| Issue | Possible Causes | Solutions |
|---|---|---|
| Connection timeout | LLM server not running; network issues | Verify the server is running; check network connectivity |
| Model not loading | Insufficient memory; corrupted model file | Check hardware resources; re-download the model |
| Slow responses | Hardware limitations; large context size | Upgrade hardware; reduce context size |
| Incompatible API | Different API format | Check API compatibility; use OpenAI-compatible server mode |
Resources
- llama.cpp: https://github.com/ggml-org/llama.cpp
- Ollama: https://ollama.com
- Ollama GitHub repository: https://github.com/ollama/ollama
Updated: 2025-05-27