Running Local LLMs with AI Controller
This guide explains how to run local Large Language Models (LLMs) using llama.cpp and Ollama, and how to connect them to AI Controller for centralized management, monitoring, and governance.
Overview
Running LLMs locally provides several advantages:
- Privacy and Security: Keep sensitive data within your infrastructure
- Reduced Latency: Eliminate network delays when using models
- Cost Efficiency: Avoid per-token or subscription fees from commercial providers
- Customization: Fine-tune models for your specific use cases
AI Controller can integrate with these local LLM endpoints just like it does with commercial API providers, allowing you to centralize management and apply consistent governance policies.
Setting Up Local LLMs
There are two popular frameworks for running LLMs locally:
- llama.cpp: A lightweight C/C++ implementation designed for efficient CPU and GPU inference
- Ollama: A user-friendly tool that simplifies model management and provides API compatibility with OpenAI
Option 1: Setting Up llama.cpp Server
llama.cpp is an efficient C/C++ implementation that can run LLMs on consumer hardware. Here's how to set it up:
- Clone and build llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
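This produces a CPU-only build. If you have a supported NVIDIA GPU, you can enable acceleration at configure time; the exact CMake flag depends on your llama.cpp version (recent checkouts use GGML_CUDA, older ones used LLAMA_CUBLAS), so check the build documentation for your version. A sketch for a recent checkout:
# Optional: GPU-accelerated build (flag name varies by llama.cpp version)
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release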
- Download a GGUF model:
- Download models in GGUF format from Hugging Face (see the example download command below)
- Examples of popular models in GGUF format:
- Llama 3: https://huggingface.co/Meta-Llama/Llama-3-8B-Instruct-GGUF
- Mistral: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
- Phi-3: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-GGUF
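One convenient way to fetch a GGUF file is the huggingface-cli tool from the huggingface_hub package; the quantization file name below is illustrative, so pick whichever variant in the repository fits your hardware:
# Example: download one quantized GGUF file (file name is illustrative)
pip install -U huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models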
- Start the llama.cpp server:
# Basic server on port 8080
./llama-server -m /path/to/model.gguf --port 8080
# With more configuration options
./llama-server -m /path/to/model.gguf --port 8080 --ctx-size 4096 --threads 8
- Access the server:
- Web UI is available at http://localhost:8080
- API endpoint is at http://localhost:8080/v1/chat/completions (OpenAI-compatible)
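To verify the OpenAI-compatible endpoint, you can send a minimal chat completion request with curl (llama-server answers with whichever model it was started with; depending on your version you may also need to include a "model" field in the body):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'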
For more detailed instructions and configuration options, refer to the llama.cpp server documentation.
Option 2: Setting Up Ollama
Ollama provides a more user-friendly approach to running local LLMs with a focus on simplicity and compatibility with OpenAI's API.
- Install Ollama:
- Linux: Install with the official script: curl -fsSL https://ollama.com/install.sh | sh (or download from https://ollama.com)
- macOS: Download from https://ollama.com
- Windows: Download from https://ollama.com
- Pull a model:
# Pull the latest Llama 3 model
ollama pull llama3
# Or pull other models like Mistral or Phi
ollama pull mistral
ollama pull phi3
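You can confirm which models are available locally with:
ollama list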
- Start the Ollama server:
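On Linux the installer typically sets up Ollama as a background service; if the server is not already running on your system, start it manually:
ollama serve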
- Make the server accessible from your network (optional):
By default, Ollama only listens on localhost. To make it accessible from your network:
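Ollama reads the OLLAMA_HOST environment variable to choose its bind address; setting it to 0.0.0.0 makes the server listen on all interfaces (if Ollama runs as a systemd service, set the variable in the service's environment instead and restart it):
# Listen on all network interfaces instead of only localhost
export OLLAMA_HOST=0.0.0.0
ollama serve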
- Test the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello, how are you?",
  "stream": false
}'
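Since recent Ollama versions also expose an OpenAI-compatible API (as noted in the overview), you can alternatively test the chat endpoint in that format:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'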
For more information, visit the Ollama GitHub repository and API documentation.
Integrating with AI Controller
Once you have your local LLM server running (either llama.cpp or Ollama), you can integrate it with AI Controller:
- Access the AI Controller admin interface
- Navigate to Provider Management
- Click "CREATE" to add a new provider
- Configure the provider with these settings:
For llama.cpp server:
- Name: local-llama-cpp (or another descriptive name)
- URL: http://[server-ip]:8080/v1/chat/completions
- Auth Type: None (or appropriate auth if configured)
- API Key: None (or appropriate key if configured)
For Ollama:
- Name: local-ollama (or another descriptive name)
- URL: http://[server-ip]:11434/api/generate
- Auth Type: None (no authentication required by default)
- API Key: None (no API key required by default)
- Click "CREATE" to save the provider configuration
Now you can create AI Controller API keys that use these local LLM providers, applying the same governance, monitoring, and security features that you would with commercial providers.
Usage Considerations
When using local LLMs with AI Controller:
- Hardware Requirements: Ensure adequate CPU/GPU resources based on model size
- Network Configuration: If running on separate machines, ensure proper network connectivity
- Model Selection: Choose models based on:
- Size (smaller models require less RAM/VRAM)
- Capability (instruction-tuned models work best for chat applications)
- License (ensure compliance with model licenses)
- Performance Expectations: Local models may be slower than commercial APIs depending on hardware
- Security: Even for local endpoints, apply appropriate security policies in AI Controller
Example: Configuring a Local Endpoint for Different Models
You can configure multiple providers for different models running on the same local server. For instance, with Ollama:
- Set up a provider for Llama 3:
- Name: local-llama3
- URL: http://[server-ip]:11434/api/generate
- Include the model name in the request payload (see the example below)
- Set up a provider for Mistral:
- Name: local-mistral
- URL: http://[server-ip]:11434/api/generate
- Include the model name in the request payload (see the example below)
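For reference, a request routed to the local-mistral provider would carry the model name in the body, using the same /api/generate format as the test request shown earlier:
curl http://[server-ip]:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Hello, how are you?",
  "stream": false
}'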
This approach allows you to use different local models for different use cases while maintaining centralized management through AI Controller.
Troubleshooting
Common issues when integrating local LLMs with AI Controller:
| Issue | Possible Causes | Solutions |
|---|---|---|
| Connection timeout | LLM server not running; network issues | Verify the server is running; check network connectivity |
| Model not loading | Insufficient memory; corrupted model file | Check hardware resources; re-download the model |
| Slow responses | Hardware limitations; large context size | Upgrade hardware; reduce context size |
| Incompatible API | Different API format | Check API compatibility; use OpenAI-compatible server mode |
Resources
- llama.cpp: https://github.com/ggml-org/llama.cpp
- Ollama: https://ollama.com
- Ollama GitHub repository: https://github.com/ollama/ollama
Updated: 2025-05-27