
Guide to Local LLMs: Getting Started with Ollama, LM Studio and more

MacFleet Team

With the growing interest in AI privacy and customization, running large language models (LLMs) locally on your own hardware has become increasingly popular. But for beginners, the ecosystem of tools like Ollama, LM Studio, and Open WebUI can be overwhelming. This guide breaks down everything you need to know to get started with local LLMs.

Understanding local LLMs

Running LLMs locally offers several advantages:

  • Complete Privacy: Your data never leaves your machine
  • No Subscription Costs: Use open-source models for free
  • Customization: Fine-tune models for specific use cases
  • Offline Access: Work without an internet connection

Hardware requirements

Your hardware will determine which models you can run effectively:

GPU VRAM Requirements

  • 4GB VRAM: Run Gemma 2B, Phi 3 Mini at Q8 or Llama 3 8B/Gemma 9B at Q4
  • 8GB VRAM: Run Llama 3 8B/Gemma 9B at Q8
  • 16GB VRAM: Run Gemma 27B/Command R 35B at Q4
  • 24GB VRAM: Run Gemma 27B at Q6 or Llama 3 70B at Q2

Quantization (Q2, Q4, and so on) reduces the precision of a model's weights so it can run on less powerful hardware. Q8 offers high quality with minimal intelligence loss, while Q2 is suitable only for large models on non-coding tasks.
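As a rough sanity check on these numbers, a model's weight footprint is approximately parameter count × bits per weight ÷ 8, plus overhead for the KV cache and runtime buffers. The 20% overhead factor in this sketch is an illustrative assumption, not a measurement:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus an assumed ~20% for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Example: an 8B-parameter model at Q4 (~4 bits/weight) vs. Q8 (~8 bits/weight)
print(f"8B @ Q4: ~{estimate_vram_gb(8, 4):.1f} GB")
print(f"8B @ Q8: ~{estimate_vram_gb(8, 8):.1f} GB")
```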

Best tools for beginners

LM Studio

LM Studio offers the simplest entry point for beginners:

  • Easy-to-use GUI interface
  • Built-in model library with one-click downloads
  • Automatic quantization options
  • OpenAI-compatible API server
  • Support for embedding models like Nomic Embed v1.5
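If you enable LM Studio's local server with an embedding model loaded, you can call it through its OpenAI-compatible embeddings endpoint. This sketch assumes the server is on LM Studio's default port (1234) and that the model identifier below matches what you loaded; adjust both to your setup:

```python
from openai import OpenAI

# Assumes LM Studio's local server is running on its default port with an
# embedding model loaded; the model name is a placeholder.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["Local LLMs keep your data on your own machine."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector
```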

Ollama

Ollama provides a more developer-focused approach:

  • Command-line interface (simple but powerful)
  • Great for programmers and API integration (see the sketch after this list)
  • Excellent performance optimization
  • Works well with various front-ends
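Ollama exposes a small HTTP API on localhost (port 11434 by default), which is what most front-ends talk to. A minimal sketch, assuming the Ollama service is running and a model such as llama3 has already been pulled:

```python
import requests

# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```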

AnythingLLM

AnythingLLM combines document processing with local LLMs:

  • Built-in RAG (Retrieval-Augmented Generation)
  • Document indexing and vectorization
  • User-friendly interface
  • Both local and cloud model support

Open WebUI

A powerful front-end primarily for Ollama:

  • Rich feature set
  • Multi-user support
  • Works over local networks
  • Customization options

Step-by-step setup guide

Getting started with LM Studio

  1. Download and install LM Studio from their website
  2. Browse the model library and download a model that fits your hardware
  3. Select your preferred quantization level
  4. Run the model locally and start chatting
  5. Optionally, enable the API server to connect with other applications
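Once the server is enabled, any OpenAI-compatible client can connect to it. Here is a short sketch using the openai Python package; the port (1234) reflects LM Studio's default, and the model name is a placeholder since the server answers with whichever model you have loaded:

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key is not checked.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; the loaded model is used regardless
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(completion.choices[0].message.content)
```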

Popular frameworks for running LLMs locally

There are several excellent frameworks for running LLMs on your local machine. Here's a breakdown of the most user-friendly options:

1. GPT4All

GPT4All is one of the most beginner-friendly options for running LLMs locally:

  • Easy setup: Simple installation process with a user-friendly GUI
  • GPU acceleration: Automatically uses CUDA if available
  • OpenAI integration: Can use your OpenAI API key to access GPT-3.5/4
  • Context-aware responses: Connect local folders for document-based queries
  • API server: Enable the API server for integration with other applications

Explore GPT4All →
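Beyond the GUI, GPT4All also ships a Python SDK, which is handy for scripting. A brief sketch, assuming the gpt4all package is installed; the model name is illustrative and is downloaded on first use if it is in GPT4All's catalog:

```python
from gpt4all import GPT4All

# Downloads the model on first run if it is not already cached locally.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    reply = model.generate("What are the benefits of running LLMs locally?", max_tokens=200)
    print(reply)
```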

2. LM Studio

LM Studio offers more customization than GPT4All:

  • Rich model library: Easy access to download models from Hugging Face
  • Multiple model sessions: Run and compare different models simultaneously
  • Advanced configuration: Fine-tune model parameters for optimal performance
  • Local inference server: Launch an API server with one click
  • High performance: Optimized for speed with GPU acceleration

Explore LM Studio →

3. AnythingLLM

AnythingLLM combines document processing with local LLMs:

  • Built-in RAG: Integrated Retrieval-Augmented Generation
  • Document indexing: Automatically processes and vectorizes your content
  • User-friendly interface: Clean design for easy interaction
  • Flexible model support: Works with both local and cloud models
  • Multi-user capability: Supports team collaboration

Explore AnythingLLM →

4. Jan

Jan combines speed with an elegant interface:

  • Fast response generation: Generates responses at ~53 tokens/sec
  • Beautiful UI: Clean, ChatGPT-like interface
  • Model importing: Import models from other frameworks
  • Extensions: Install extensions to enhance functionality
  • Proprietary model support: Use models from OpenAI, Mistral AI, and Groq

Explore Jan →

5. llama.cpp

A powerful C/C++ implementation that powers many LLM applications:

  • High efficiency: Written in C/C++ for maximum performance
  • Flexible deployment: Run via command line or web interface
  • GPU acceleration: Install CUDA-enabled version for faster responses
  • Deep customization: Fine-tune all model parameters
  • Developer-friendly: Great for integrating into custom applications (see the sketch below)

Explore llama.cpp →
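For embedding llama.cpp in your own code, the llama-cpp-python bindings are a common route. A sketch under the assumption that the package is installed and a GGUF model file already exists at the (placeholder) path below:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU when a CUDA/Metal build is
# installed; set it to 0 to run entirely on the CPU.
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what llama.cpp does."}],
    max_tokens=128,
)
print(output["choices"][0]["message"]["content"])
```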

6. llamafile

Simplifies llama.cpp into a single executable file:

  • Single-file executable: Combines llama.cpp with Cosmopolitan Libc
  • No configuration needed: Automatically uses GPU without setup
  • Multimodal support: Models like LLaVA can process images and text
  • High performance: Much faster than standard llama.cpp (up to 5x)
  • Cross-platform: Works on Windows, macOS, and Linux seamlessly

Explore llamafile →

7. Ollama

Command-line focused tool with wide application support:

  • Terminal-based: Easy to use through command line
  • Wide model support: Access Llama 3, Mistral, Gemma, and more
  • Application integration: Many applications accept Ollama integration
  • Custom model support: Use downloaded models from other frameworks
  • Simple commands: Easy-to-remember commands for model management (see the Python sketch below)

Get started with our Ollama guide →
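If you would rather stay in Python than the terminal, the official ollama package mirrors the CLI. The calls below assume the package is installed and the Ollama service is running:

```python
import ollama

# Fetch a model if it is not already available locally (like `ollama pull llama3`).
ollama.pull("llama3")

# Chat with the pulled model (like `ollama run llama3`).
reply = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "What can I do with a local LLM?"}],
)
print(reply["message"]["content"])
```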

8. NextChat

Perfect for those who want to use proprietary models locally:

  • API integration: Use GPT-3, GPT-4, and Gemini Pro via API keys
  • Web UI available: Also available as a web application
  • One-click deployment: Deploy your own web instance easily
  • Local data storage: User data saved locally for privacy
  • Customization options: Full control over model parameters

Explore NextChat →

Setting up document processing (RAG)

For those looking to chat with their documents:

  1. Choose a solution with RAG capabilities (AnythingLLM, Jan)
  2. Import your documents (PDFs, Word files, code repositories)
  3. The system will automatically index and vectorize your content
  4. Connect to your local LLM or a cloud provider
  5. Start asking questions about your documents
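To make that flow concrete, here is a deliberately minimal sketch of the retrieve-then-generate loop these tools automate for you. It assumes an Ollama server with the nomic-embed-text and llama3 models pulled, and it skips the chunking and vector-database steps a real RAG pipeline would add:

```python
import ollama

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]

def embed(text: str) -> list[float]:
    # Assumes `ollama pull nomic-embed-text` has been run.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# 1. Index: embed every document (real tools persist these in a vector store).
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve: embed the question and pick the most similar document.
question = "When can I return a product?"
q_vec = embed(question)
best_doc = max(index, key=lambda item: cosine(q_vec, item[1]))[0]

# 3. Generate: answer using only the retrieved context.
answer = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": f"Context: {best_doc}\n\nQuestion: {question}"}],
)
print(answer["message"]["content"])
```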

Advanced topics

Understanding model sizes and capabilities

Different model sizes offer various capabilities:

  • Small models (2B-8B parameters): Basic assistance, limited reasoning
  • Medium models (8B-30B parameters): Good reasoning, coding abilities
  • Large models (30B+ parameters): Advanced reasoning, specialized knowledge

Running models on multiple GPUs

For larger models, you can distribute the workload:

  • Use tensor parallelism to split models across GPUs
  • Configure VRAM allocation for optimal performance
  • Balance between GPU and CPU offloading
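How you configure this depends on the runtime. As one example, the llama-cpp-python bindings mentioned above expose tensor_split for dividing weights across GPUs and n_gpu_layers for choosing how many layers stay on the GPU versus the CPU; the values below are illustrative starting points, not tuned recommendations:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    tensor_split=[0.5, 0.5],  # split weights roughly evenly across two GPUs
    n_gpu_layers=60,          # offload 60 layers to the GPUs, keep the rest on the CPU
    n_ctx=4096,               # context window; larger values grow the KV cache
)
print(llm.create_completion("Hello", max_tokens=8)["choices"][0]["text"])
```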

Ready to start your local LLM journey?

Running local LLMs gives you control, privacy, and customization that cloud services can't match. Start with LM Studio for the easiest entry point, then explore other options as you become more comfortable with the technology.

Whether you're looking to chat privately with AI, process sensitive documents, or build custom applications, local LLMs offer a powerful alternative to cloud-based solutions. The initial learning curve is well worth the freedom and capabilities you'll gain.
