With the rapid advancements in artificial intelligence (AI), running sophisticated models like Meta's Llama 3.1 locally on personal computers is becoming increasingly popular. Running an LLM on your local PC or Mac provides a sandbox for experimentation and development without compromising data privacy, and it allows more flexibility in model usage.
Here is a quick guide to help you set up and run Llama 3.1, as well as many other models such as Google's Gemma 2, on Mac, Linux, and Windows. I'll also discuss the benefits of privately hosted models.
Why develop and test against different open-source models?
Llama 3.1 8b running on Ollama/Open WebUI
Developing and testing against various open-source models that you host and run privately offers several advantages over relying solely on publicly hosted large language models (LLMs) from providers like OpenAI, Microsoft Copilot, Meta AI, and Google Gemini.
Data privacy: Publicly hosted LLMs require sending data over the internet, which can raise privacy and security concerns. Running models locally ensures that sensitive data remains on your own hardware.
Customization: Open-source models allow for greater customization. Developers can fine-tune models, adjust hyperparameters, and modify the architecture to suit specific use cases better.
Cost control: Cloud-based AI services can be costly, especially for large-scale applications. Hosting models locally can significantly reduce ongoing API usage and data transfer expenses.
Offline capability: Local models can be used without an internet connection, which is essential for applications requiring high availability or in areas with unreliable internet access.
Flexibility and experimentation: Hosting your own models enables you to experiment with different algorithms and configurations, leading to innovative solutions and a deeper understanding of AI technologies.
Freedom from usage policies: Running LLMs locally means the usage policies of companies like OpenAI, Microsoft, Meta, and Google do not restrict you. You can use whatever prompts you want and employ modified LLMs with lifted restrictions, trained on data that these services might restrict.
Introduction to Ollama
Ollama is a versatile and MIT-licensed open-source platform designed to help developers and researchers easily run and manage machine learning models locally on their own hardware. It was developed by a team of AI enthusiasts and engineers who aim to provide tools that ensure data privacy, flexibility, and control over AI applications. Ollama supports various AI models, making it a valuable resource for those looking to explore and utilize AI technologies without relying on third-party cloud services.
Here are some example models that can be downloaded:
Model | Parameters | Size | Download command |
Llama 3.1 | 8B | 4.7GB | ollama run llama3.1 |
Llama 3.1 | 70B | 40GB | ollama run llama3.1:70b |
Llama 3.1 | 405B | 231GB | ollama run llama3.1:405b |
Phi 3 Mini | 3.8B | 2.3GB | ollama run phi3 |
Phi 3 Medium | 14B | 7.9GB | ollama run phi3:medium |
Gemma 2 | 2B | 1.6GB | ollama run gemma2:2b |
Gemma 2 | 9B | 5.5GB | ollama run gemma2 |
Gemma 2 | 27B | 16GB | ollama run gemma2:27b |
Mistral | 7B | 4.1GB | ollama run mistral |
Moondream 2 | 1.4B | 829MB | ollama run moondream |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
LLaVA | 7B | 4.5GB | ollama run llava |
Solar | 10.7B | 6.1GB | ollama run solar |
Per Ollama's GitHub page, you should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
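The guidance above can be turned into a quick lookup. The sketch below is illustrative only: the function name recommended_ram_gb is hypothetical, it accepts whole numbers of billions of parameters, and rounding in-between sizes (such as 8B) up to the next tier is a conservative assumption rather than anything Ollama states.

```shell
#!/usr/bin/env bash
# Map a model size (whole billions of parameters) to the RAM guidance quoted above.
# Assumption: sizes between the quoted tiers are rounded up to the next tier.
recommended_ram_gb() {
  local params_b="$1"
  if   [ "$params_b" -le 7 ];  then echo 8
  elif [ "$params_b" -le 13 ]; then echo 16
  elif [ "$params_b" -le 33 ]; then echo 32
  else
    # The quoted guidance stops at 33B, so larger models are reported as unknown.
    echo "no guidance for ${params_b}B models" >&2
    return 1
  fi
}
```

For example, recommended_ram_gb 13 prints 16, while a 70B model falls outside the quoted guidance and returns an error.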
Our test systems
I tested Ollama on M1 Pro and M1 Ultra Macs with 32GB and 64GB of RAM, which are a few generations behind current MacBook Pro models. Even so, with no discrete GPU to assist, I successfully ran 8B-10B parameter models, including Meta's Llama 3.1 and Google's Gemma 2, as well as various specially trained variants from Ollama's website, with better-than-acceptable performance.
However, I experienced significant performance issues with the 70B parameter variant on these systems. I'm confident that more recent hardware can handle these models more efficiently, especially Linux PCs equipped with Nvidia or AMD GPUs.
Step-by-step setup
Download and install Ollama
Go to Ollama's download page and download the installer suitable for your operating system (macOS, Linux, Windows).
Follow the provided installation instructions for your specific operating system.
Load the 8B parameter Llama 3.1 model
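Loading the 8B model comes down to one command, ollama run followed by the model tag (llama3.1:8b is the tag this article uses for the 8B model). A minimal sketch, wrapped in a hypothetical helper called run_model so that it fails with a clear message when Ollama is not yet on the PATH:

```shell
#!/usr/bin/env bash
# run_model is a hypothetical wrapper for illustration, not part of Ollama itself.
run_model() {
  local model="${1:-llama3.1:8b}"
  if ! command -v ollama >/dev/null 2>&1; then
    echo "ollama not found on PATH; install it first" >&2
    return 127
  fi
  # Downloads the model on first use, then opens an interactive chat prompt.
  ollama run "$model"
}
```

Calling run_model with no arguments starts llama3.1:8b; typing /bye at the prompt ends the chat session.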
The Ollama command line interface with chat functionality.
Manage installed models
List models: Use the command ollama list to see all models installed on your system.
Remove models: To remove a model, use the command ollama rm <model_name>. For example, to remove the 8B parameter Llama 3.1, you would use ollama rm llama3.1:8b
Add new models: To add a new model, browse the Ollama library and then use the appropriate ollama run <model_name> command to load it into your system.
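These commands combine naturally in scripts. As a sketch, the hypothetical helper below reads ollama list output on stdin and reports whether a given model is already installed. It assumes the listing starts with a header row and puts the model name and tag in the first column.

```shell
#!/usr/bin/env bash
# model_installed is a hypothetical helper: it succeeds if the given name
# appears in the first column of `ollama list` output read from stdin.
# Assumption: the first line of that output is a header row and is skipped.
model_installed() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Example use (requires Ollama to be installed):
#   ollama list | model_installed llama3.1:8b || ollama run llama3.1:8b
```

Reading from stdin keeps the helper independent of Ollama itself, so it can be tried against saved output first.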
Adding a WebUI
Install Docker Desktop
Visit Docker's Get Started page and download Docker Desktop for your operating system (macOS, Linux, Windows).
Follow the installation instructions for your specific operating system, and start Docker after installation.
Install Open WebUI
Open a terminal (macOS, Linux) or Command Prompt/PowerShell (Windows) and run the following command to install Open WebUI:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Access the Open WebUI
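The -p 3000:8080 flag in the docker run command publishes the container's port 8080 on port 3000 of the host, so the interface should be reachable at http://localhost:3000 in a browser. The sketch below, with hypothetical helper names, builds that URL and polls it with curl until the container answers; the 30-attempt limit and 2-second pause are arbitrary choices.

```shell
#!/usr/bin/env bash
# Build the local URL for a given host port (3000 matches the -p 3000:8080 mapping).
webui_url() {
  echo "http://localhost:${1:-3000}"
}

# Poll the URL until it responds or the attempts run out.
# Both helpers are hypothetical and not part of Open WebUI.
wait_for_webui() {
  local url attempts=30
  url="$(webui_url "$1")"
  while [ "$attempts" -gt 0 ]; do
    if curl -fs -o /dev/null "$url"; then
      echo "Open WebUI is up at $url"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "Open WebUI did not respond at $url" >&2
  return 1
}
```

If a different host port was chosen in the docker run command, pass it as the argument, for example wait_for_webui 8081.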
Open WebUI running on Docker Desktop
Create and log in to your Open WebUI account
Selecting a model in Open WebUI
Integration with IDEs and APIs
Ollama can be integrated into various Integrated Development Environments (IDEs) using APIs, which enhances the development workflow by providing seamless interaction with AI models. One powerful tool for this integration is Continue, an open-source code assistant that leverages the Ollama API.
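Tools like this talk to Ollama over its local HTTP API. As a rough illustration, the sketch below sends one prompt to the /api/generate endpoint with curl. It assumes Ollama is listening on its default port, 11434; the helper names are hypothetical, and json_escape handles only backslashes and double quotes, so prompts containing newlines or control characters would need a proper JSON tool.

```shell
#!/usr/bin/env bash
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Limitation: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body: model name, prompt, and streaming turned off
# so the reply arrives as a single JSON object.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send the request to a locally running Ollama (assumed at localhost:11434).
ask_ollama() {
  curl -s http://localhost:11434/api/generate -d "$(generate_payload "$1" "$2")"
}
```

With Ollama running, ask_ollama llama3.1 "Why is the sky blue?" prints a JSON object whose response field holds the model's answer.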
Using Continue for IDE integration
Ensure that Ollama is running and accessible.
Follow the Ollama Continue blog instructions to install Continue in your preferred IDE.
With Continue and the Ollama API, you can directly leverage AI-powered features like code suggestions, completions, and debugging assistance within your development environment.
Scaling up with powerful GPUs
For more demanding applications, especially those requiring larger models like the 70B and 405B parameter Llama 3.1 models, running Ollama on a Linux-based system equipped with powerful GPUs is recommended. This setup can handle the computational load and provide faster response times, making it suitable for enterprise-level AI applications.
To use GPUs for running Ollama, follow these steps:
For NVIDIA GPUs:
Follow the NVIDIA CUDA documentation instructions to install CUDA and cuDNN on your system.
After installing CUDA and cuDNN, ensure your environment is configured correctly, then run the following command:
ollama run llama3.1:70b
For AMD GPUs:
Follow the instructions in the ROCm documentation to install ROCm on your system.
After installing ROCm, ensure your environment is configured correctly, then run the following command:
ollama run llama3.1:70b
Ollama detects supported GPUs automatically, so the command needs no GPU-specific flag. With the drivers in place, Ollama can use the available GPUs on your system, providing the computational power needed to run large models. For more detailed instructions, refer to the Ollama GPU documentation.
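To check whether a loaded model actually landed on the GPU, ollama ps lists running models along with the processor they are using. The sketch below is a hypothetical helper that reads that output on stdin and succeeds when the named model's row mentions GPU. It assumes the listing keeps a header row, the model name in the first column, and processor text such as "100% GPU".

```shell
#!/usr/bin/env bash
# on_gpu is a hypothetical helper: it succeeds if the row for the given model
# in `ollama ps` output (read from stdin) mentions GPU anywhere on the line.
# Assumption: the first line is a header and the model name is the first column.
on_gpu() {
  awk -v want="$1" 'NR > 1 && $1 == want && /GPU/ { found = 1 } END { exit !found }'
}

# Example use while a model is loaded:
#   ollama ps | on_gpu llama3.1:70b && echo "running on GPU"
```

A row that reports a CPU/GPU split also matches, so a success here means at least part of the model is on the GPU, not necessarily all of it.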
Running Ollama in a Docker container
You can still leverage GPU support if you prefer running Ollama in a container.
For NVIDIA GPUs with Docker
As per the previous section, install CUDA and cuDNN on your system. Then, follow the instructions in the NVIDIA Docker documentation to install the NVIDIA Container Toolkit on your system.
Use the following command to run Ollama with NVIDIA GPU support in a Docker container:
docker run --gpus all -d -p 11434:11434 -v ollama:/root/.ollama --name ollama --restart always ollama/ollama:latest
For AMD GPUs with Docker
Follow the instructions in the ROCm documentation to install ROCm on your system.
Use the following command to run Ollama with ROCm support in a Docker container:
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
These commands ensure the Docker container can access all available GPUs on your system, providing the necessary computational power to run large models. For more information on using GPUs with Docker and Ollama, refer to the Docker page on using GPUs with Ollama.
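With the container running, a model can be started inside it through docker exec. This sketch assumes the container is named ollama, as in the commands above; run_in_container is a hypothetical wrapper that reports a missing Docker CLI instead of failing obscurely.

```shell
#!/usr/bin/env bash
# run_in_container is a hypothetical wrapper for illustration.
# Assumption: the Ollama container was started with --name ollama.
run_in_container() {
  local model="${1:-llama3.1}"
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH" >&2
    return 127
  fi
  # -it attaches an interactive terminal to the chat session inside the container.
  docker exec -it ollama ollama run "$model"
}
```

For instance, run_in_container llama3.1:70b starts the 70B model inside the GPU-enabled container.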
Conclusion
Running AI models such as Meta's Llama 3.1 locally on your Mac or PC provides numerous benefits, including improved data privacy, greater customization, and cost savings. Following the steps in this guide, you can utilize advanced AI models and test different configurations to meet your requirements. Whether you are a developer, researcher, or AI enthusiast, having the ability to run complex models locally unlocks many opportunities.