How to run dozens of AI models on your Mac or PC - no third-party cloud needed

3 months ago 20
BOOK THIS SPACE FOR AD
ARTICLE AD
devdo3gettyimages-1493766948
cofotoisme/Getty Images

With the rapid advancements in artificial intelligence (AI), running sophisticated models like Meta's Llama 3.1 locally on personal computers is becoming increasingly popular. Running an LLM on your local PC or Mac provides a sandbox for experimentation and development without compromising data privacy and allows for more flexibility in model usage.

Also: Why the future must be BYO AI: Model lock-in deters users and stifles innovation

Here is a quick guide to help you set up and run Llama 3.1  -- as well as many other models such as Google Gemma2 -- on Mac, Linux, and Windows. I'll also discuss the benefits of privately hosted models.

Why develop and test against different open-source models? 

untitled

Llama 3.1 8b running on Ollama/Open WebUI

Jason Perlow/ZDNET

Developing and testing against various open source models you privately host and run offers several advantages over relying solely on publicly hosted large language models (LLMs) from providers like OpenAI, Microsoft CoPilot, Meta AI, and Google Gemini.

Data privacy: Publicly hosted LLMs require sending data over the internet, which can raise privacy and security concerns. Running models locally ensures that sensitive data remains on your own hardware.

Customization: Open-source models allow for greater customization. Developers can fine-tune models, adjust hyperparameters, and modify the architecture to suit specific use cases better.

Cost control: Cloud-based AI services can be costly, especially for large-scale applications. Hosting models locally can significantly reduce ongoing API usage and data transfer expenses.

Offline capability: Local models can be used without an internet connection, which is essential for applications requiring high availability or in areas with unreliable internet access.

Flexibility and experimentation: Hosting your own models enables you to experiment with different algorithms and configurations, leading to innovative solutions and a deeper understanding of AI technologies.

Freedom from usage policies: Running LLMs locally means the usage policies of companies like OpenAI, Microsoft, Meta, and Google do not restrict you. You can use whatever prompts you want and employ modified LLMs with lifted restrictions, trained on data that these services might restrict.

Also: The best AI chatbots: ChatGPT, Copilot, and worthy alternatives

Introduction to Ollama

Ollama is a versatile and MIT-licensed open-source platform designed to help developers and researchers easily run and manage machine learning models locally on their own hardware. It was developed by a team of AI enthusiasts and engineers who aim to provide tools that ensure data privacy, flexibility, and control over AI applications. Ollama supports various AI models, making it a valuable resource for those looking to explore and utilize AI technologies without relying on third-party cloud services.

Here are some example models that can be downloaded:

ModelParametersSizeDownload
Llama 3.18B4.7GBollama run llama3.1
Llama 3.170B40GBollama run llama3.1:70b
Llama 3.1405B231GBollama run llama3.1:405b
Phi 3 Mini3.8B2.3GBollama run phi3
Phi 3 Medium14B7.9GBollama run phi3:medium
Gemma 22B1.6GBollama run gemma2:2b
Gemma 29B5.5GBollama run gemma2
Gemma 227B16GBollama run gemma2:27b
Mistral7B4.1GBollama run mistral
Moondream 21.4B829MBollama run moondream
Neural Chat7B4.1GBollama run neural-chat
Starling7B4.1GBollama run starling-lm
Code Llama7B3.8GBollama run codellama
Llama 2 Uncensored7B3.8GBollama run llama2-uncensored
LLaVA7B4.5GBollama run llava
Solar10.7B6.1GBollama run solar

Per Ollama's GitHub page, you should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Our test systems

I tested Ollama using M1 Pro and M1 Ultra Macs with 32GB and 64GB of RAM, which are a few generations behind current  MacBook Pro models. Despite this, using CPU-only assistance, we successfully ran 8B-10B parameter models of Meta's Llama 3.1 and Google's Gemma2, as well as various specifically trained variants from Ollama's website, with better-than-acceptable performance. 

Also: I broke Meta's Llama 3.1 405B with one question (which GPT-4o gets right)

However, I experienced significant performance issues with the 70B parameter variant using these systems. I'm confident that more recent hardware can handle these models even more efficiently, especially with Linux PCs enabled by Nvidia and AMD GPUs.

Step-by-step setup

Download and install Ollama

Go to Ollama's download page and download the installer suitable for your operating system (MacOS, Linux, Windows).Follow the provided installation instructions for your specific operating system.

Load the 8B parameter Llama 3.1 Model

untitled-2

The Ollama command line interface with chat functionality.

Screenshot by Jason Perlow/ZDNETGo to the Llama 3.1 library page on Ollama and copy the command for loading the 8B Llama 3.1 model: ollama run llama3.1:8bOpen a terminal (MacOS, Linux) or Command Prompt/PowerShell (Windows), paste the above command, and hit <enter>.This command will start running Llama 3.1. In the terminal, you can then issue chat queries to the model to test its functionality.

Manage installed models

List models: Use the command ollama list to see all models installed on your system.Remove models: To remove a model, use the command ollama rm <model_name>. For example, to remove the 8B parameter Llama 3.1, you would use ollama rm llama3.1:8bAdd new models: To add a new model, browse the Ollama library and then use the appropriate ollama run <model_name> command to load it into your system.

Also: 3 ways Meta's Llama 3.1 is an advance for Gen AI

Adding a WebUI

Install Docker Desktop

Visit Docker's Get Started page and download Docker Desktop for your operating system (MacOS, Linux, Windows).Follow the installation instructions for your specific operating system, and start Docker after installation.

Install Open WebUI

Open a terminal (MacOS, Linux) or Command Prompt/PowerShell (Windows) and run the following command to install Open WebUI:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Access the Open WebUI

untitled-1

Open WebUI running on Docker Desktop

Screenshot by Jason Perlow/ZDNETOpen Docker Desktop and go to the dashboard.Find the Open WebUI container and click on the link under Port to open the WebUI in your browser.

Create and log in to your Open WebUI account

untitled-3

Selecting a model in Open WebUI

Screenshot by Jason Perlow/ZDNETIf you don't already have an Open WebUI account, create one.Log in to your account through the WebUI.

Integration with IDEs and APIs

Ollama can be integrated into various Integrated Development Environments (IDEs) using APIs, which enhances the development workflow by providing seamless interaction with AI models. One powerful tool for this integration is Continue, an open-source code assistant that leverages the Ollama API.

Also: If you want a career in AI, start with these 5 steps

Using Continue for IDE integration

Ensure that Ollama is running and accessible.Follow the Ollama Continue blog instructions to install Continue in your preferred IDE.With Continue and the Ollama API, you can directly leverage AI-powered features like code suggestions, completions, and debugging assistance within your development environment.

Scaling up with powerful GPUs

For more demanding applications, especially those requiring larger models like the 70B and 405B parameter Llama 3.1 models, running Ollama on a Linux-based system equipped with powerful GPUs is recommended. This setup can handle the computational load and provide faster response times, making it suitable for enterprise-level AI applications.

To use GPUs for running Ollama, follow these steps:

For NVIDIA GPUs:

Follow the NVIDIA CUDA documentation instructions to install CUDA and cuDNN on your system.After installing CUDA and cuDNN, ensure your environment is configured correctly, then run the following command:
ollama run llama3.1:70b --use-gpu

For AMD GPUs:

Follow the instructions on the ROCm documentation to install ROCm on your system.After installing ROCm, ensure your environment is configured correctly, then run the following command:
ollama run llama3.1:70b --use-gpu

These commands ensure that Ollama can utilize the available GPUs on your system, providing the necessary computational power for running large models. For more detailed instructions, refer to the Ollama GPU documentation.

Running Ollama in a Docker container

You can still leverage GPU support if you prefer running Ollama in a container.

Also: How can business leaders ready their organizations for AI? 4 keys to success

For NVIDIA GPUs with Docker

As per the previous section, install CUDA and cuDNN on your system. Then, follow the instructions in the NVIDIA Docker documentation to install the NVIDIA Container Engine on your system.Use the following command to run Ollama with NVIDIA GPU support in a Docker container:
docker run --gpus all -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama:/app/backend/data --name ollama --restart always ollama/ollama:latest

For AMD GPUs with Docker

Follow the instructions on the ROCm documentation to install ROCm on your system.Use the following command to run Ollama with ROCm support in a Docker container:
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

These commands ensure the Docker container can access all available GPUs on your system, providing the necessary computational power to run large models. For more information on using GPUs with Docker and Ollama, refer to the Docker page on using GPUs with Ollama.

Also: Will OpenAI's new AI detection tool put an end to student cheating?

Conclusion

Running AI models such as Meta's Llama 3.1 locally on your Mac or PC provides numerous benefits, including improved data privacy, greater customization, and cost savings. Following the steps in this guide, you can utilize advanced AI models and test different configurations to meet your requirements. Whether you are a developer, researcher, or AI enthusiast, having the ability to run complex models locally unlocks many opportunities.

Read Entire Article