
Running LLM Models on a Local Machine: Ollama, LlamaIndex and LangChain

#ai #genai #llm #langchain #llamaindex #ollama #aimodels

https://github.com/ollama/ollama

Ollama is a free application for locally running generative AI large language models. Currently it's available for macOS and Linux, with Windows support in preview. Additionally, on Windows you can easily use it within Windows Subsystem for Linux and Docker containers. The Ollama application allows you to pull the desired large language models (LLMs) locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your GPU, delivering fast performance on devices like a MacBook Pro or a PC equipped with a good GPU.

Specific models, such as the larger Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint; pre-quantized models are available for reference at https://ollama.com/library. Quantization involves converting the 32-bit floating-point numbers in a model to 4-bit integers, which significantly reduces memory requirements without substantially compromising model performance. One of the recommended starting points among the various quantization options is q4_0, which provides a good balance between memory savings and model quality. Familiarizing yourself with quantization can significantly enhance your experience with Ollama models. During your exploration of Ollama models, you may encounter tags beginning with "q" followed by a number (and sometimes a "K").

Each model in the Ollama library is accompanied by various tags that offer specific insights into its variants. These tags are denoted by the text following the colon in the model's name. Key points about tags: the primary tag is commonly known as the "latest" tag, although it may not always represent the most recent version of the model; instead, it indicates the most popular variation. When you don't specify a tag, Ollama automatically selects the model with the "latest" tag. Within the latest tag you can find details such as the model's size, the beginning of the sha256 digest, and the age of that particular model variation. The right side of the tag presents the command required to run that specific version.

Key features of Ollama:
Automatic hardware acceleration: Ollama automatically detects and utilizes the best available hardware resources on Windows systems, including NVIDIA GPUs or CPUs with modern instruction sets like AVX or AVX2. This optimizes performance and ensures efficient execution of AI models without manual configuration, saving time and resources.
No need for virtualization: Ollama eliminates the need for virtualization or the complex environment setups typically required for running different models in AI development. Its simple setup lets developers focus on their AI projects and lowers the entry barrier for individuals and organizations exploring AI technologies.
Access to the full Ollama model library: the platform grants unrestricted access to a comprehensive library of AI models. Users can experiment with and deploy various models without sourcing and configuring them independently, whether the interest lies in text analysis, image processing, or any other AI-driven domain.
Always-on Ollama API: Ollama's always-on API integrates seamlessly with projects, running in the background and ready to connect to powerful AI capabilities without additional setup, so Ollama's AI resources are readily available within the development workflow.

CLI commands:
ollama
ollama run model
ollama pull model
ollama create
ollama rm model
ollama cp model my-model
ollama list
ollama serve

LlamaIndex - a simple, flexible data framework for connecting custom data sources to large language models (LLMs).
LangChain - essentially a library of abstractions for Python and JavaScript, representing common steps and concepts necessary to work with language models.

Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker start ollama
docker stop ollama
docker exec -it ollama ollama
docker exec -it ollama ollama run model

Prompt: Cite 20 famous people

CURL:
curl http://localhost:11434/api/generate -d '{ "model": "phi", "prompt": "Cite 20 famous people", "stream": false}'
curl http://localhost:11434/api/generate \
  -d '{"prompt": "Cite 20 famous people", "model": "phi", "options": {"temperature": 0.75, "num_ctx": 3900}, "stream": false}'
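If you prefer calling the REST API from Python rather than curl, here is a minimal sketch of the same /api/generate call. It assumes the requests package is installed and an Ollama server listening on localhost:11434, and it simply mirrors the parameters from the curl examples above.

import requests

# Mirror of the curl example above: non-streaming generation against a local Ollama server.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi",
        "prompt": "Cite 20 famous people",
        "options": {"temperature": 0.75, "num_ctx": 3900},
        "stream": False,
    },
    timeout=300,  # local generation on a CPU-only machine can be slow
)
response.raise_for_status()
print(response.json()["response"])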

Tech Forum

4 days ago

Hi all, welcome to Tech Forum. In this video we will explore running LLM models on a local machine using Ollama, together with tools like LlamaIndex and LangChain. I will demonstrate two different approaches for running an LLM model on a Windows machine: through the Ollama Docker container and through the Ollama Windows installer, which is currently in preview.

Ollama is a free application for locally running generative AI large language models. Currently it's available for macOS and Linux, with Windows support in preview. Additionally, on Windows you can easily use it within Windows Subsystem for Linux and Docker containers. The Ollama application allows you to pull the desired large language models locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your CPU or GPU to run the LLM models and delivers better performance on devices with a good GPU. I am on a Windows machine with a CPU and 16 GB of RAM.

Specific models, such as the larger Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint. Pre-quantized models are available for your reference at ollama.com/library, where you can find various models including Gemma, Llama 2, Mistral, Phi and others. Quantization involves converting the 32-bit floating-point numbers in the model to 4-bit integers, which significantly reduces the memory requirements without substantially compromising model performance. One of the recommended starting points among the various quantization options is q4_0; this variant provides a good balance between memory savings and model quality. During your exploration of Ollama models you may encounter tags beginning with "q" followed by a number, and sometimes a "K". Let me open one of the models; I will go with Gemma. Click on Tags and you can see the latest tag, and if you scroll down you can see the various quantized variants such as q4_0, q4_1, q5_0 and q2_K.

Every model in the Ollama library comes with different tags that give you specific information about what it does. These tags are denoted by the text following the colon in the model's name. Here are some key points to understand about tags. The primary tag is commonly known as the latest tag, although it may not always represent the most recent version of the model; instead it indicates the most popular variation. When a tag is not specified, Ollama automatically selects the model with the latest tag. Within the latest tag you can find details such as the model size (here, latest is 5.2 GB), the beginning of the sha256 digest (you can see that value here), and the age of that particular model variation (this one says two days ago). The right side of the tag shows the command required to run that specific version; here it is "ollama run gemma" for the latest version. You can also find different tags such as 7b and 2b-text, along with the different quantized versions like q4_0. If you click into the latest tag you can find further details such as the model family, the parameter count, the quantization level (4-bit here) and when it was published.
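To put the model sizes shown on the tags page in perspective, here is some rough, illustrative arithmetic (not from the video): a 7-billion-parameter model stored as 32-bit floats needs on the order of 26 GB for its weights alone, while a 4-bit quantization such as q4_0 (which stores roughly 4.5 bits per weight once per-block scale factors are counted) needs only a few gigabytes, which is why 7B entries in the library are typically listed at around 4-5 GB.

# Illustrative arithmetic only: approximate weight storage for a 7B-parameter model
# at different precisions (ignores KV cache, activations and per-layer overhead).
params = 7_000_000_000

def size_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1024**3

print(f"fp32 : {size_gb(32):.1f} GB")   # ~26 GB
print(f"fp16 : {size_gb(16):.1f} GB")   # ~13 GB
print(f"q4_0 : {size_gb(4.5):.1f} GB")  # ~3.7 GB (q4_0 stores a bit over 4 bits per weight)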
Now, the key features of Ollama. The first one is automatic hardware acceleration: Ollama automatically finds and uses the best hardware on Windows, like GPUs or CPUs with modern instruction sets; this speeds up performance and makes running AI models effortless, saving you time and resources. The second one is no need for virtualization: Ollama eliminates the need for virtualization or the complex environment setups typically required for running different models in AI development. The third one is access to the full Ollama model library: the platform grants unrestricted access to a comprehensive library of AI models. The next one is the always-on Ollama API: Ollama's always-on API seamlessly integrates your project with powerful AI capabilities without additional setup.

Refer to the Ollama GitHub page for more details. You can download the macOS version, the Windows preview version or the Linux version from here, and the Ollama Docker image can also be used to run Ollama on your local machine. Refer to this page to understand more about the Ollama REST API. As we discussed earlier, Ollama exposes various API endpoints for interacting with LLM models. Some of the important APIs: "Generate a completion" is a POST to /api/generate; you can see the various parameters such as model, prompt and images, plus optional parameters like format, options, system and template, stream set to true or false, and keep_alive. Here you can see the way we invoke /api/generate, passing parameters such as the model (llama2) and the prompt, and we get back the response. If you scroll down you can see other APIs, such as "Generate a chat completion" at /api/chat: again a model name, plus messages; as part of each message we can send a role (system, user or assistant), the content and even images, along with optional parameters like format, options and template. There are other APIs as well: you can create a new model from a Modelfile, list the local models through /api/tags (it returns all the models deployed on your local system), see model information through /api/show, copy a model with /api/copy, delete a model from your local machine with /api/delete, pull a model to your local machine through /api/pull, and push a model to the Ollama library with /api/push (for that you need to register at ollama.ai). Another important one: you can generate embeddings through /api/embeddings by passing a model and a prompt, and it returns the embeddings.
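As a quick illustration of the chat endpoint described above, the sketch below posts a messages array with role and content fields to /api/chat. It assumes the requests package and a local Ollama server; the model name and messages are only examples.

import requests

# Non-streaming chat completion against the local Ollama server.
payload = {
    "model": "llama2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Cite 20 famous people"},
    ],
    "stream": False,
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()
print(r.json()["message"]["content"])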
Here is the list of Ollama CLI commands. "ollama pull <model>" pulls a model from the Ollama library to your local system. "ollama run <model>" runs the specific model on your local system; if the model is not yet available on your system, this command first pulls it and then runs it. "ollama rm <model>" removes a specific model from your local system. "ollama cp" copies a model to a different name. "ollama list" lists the models deployed on your local system. "ollama serve" starts the Ollama service on your system without running the desktop application. All the CLI commands internally invoke the APIs we discussed earlier.

The next one is the Ollama Python library, which helps you integrate Python projects with Ollama. As a first step you install the ollama package, then you can use various functions such as ollama.chat, where you pass the model name and the messages (with a role and content for each) and get the response back; there are also various options you can specify. There is also generate, where you specify a model and a prompt, as well as list, show, create, copy, delete, pull, push and embeddings. All of these functions internally call the Ollama APIs. The next one is the Ollama JavaScript library, which helps you integrate JavaScript projects with Ollama. Again, as a first step install the ollama npm package, then you can invoke the same kinds of functions, like chat, generate, create and list; these functions also internally invoke the Ollama APIs.
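Below is a small sketch of the Python client usage described above. It assumes the ollama package is installed (pip install ollama) and a local server is running; it is written against the 0.1.x ollama package that was current at the time, so the response shapes may differ slightly in newer releases, and the model name is only an example.

import ollama

# Chat-style call: a list of role/content messages, as described above.
reply = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Cite 20 famous people"}],
)
print(reply["message"]["content"])

# List the models available locally (wraps /api/tags).
for m in ollama.list()["models"]:
    print(m["name"])

# Generate embeddings for a prompt (wraps /api/embeddings).
emb = ollama.embeddings(model="llama2", prompt="Cite 20 famous people")
print(len(emb["embedding"]))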
Additionally, Ollama supports various tools and libraries, including LangChain and LlamaIndex; you can refer here for more details.

Let us now see how to run Ollama on a Windows system through a Docker container. As a first step, execute "docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama" to pull the Ollama Docker image to your system and run the container; this is mainly for CPU, and if your system has a GPU you follow a different command. Once this step is completed you can execute specific commands such as "docker exec -it ollama ollama run llama2", which runs the Llama 2 model locally. I won't execute the run command again because I already executed it and the container already exists, so I'm going directly to start the container with "docker start ollama"; you can also stop it with "docker stop ollama". The container is now started. Now I am going to execute "docker exec -it ollama ollama", which displays all the available commands we already discussed.

Let me run a specific model now: "ollama run phi". I'm going with this model; I already ran it earlier, so it won't download again, but the first time you run it the model is downloaded first and then executed. Okay, now I can send a message; let me copy the prompt. This may take some time depending on your system capacity. Now you can see the model's response to your prompt, and you can send different prompts to the model. If you go to the Docker container logs you can see the API request, a POST to the /api/chat endpoint. To verify the server is running you can go to localhost on the same port we referred to earlier, 11434; it displays "Ollama is running". In the interactive session you can type /? to see the available commands, like /set, which sets session variables used while invoking the LLM, and /show, which shows the model details: if I type "/show info" you can see the family (phi2), the parameter size (3B) and the quantization level (q4_0). You can also load or save a model, and typing /bye ends the session.

Let us now see how to use the Python library to perform LLM operations. As we discussed earlier, as a first step install the ollama package, then import it. Here we are executing the generate command, ollama.generate, with the model name phi and the prompt "Cite 20 famous people", and then printing the response from the result; you can also execute other commands like chat. Let me go and execute this. Now, if I go here, you can see the model response.
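For reference, here is a minimal sketch of the script demonstrated here, reconstructed from the steps just described; the exact file is not shown on screen, so treat this as an approximation.

import ollama

# Generate a completion from the locally running phi model and print it.
result = ollama.generate(model="phi", prompt="Cite 20 famous people")
print(result["response"])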
Let us now see how to use the Ollama JavaScript library to invoke LLM operations. Ensure the ollama npm package is installed with "npm install ollama", then import ollama from 'ollama'. Then there is some chat configuration: the model is phi and the prompt is "Cite 20 famous people". We are invoking the generate method here, ollama.generate, passing the model name and the prompt, and then simply printing the response. Let me now go and execute this; you can see the response from the model.

Let us now see how to invoke the REST API directly. Here is curl with the URL http://localhost:11434/api/generate (you can use the different APIs we discussed), passing the data: model phi, the same prompt, and stream set to false. You can also pass options if required; in the second example you can see options for temperature and maximum context size, again with stream set to false. Let me now go and execute this curl command. That's done; now you can see the LLM response.

Let me now install the Ollama Windows preview package. You can download the installer, OllamaSetup.exe, from here. Once it is installed you can execute "ollama serve" to start the Ollama server. Now the server is started, and you can go to another command prompt and execute "ollama run" with any model you need; you can also execute any of the other commands we discussed. Now type any message; I'm going to type the same prompt again. Whatever operations we executed earlier through the Docker container you can execute through this Windows-installed server as well. Now you can see the LLM response.

Let us now see how to use some of the other frameworks, LangChain and LlamaIndex. LangChain simplifies working with language models in Python and JavaScript by providing a library of common steps and concepts. I'm using LangChain along with Ollama to implement a RAG pipeline that embeds my website content into a vector store and runs the LLM locally; the LLM model returns the response based on the user query and the context stored in the vector store. To execute this Python script you should install some packages, langchain and langchain-community, and import the required packages from langchain and langchain_community. I am using WebBaseLoader to load the content from a remote web page, then a text splitter and the Ollama embedding model (model set to llama2), and then a Chroma vector store to vectorize the documents and store them. Then I'm using the LangChain RetrievalQA chain with Ollama as the model and the vector store as the retriever, and finally executing the query; here I'm asking "what is connection timeout agent", because this blog post is about the connection timeout in Apache Sling. Then you can print the response from the LLM model result. Let me now execute the Python script, ollama_Langchain.py. This will take time because a lot needs to happen: the embedding, then the vector store, then the model needs to run. This error says the model llama2 is not on my local machine, so we should pull that model; for that, execute "ollama pull llama2". Now the model is pulled to the local machine and you can run the Python script again. This will take time depending on your system capacity; I already executed it, so let me show the output. As you can see here, the LLM provides a response based on the user prompt and the context from the blog content.
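The exact script is not reproduced on screen, so here is a hedged sketch of a comparable RAG pipeline using the pieces mentioned: WebBaseLoader, a text splitter, OllamaEmbeddings with llama2, a Chroma vector store and a RetrievalQA chain. The imports follow the langchain/langchain-community layout current at the time and may differ in newer releases; the URL, chunk sizes and query are placeholders.

# Sketch only (pip install langchain langchain-community chromadb beautifulsoup4).
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the blog post content from a remote web page (placeholder URL).
docs = WebBaseLoader("https://example.com/blog-post").load()

# Split the page into chunks (example sizes) and embed them with llama2 served by Ollama.
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OllamaEmbeddings(model="llama2"))

# Retrieval QA chain: the local llama2 model answers using the retrieved chunks as context.
qa = RetrievalQA.from_chain_type(llm=Ollama(model="llama2"), retriever=vectorstore.as_retriever())
print(qa.invoke({"query": "What is connection timeout?"})["result"])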
Let us now see how to use LlamaIndex. LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. I have a simple one-page document that talks about Adobe Experience Manager, and I'm going to use this document as the context for the response from the LLM model. This is the Python script. As a first step, install all the required packages: llama-index, llama-index-llms-ollama and llama-index-embeddings-huggingface; these are the packages I installed. Then import the required packages. Here I am using the Llama 2 model through Ollama (model set to llama2), and you can specify the request timeout; I'm specifying a higher value because my system capacity is not that great, so it will obviously take time, and you can set the maximum time based on your system. Now I'm reading the PDF document with SimpleDirectoryReader. I'm setting Settings.llm to the LLM defined here, and for the embedding model I'm using the Hugging Face model bge-small-en-v1.5, with a chunk size of 512 and a chunk overlap of 64; you can change these based on your use case. Then the index is built with VectorStoreIndex.from_documents using the documents I read from that folder, and the query engine comes from index.as_query_engine. I'm sending the query "what is AEM"; the answer to this is available in my PDF document. Then the response is query_engine.query, which executes the query, and we print the response. The execution will take time, so I will show you the response received from my earlier run of python ollama_llamaindex.py. This is the response, and it is based on the data provided by the PDF document: it explains what Adobe Experience Manager is, with the details.
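Again as a hedged sketch rather than the exact script, the following reconstructs the steps described: Llama 2 served by Ollama as the LLM, the Hugging Face bge-small-en-v1.5 embedding model, chunk size 512 with overlap 64, SimpleDirectoryReader over a local folder, and a VectorStoreIndex queried through a query engine. The imports follow the llama-index 0.10-style package split named in the video, and the folder name is a placeholder.

# Sketch only (pip install llama-index llama-index-llms-ollama llama-index-embeddings-huggingface).
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Local llama2 via Ollama; a generous timeout because CPU-only inference is slow.
Settings.llm = Ollama(model="llama2", request_timeout=300.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# Read the one-page PDF from a local folder (placeholder path) and index it.
documents = SimpleDirectoryReader("documents").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question that is answered by the PDF content.
query_engine = index.as_query_engine()
print(query_engine.query("What is AEM?"))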
Let us now see how to use Open WebUI, a user-friendly UI for executing LLM operations on Ollama models. You can run it using a Docker container; execute docker run with the appropriate port mapping to launch the UI. I already installed the container, so I'm just going to start it. Okay, the container is now running; make sure Ollama is running as well. Now access the UI at localhost:3000. You need to sign up for an account, which is a local administrator account; I already completed this, so I'm going to sign in. Now I'm signed in. If you see here, "Select a model" lists the different models available on your local system; if you need an additional model you can run "ollama run <model name>" as we did earlier. Now if I select phi, I can go and execute the prompt, and you can see the LLM response. Additionally, you can enable some of the settings: if you click here you can see General, where you can specify the system prompt, notifications and advanced parameters; Connections, where you can specify the Ollama API URL and even an OpenAI API key and base URL; Models, where you can perform model operations such as entering a model name or tag to download it, or selecting a model to delete; Interface, where you can make some changes; and Audio, Images and Chats, where you can specify further configurations, along with the local account settings discussed earlier.

As discussed, you can use various tools and frameworks along with Ollama. Ollama is a fantastic application that facilitates the quick execution of LLM models on your local machine. It offers a library of quantized models that can be run directly on your local machine. Additionally, Ollama provides APIs that can be seamlessly integrated with various applications. Moreover, it supports interaction through different libraries for Python and JavaScript, as well as integration with various tools and frameworks like LangChain and LlamaIndex. That's all for today; thank you all for watching the video, and see you in the next video.
