
Running LLM Models on a Local Machine: Ollama, LlamaIndex and LangChain

#ai #genai #llm #langchain #llamaindex #ollama #aimodels

https://github.com/ollama/ollama

Ollama is a free application for locally running generative AI large language models. Currently it's available for macOS and Linux, with Windows support in preview. Additionally, on Windows you can easily use it within Windows Subsystem for Linux and Docker containers. The Ollama application allows you to pull the desired large language models (LLMs) locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your GPU, delivering fast performance on devices like a MacBook Pro or a PC equipped with a good GPU.

Specific models, such as the larger Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint; pre-quantized models are available for reference at https://ollama.com/library. Quantization involves converting the 32-bit floating-point numbers in a model to 4-bit integers, which significantly reduces memory requirements without substantially compromising model performance. One of the recommended starting points among the various quantization options is q4_0, which provides a good balance between memory savings and model quality. Familiarizing yourself with quantization can significantly enhance your experience with Ollama models. During your exploration of Ollama models, you may encounter tags beginning with "q" followed by a number (and sometimes a "K").

Each model in the Ollama library is accompanied by various tags that offer specific insights into its variants. These tags are denoted by the text following the colon in the model's name. Key points about tags: the primary tag is commonly known as the "latest" tag, although it may not always represent the most recent version of the model; instead, it indicates the most popular variation. When you don't specify a tag, Ollama automatically selects the model with the "latest" tag. Within the latest tag you can find details such as the model's size, the beginning of the sha256 digest, and the age of that particular model variation. The right side of the tag presents the command required to run that specific version.

Key features of Ollama:
Automatic hardware acceleration: Ollama automatically detects and utilizes the best available hardware resources on Windows systems, including NVIDIA GPUs or CPUs with modern instruction sets like AVX or AVX2. This optimizes performance and ensures efficient execution of AI models without manual configuration, saving time and resources.
No need for virtualization: Ollama eliminates the need for virtualization or the complex environment setups typically required for running different models in AI development. Its simple setup lets developers focus on their AI projects and lowers the entry barrier for individuals and organizations exploring AI technologies.
Access to the full Ollama model library: the platform grants unrestricted access to a comprehensive library of AI models. Users can experiment with and deploy various models without sourcing and configuring them independently, whether the interest lies in text analysis, image processing, or any other AI-driven domain.
Always-on Ollama API: Ollama's always-on API integrates seamlessly with projects, running in the background and ready to connect to powerful AI capabilities without additional setup, so Ollama's AI resources are readily available within the development workflow.

CLI commands:
ollama
ollama run model
ollama pull model
ollama create
ollama rm model
ollama cp model my-model
ollama list
ollama serve

LlamaIndex - a simple, flexible data framework for connecting custom data sources to large language models (LLMs).
LangChain - essentially a library of abstractions for Python and JavaScript, representing common steps and concepts necessary to work with language models.

Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker start ollama
docker stop ollama
docker exec -it ollama ollama
docker exec -it ollama ollama run model

Prompt: Cite 20 famous people

CURL:
curl http://localhost:11434/api/generate -d '{ "model": "phi", "prompt": "Cite 20 famous people", "stream": false}'
curl http://localhost:11434/api/generate \
  -d '{"prompt": "Cite 20 famous people", "model": "phi", "options": {"temperature": 0.75, "num_ctx": 3900}, "stream": false}'
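If you prefer calling the REST API from Python rather than curl, here is a minimal sketch of the same /api/generate call. It assumes the requests package is installed and an Ollama server listening on localhost:11434, and it simply mirrors the parameters from the curl examples above.

import requests

# Mirror of the curl example above: non-streaming generation against a local Ollama server.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi",
        "prompt": "Cite 20 famous people",
        "options": {"temperature": 0.75, "num_ctx": 3900},
        "stream": False,
    },
    timeout=300,  # local generation on a CPU-only machine can be slow
)
response.raise_for_status()
print(response.json()["response"])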

Tech Forum

4 days ago

Hi all, welcome to Tech Forum. In this video we will explore running LLM models on a local machine using Ollama, together with tools like LlamaIndex and LangChain. I will demonstrate two different approaches for running an LLM model on a Windows machine: through the Ollama Docker container and through the Ollama Windows installer, which is currently in preview.

Ollama is a free application for locally running generative AI large language models. Currently it's available for macOS and Linux, with Windows support in preview. Additionally, on Windows you can easily use it within Windows Subsystem for Linux and Docker containers. The Ollama application allows you to pull the desired large language models locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your CPU or GPU to run the LLM models and delivers better performance on devices with a good GPU. I am on a Windows machine with a CPU and 16 GB of RAM.

Specific models, such as the larger Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint. Pre-quantized models are available for your reference at ollama.com/library, where you can find various models including Gemma, Llama 2, Mistral, Phi and others. Quantization involves converting the 32-bit floating-point numbers in the model to 4-bit integers, which significantly reduces the memory requirements without substantially compromising model performance. One of the recommended starting points among the various quantization options is q4_0; this variant provides a good balance between memory savings and model quality. During your exploration of Ollama models you may encounter tags beginning with "q" followed by a number, and sometimes a "K". Let me open one of the models; I will go with Gemma. Click on Tags and you can see the latest tag, and if you scroll down you can see the various quantized variants such as q4_0, q4_1, q5_0 and q2_K.

Every model in the Ollama library comes with different tags that give you specific information about what it does. These tags are denoted by the text following the colon in the model's name. Here are some key points to understand about tags. The primary tag is commonly known as the latest tag, although it may not always represent the most recent version of the model; instead it indicates the most popular variation. When a tag is not specified, Ollama automatically selects the model with the latest tag. Within the latest tag you can find details such as the model size (here, latest is 5.2 GB), the beginning of the sha256 digest (you can see that value here), and the age of that particular model variation (this one says two days ago). The right side of the tag shows the command required to run that specific version; here it is "ollama run gemma" for the latest version. You can also find different tags such as 7b and 2b-text, along with the different quantized versions like q4_0. If you click into the latest tag you can find further details such as the model family, the parameter count, the quantization level (4-bit here) and when it was published.
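To put the model sizes shown on the tags page in perspective, here is some rough, illustrative arithmetic (not from the video): a 7-billion-parameter model stored as 32-bit floats needs on the order of 26 GB for its weights alone, while a 4-bit quantization such as q4_0 (which stores roughly 4.5 bits per weight once per-block scale factors are counted) needs only a few gigabytes, which is why 7B entries in the library are typically listed at around 4-5 GB.

# Illustrative arithmetic only: approximate weight storage for a 7B-parameter model
# at different precisions (ignores KV cache, activations and per-layer overhead).
params = 7_000_000_000

def size_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1024**3

print(f"fp32 : {size_gb(32):.1f} GB")   # ~26 GB
print(f"fp16 : {size_gb(16):.1f} GB")   # ~13 GB
print(f"q4_0 : {size_gb(4.5):.1f} GB")  # ~3.7 GB (q4_0 stores a bit over 4 bits per weight)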
Now, the key features of Ollama. The first one is automatic hardware acceleration: Ollama automatically finds and uses the best hardware on Windows, like GPUs or CPUs with modern instruction sets; this speeds up performance and makes running AI models effortless, saving you time and resources. The second one is no need for virtualization: Ollama eliminates the need for virtualization or the complex environment setups typically required for running different models in AI development. The third one is access to the full Ollama model library: the platform grants unrestricted access to a comprehensive library of AI models. The next one is the always-on Ollama API: Ollama's always-on API seamlessly integrates your project with powerful AI capabilities without additional setup.

Refer to the Ollama GitHub page for more details. You can download the macOS version, the Windows preview version or the Linux version from here, and the Ollama Docker image can also be used to run Ollama on your local machine. Refer to this page to understand more about the Ollama REST API. As we discussed earlier, Ollama exposes various API endpoints for interacting with LLM models. Some of the important APIs: "Generate a completion" is a POST to /api/generate; you can see the various parameters such as model, prompt and images, plus optional parameters like format, options, system and template, stream set to true or false, and keep_alive. Here you can see the way we invoke /api/generate, passing parameters such as the model (llama2) and the prompt, and we get back the response. If you scroll down you can see other APIs, such as "Generate a chat completion" at /api/chat: again a model name, plus messages; as part of each message we can send a role (system, user or assistant), the content and even images, along with optional parameters like format, options and template. There are other APIs as well: you can create a new model from a Modelfile, list the local models through /api/tags (it returns all the models deployed on your local system), see model information through /api/show, copy a model with /api/copy, delete a model from your local machine with /api/delete, pull a model to your local machine through /api/pull, and push a model to the Ollama library with /api/push (for that you need to register at ollama.ai). Another important one: you can generate embeddings through /api/embeddings by passing a model and a prompt, and it returns the embeddings.
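As a quick illustration of the chat endpoint described above, the sketch below posts a messages array with role and content fields to /api/chat. It assumes the requests package and a local Ollama server; the model name and messages are only examples.

import requests

# Non-streaming chat completion against the local Ollama server.
payload = {
    "model": "llama2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Cite 20 famous people"},
    ],
    "stream": False,
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()
print(r.json()["message"]["content"])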
Here is the list of Ollama CLI commands. "ollama pull <model>" pulls a model from the Ollama library to your local system. "ollama run <model>" runs the specific model on your local system; if the model is not yet available on your system, this command first pulls it and then runs it. "ollama rm <model>" removes a specific model from your local system. "ollama cp" copies a model to a different name. "ollama list" lists the models deployed on your local system. "ollama serve" starts the Ollama service on your system without running the desktop application. All the CLI commands internally invoke the APIs we discussed earlier.

The next one is the Ollama Python library, which helps you integrate Python projects with Ollama. As a first step you install the ollama package, then you can use various functions such as ollama.chat, where you pass the model name and the messages (with a role and content for each) and get the response back; there are also various options you can specify. There is also generate, where you specify a model and a prompt, as well as list, show, create, copy, delete, pull, push and embeddings. All of these functions internally call the Ollama APIs. The next one is the Ollama JavaScript library, which helps you integrate JavaScript projects with Ollama. Again, as a first step install the ollama npm package, then you can invoke the same kinds of functions, like chat, generate, create and list; these functions also internally invoke the Ollama APIs.
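Below is a small sketch of the Python client usage described above. It assumes the ollama package is installed (pip install ollama) and a local server is running; it is written against the 0.1.x ollama package that was current at the time, so the response shapes may differ slightly in newer releases, and the model name is only an example.

import ollama

# Chat-style call: a list of role/content messages, as described above.
reply = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Cite 20 famous people"}],
)
print(reply["message"]["content"])

# List the models available locally (wraps /api/tags).
for m in ollama.list()["models"]:
    print(m["name"])

# Generate embeddings for a prompt (wraps /api/embeddings).
emb = ollama.embeddings(model="llama2", prompt="Cite 20 famous people")
print(len(emb["embedding"]))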
Additionally, Ollama supports various tools and libraries, including LangChain and LlamaIndex; you can refer here for more details.

Let us now see how to run Ollama on a Windows system through a Docker container. As a first step, execute "docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama" to pull the Ollama Docker image to your system and run the container; this is mainly for CPU, and if your system has a GPU you follow a different command. Once this step is completed you can execute specific commands such as "docker exec -it ollama ollama run llama2", which runs the Llama 2 model locally. I won't execute the run command again because I already executed it and the container already exists, so I'm going directly to start the container with "docker start ollama"; you can also stop it with "docker stop ollama". The container is now started. Now I am going to execute "docker exec -it ollama ollama", which displays all the available commands we already discussed.

Let me run a specific model now: "ollama run phi". I'm going with this model; I already ran it earlier, so it won't download again, but the first time you run it the model is downloaded first and then executed. Okay, now I can send a message; let me copy the prompt. This may take some time depending on your system capacity. Now you can see the model's response to your prompt, and you can send different prompts to the model. If you go to the Docker container logs you can see the API request, a POST to the /api/chat endpoint. To verify the server is running you can go to localhost on the same port we referred to earlier, 11434; it displays "Ollama is running". In the interactive session you can type /? to see the available commands, like /set, which sets session variables used while invoking the LLM, and /show, which shows the model details: if I type "/show info" you can see the family (phi2), the parameter size (3B) and the quantization level (q4_0). You can also load or save a model, and typing /bye ends the session.

Let us now see how to use the Python library to perform LLM operations. As we discussed earlier, as a first step install the ollama package, then import it. Here we are executing the generate command, ollama.generate, with the model name phi and the prompt "Cite 20 famous people", and then printing the response from the result; you can also execute other commands like chat. Let me go and execute this. Now, if I go here, you can see the model response.
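For reference, here is a minimal sketch of the script demonstrated here, reconstructed from the steps just described; the exact file is not shown on screen, so treat this as an approximation.

import ollama

# Generate a completion from the locally running phi model and print it.
result = ollama.generate(model="phi", prompt="Cite 20 famous people")
print(result["response"])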
Let us now see how to use the Ollama JavaScript library to invoke LLM operations. Ensure the ollama npm package is installed with "npm install ollama", then import ollama from 'ollama'. Then there is some chat configuration: the model is phi and the prompt is "Cite 20 famous people". We are invoking the generate method here, ollama.generate, passing the model name and the prompt, and then simply printing the response. Let me now go and execute this; you can see the response from the model.

Let us now see how to invoke the REST API directly. Here is curl with the URL http://localhost:11434/api/generate (you can use the different APIs we discussed), passing the data: model phi, the same prompt, and stream set to false. You can also pass options if required; in the second example you can see options for temperature and maximum context size, again with stream set to false. Let me now go and execute this curl command. That's done; now you can see the LLM response.

Let me now install the Ollama Windows preview package. You can download the installer, OllamaSetup.exe, from here. Once it is installed you can execute "ollama serve" to start the Ollama server. Now the server is started, and you can go to another command prompt and execute "ollama run" with any model you need; you can also execute any of the other commands we discussed. Now type any message; I'm going to type the same prompt again. Whatever operations we executed earlier through the Docker container you can execute through this Windows-installed server as well. Now you can see the LLM response.

Let us now see how to use some of the other frameworks, LangChain and LlamaIndex. LangChain simplifies working with language models in Python and JavaScript by providing a library of common steps and concepts. I'm using LangChain along with Ollama to implement a RAG pipeline that embeds my website content into a vector store and runs the LLM locally; the LLM model returns the response based on the user query and the context stored in the vector store. To execute this Python script you should install some packages, langchain and langchain-community, and import the required packages from langchain and langchain_community. I am using WebBaseLoader to load the content from a remote web page, then a text splitter and the Ollama embedding model (model set to llama2), and then a Chroma vector store to vectorize the documents and store them. Then I'm using the LangChain RetrievalQA chain with Ollama as the model and the vector store as the retriever, and finally executing the query; here I'm asking "what is connection timeout agent", because this blog post is about the connection timeout in Apache Sling. Then you can print the response from the LLM model result. Let me now execute the Python script, ollama_Langchain.py. This will take time because a lot needs to happen: the embedding, then the vector store, then the model needs to run. This error says the model llama2 is not on my local machine, so we should pull that model; for that, execute "ollama pull llama2". Now the model is pulled to the local machine and you can run the Python script again. This will take time depending on your system capacity; I already executed it, so let me show the output. As you can see here, the LLM provides a response based on the user prompt and the context from the blog content.
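The exact script is not reproduced on screen, so here is a hedged sketch of a comparable RAG pipeline using the pieces mentioned: WebBaseLoader, a text splitter, OllamaEmbeddings with llama2, a Chroma vector store and a RetrievalQA chain. The imports follow the langchain/langchain-community layout current at the time and may differ in newer releases; the URL, chunk sizes and query are placeholders.

# Sketch only (pip install langchain langchain-community chromadb beautifulsoup4).
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the blog post content from a remote web page (placeholder URL).
docs = WebBaseLoader("https://example.com/blog-post").load()

# Split the page into chunks (example sizes) and embed them with llama2 served by Ollama.
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OllamaEmbeddings(model="llama2"))

# Retrieval QA chain: the local llama2 model answers using the retrieved chunks as context.
qa = RetrievalQA.from_chain_type(llm=Ollama(model="llama2"), retriever=vectorstore.as_retriever())
print(qa.invoke({"query": "What is connection timeout?"})["result"])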
Let us now see how to use LlamaIndex. LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. I have a simple one-page document that talks about Adobe Experience Manager, and I'm going to use this document as the context for the response from the LLM model. This is the Python script. As a first step, install all the required packages: llama-index, llama-index-llms-ollama and llama-index-embeddings-huggingface; these are the packages I installed. Then import the required packages. Here I am using the Llama 2 model through Ollama (model set to llama2), and you can specify the request timeout; I'm specifying a higher value because my system capacity is not that great, so it will obviously take time, and you can set the maximum time based on your system. Now I'm reading the PDF document with SimpleDirectoryReader. I'm setting Settings.llm to the LLM defined here, and for the embedding model I'm using the Hugging Face model bge-small-en-v1.5, with a chunk size of 512 and a chunk overlap of 64; you can change these based on your use case. Then the index is built with VectorStoreIndex.from_documents using the documents I read from that folder, and the query engine comes from index.as_query_engine. I'm sending the query "what is AEM"; the answer to this is available in my PDF document. Then the response is query_engine.query, which executes the query, and we print the response. The execution will take time, so I will show you the response received from my earlier run of python ollama_llamaindex.py. This is the response, and it is based on the data provided by the PDF document: it explains what Adobe Experience Manager is, with the details.
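Again as a hedged sketch rather than the exact script, the following reconstructs the steps described: Llama 2 served by Ollama as the LLM, the Hugging Face bge-small-en-v1.5 embedding model, chunk size 512 with overlap 64, SimpleDirectoryReader over a local folder, and a VectorStoreIndex queried through a query engine. The imports follow the llama-index 0.10-style package split named in the video, and the folder name is a placeholder.

# Sketch only (pip install llama-index llama-index-llms-ollama llama-index-embeddings-huggingface).
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Local llama2 via Ollama; a generous timeout because CPU-only inference is slow.
Settings.llm = Ollama(model="llama2", request_timeout=300.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# Read the one-page PDF from a local folder (placeholder path) and index it.
documents = SimpleDirectoryReader("documents").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question that is answered by the PDF content.
query_engine = index.as_query_engine()
print(query_engine.query("What is AEM?"))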
Let us now see how to use Open WebUI, a user-friendly UI for executing LLM operations on Ollama models. You can run it using a Docker container; execute docker run with the appropriate port mapping to launch the UI. I already installed the container, so I'm just going to start it. Okay, the container is now running; make sure Ollama is running as well. Now access the UI at localhost:3000. You need to sign up for an account, which is a local administrator account; I already completed this, so I'm going to sign in. Now I'm signed in. If you see here, "Select a model" lists the different models available on your local system; if you need an additional model you can run "ollama run <model name>" as we did earlier. Now if I select phi, I can go and execute the prompt, and you can see the LLM response. Additionally, you can enable some of the settings: if you click here you can see General, where you can specify the system prompt, notifications and advanced parameters; Connections, where you can specify the Ollama API URL and even an OpenAI API key and base URL; Models, where you can perform model operations such as entering a model name or tag to download it, or selecting a model to delete; Interface, where you can make some changes; and Audio, Images and Chats, where you can specify further configurations, along with the local account settings discussed earlier.

As discussed, you can use various tools and frameworks along with Ollama. Ollama is a fantastic application that facilitates the quick execution of LLM models on your local machine. It offers a library of quantized models that can be run directly on your local machine. Additionally, Ollama provides APIs that can be seamlessly integrated with various applications. Moreover, it supports interaction through different libraries for Python and JavaScript, as well as integration with various tools and frameworks like LangChain and LlamaIndex. That's all for today; thank you all for watching the video, and see you in the next video.
