#ai #genai #llm #langchain #llamaindex #ollama #aimodels
https://github.com/ollama/ollama
Ollama is a free application for locally running generative AI large language models (LLMs). Currently, it's available for macOS and Linux, with Windows support in preview. Additionally, on Windows, you can easily utilize it within Windows Subsystem for Linux and Docker containers.
The Ollama application allows you to pull the desired Large Language Models (LLMs) locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your CPU or GPU, delivering the best performance on devices such as a MacBook Pro or a PC equipped with a capable GPU.
Specific models, such as the extensive Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint.
Pre-quantized models are available for reference at https://ollama.com/library. Quantization involves converting the model's 32-bit floating-point weights to lower-precision values such as 4-bit integers. Although it may seem like magic, quantization significantly reduces memory requirements without substantially compromising model performance.
One of the recommended starting points among various quantization options is q4_0. This variant provides an optimal balance between memory savings and model performance.
Quantization ensures efficient memory usage while maintaining the effectiveness of the model. Familiarizing yourself with and leveraging quantization can significantly enhance your experience with Ollama models.
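The memory savings are easy to estimate with back-of-envelope arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and ignores the small per-block scale factors that real q4_0 files store alongside the 4-bit weights:

```python
# Rough estimate of weight storage before and after quantization.
# Assumption: a hypothetical 7B-parameter model; real quantized files
# add a small overhead for per-block scale factors.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9
fp32 = weight_memory_gb(params, 32)  # 32-bit floats: 4 bytes per weight
q4 = weight_memory_gb(params, 4)     # 4-bit integers: 0.5 bytes per weight

print(f"fp32: {fp32:.1f} GB, q4: {q4:.1f} GB ({fp32 / q4:.0f}x smaller)")
# fp32: 28.0 GB, q4: 3.5 GB (8x smaller)
```

This is why a model that would not fit in 16 GB of RAM as fp32 can run comfortably once quantized to q4_0.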
During your exploration of Ollama models, you may encounter tags beginning with "q" followed by a number (and sometimes a "K"), such as q4_0, q4_1, q5_0, or q2_K; these indicate the quantization scheme used by that variant.
Each model in the Ollama Library is accompanied by various tags that offer specific insights into its functionality. These tags are denoted by the text following the colon in the model's name. Here are some key points to understand about tags:
The primary tag is commonly known as the "latest" tag, although it may not always represent the most recent version of the model. Instead, it indicates the most popular variation.
When you don't specify a tag, Ollama automatically selects the model with the "latest" tag.
Within the latest tag, you can find details such as the model's size, the beginning of the sha256 digest, and the age of that particular model variation.
The right side of the tag presents the command required to run that specific version.
Key features of Ollama:
Automatic Hardware Acceleration: Ollama automatically detects and utilizes the best available hardware resources on Windows systems, including NVIDIA GPUs or CPUs with modern instruction sets like AVX or AVX2. This feature optimizes performance, ensuring efficient execution of AI models without the need for manual configuration. It saves time and resources, making projects run swiftly.
No Need for Virtualization: Ollama eliminates the need for virtualization or complex environment setups typically required for running different models in AI development. Its seamless setup process allows developers to focus on their AI projects without worrying about setup intricacies. This simplicity lowers the entry barrier for individuals and organizations exploring AI technologies.
Access to the Full Ollama Model Library: The platform grants unrestricted access to a comprehensive library of AI models. Users can experiment with and deploy various models without the hassle of sourcing and configuring them independently. Whether the interest lies in text analysis, image processing, or any other AI-driven domain, Ollama's library meets diverse needs.
Always-On Ollama API: Ollama's always-on API seamlessly integrates with projects, running in the background and ready to connect to powerful AI capabilities without additional setup. This feature ensures that Ollama's AI resources are readily available, enhancing productivity and blending seamlessly into the development workflow.
CLI Commands:
ollama - list all available commands
ollama run <model> - run a model (pulls it first if not already present)
ollama pull <model> - pull a model from the Ollama library to your local system
ollama create - create a model from a Modelfile
ollama rm <model> - remove a model from your local system
ollama cp <model> <my-model> - copy a model to a different name
ollama list - list the models deployed on your local system
ollama serve - start the Ollama service without running the desktop application
llamaindex - LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs)
Langchain - LangChain is a library of abstractions for Python and JavaScript, representing common steps and concepts necessary to work with language models
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker start ollama
docker stop ollama
docker exec -it ollama ollama
docker exec -it ollama ollama run model
Prompt: Cite 20 famous people
CURL:
curl http://localhost:11434/api/generate -d '{ "model": "phi", "prompt":"Cite 20 famous people", "stream": false}'
curl http://localhost:11434/api/generate \
-d '{"prompt": "Cite 20 famous people", "model": "phi", "options": {"temperature": 0.75, "num_ctx": 3900}, "stream": false}'
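The same calls can be made from Python with only the standard library. A minimal sketch, assuming an Ollama server on localhost:11434 with the phi model already pulled (the endpoint, parameters, and options mirror the curl commands above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "phi", **options) -> dict:
    """Assemble the JSON body for /api/generate; options holds
    optional parameters such as temperature or num_ctx."""
    return {"model": model, "prompt": prompt, "stream": False,
            "options": options}

def generate(prompt: str, model: str = "phi", **options) -> str:
    """POST to /api/generate and return the model's text response."""
    body = json.dumps(build_payload(prompt, model, **options)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Cite 20 famous people", temperature=0.75, num_ctx=3900))
```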
Hi all, welcome to Tech Forum. In this video we will explore running LLM models on a local machine using Ollama and tools like LlamaIndex and LangChain. I will demonstrate two different approaches for running an LLM model on a Windows machine: through the Ollama Docker container, and through the Ollama Windows installable, which is currently in preview.

Ollama is a free application for locally running generative AI large language models. Currently it's available for macOS and Linux, with Windows support in preview; additionally, on Windows you can easily utilize it within Windows Subsystem for Linux and Docker containers. The Ollama application allows you to pull the desired large language models locally for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your CPU or GPU to run the LLM models, and provides better performance on devices with a good GPU. I am on a Windows machine with a CPU and 16 GB RAM.

Specific models, such as the extensive Mistral models, require ample resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint. Pre-quantized models are available for your reference in the ollama.com library; you can find various models there, including Gemma, Llama 2, Mistral, and Phi. Quantization involves converting the 32-bit floating-point numbers in the model to 4-bit integers, which significantly reduces the memory requirements without substantially compromising model performance. One of the recommended starting points among the various quantization options is q4_0; this variant provides an optimal balance between memory savings and model performance. During your exploration of Ollama models, you may encounter tags beginning with "q" followed by a number and sometimes a "K". Let me open one of the models; I will go with Gemma. Click on Tags and you can find the various variants; at the top you see the latest tag, and if you come down you can see the quantization variants q4_0, q4_1, q5_0, and also q2_K; you can find the various quantized
models. Every model in the Ollama Library comes with different tags that give you specific info about what it does; these tags are denoted by the text following the colon in the model's name. Here are some key points to understand about tags. The primary tag is commonly known as the latest tag, although it may not always represent the most recent version of the model; instead, it indicates the most popular variation. When the tag is not specified, Ollama automatically selects the model with the latest tag. Within the latest tag you can find details such as the model size (here, latest is 5.2 GB), the beginning of the sha256 digest (you can see this value here), and the age of that particular model variation (this one says two days ago). The right side of the tag presents the command required to run that specific version; here it is ollama run gemma for the latest version. You can also find commands for different variants like 7b and 2b_text, as well as different quantized versions like q4_0. You can even click into latest to find details like the model family, the parameters, the quantization (here 4-bit), and when it was published.

Key features of Ollama: the first one is automatic hardware acceleration.
Ollama automatically finds and uses the best hardware on Windows, like GPUs or CPUs with modern instruction sets; this speeds up performance and makes running AI models effortless, saving you time and resources. The second one is no need for virtualization: Ollama eliminates the need for virtualization or the complex environment setups typically required for running different models in AI development. The third one is access to the full Ollama model library: the platform grants unrestricted access to a comprehensive library of AI models. The next one is the always-on Ollama API: Ollama's always-on API seamlessly integrates projects with powerful AI capabilities without additional setup.

Refer to the GitHub page for more details on Ollama; you can download the macOS version, the Windows preview version, or the Linux version from there, and even the Ollama Docker image can be used to run Ollama on your local machine. Refer to the API documentation page to understand more about the Ollama REST API. As we discussed earlier, Ollama enables
various API endpoints to interact with LLM models. Some of the important APIs: generate a completion is POST /api/generate; you can see its parameters model, prompt, and images, plus optional parameters like format, options, system, and template, and you can also pass stream equal to false or true and keep_alive. This is the way we can invoke /api/generate, passing parameters like the model (llama2) and the prompt, and then we get the response. If you come down you can see the other APIs. Generate a chat completion is /api/chat; here again you pass the model name and messages, and as part of a message we can send a role (system, user, or assistant), the content, and even images, plus optional parameters like format, options, and template. There are other APIs as well: you can create a new model from a Modelfile; you can list local models through /api/tags, which returns all the models deployed on your local system; you can see a model's information through /api/show; you can copy a model with /api/copy; you can delete a model from your local machine with /api/delete; you can pull a model to your local machine through /api/pull; and you can push a model to the Ollama library with /api/push, though you must register at ollama.ai first. Another important one is generating embeddings with /api/embeddings: you pass the model and the prompt, and this returns the embeddings.

Here is the list of Ollama CLI commands. The first is ollama pull with a model name:
this pulls a model from the Ollama library to your local system. The next one is ollama run, again with a model name: this runs the specific model on your local system; if the model is not available on your system, this command first pulls it. Then ollama rm with a model name removes the specific model from your local system. The next one is ollama cp (copy), which copies a model to a different name. The next one is ollama list, which lists the models deployed on your local system. The next one is ollama serve, which starts the Ollama service on your system without running the desktop application. All the CLI commands internally invoke the APIs we discussed earlier.

The next one is the Ollama Python library, which helps integrate Python projects with Ollama. As a
first step we should install the ollama package; then you can use various functions like ollama.chat, where you pass the model name and the messages (each with a role and content) and get the response back, along with various options you can specify. There is also generate, where you specify a model and a prompt, as well as list, show, create, copy, delete, pull, push, and embeddings. All of these functions internally call the Ollama APIs.

The next one is the Ollama JavaScript library, which helps integrate JavaScript projects with Ollama. Again, as a first step install the ollama npm package; then you can invoke various functions like chat, generate, create, and list. Here too, these functions internally invoke the Ollama APIs. Additionally, Ollama supports various tools and libraries, including LangChain and LlamaIndex; you can refer here for more details.
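A minimal sketch of the Python library usage described above, assuming pip install ollama has been run and a local server is up with the phi model pulled; the make_messages helper is a hypothetical convenience, not part of the package:

```python
# Hypothetical helper to build the messages list that ollama.chat
# expects; each entry carries a role and its content.

def make_messages(prompt: str, system: str = "") -> list:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages

if __name__ == "__main__":
    import ollama  # third-party package: pip install ollama

    # Chat-style call: role-tagged messages in, message content out.
    reply = ollama.chat(model="phi",
                        messages=make_messages("Cite 20 famous people"))
    print(reply["message"]["content"])

    # Completion-style call: a bare prompt in, a response string out.
    result = ollama.generate(model="phi", prompt="Cite 20 famous people")
    print(result["response"])
```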
Let us now see how to run Ollama on a Windows system through a Docker container. As a first step, execute docker run with the volume, the port number, and the image name (ollama/ollama) to pull the Ollama Docker image to your system and run the container. This is mainly for CPU; if your system is enabled with a GPU, follow the different commands for GPU. Once this step is completed, you can execute specific commands with docker exec -it ollama, for example ollama run llama2 to run the Llama 2 model locally. I don't need to execute docker run, because I already executed it and the container already exists, so I'm going directly to starting the container with docker start ollama; you can also stop it with docker stop. The container is now started. Now I will execute docker exec -it ollama ollama, which displays all the available commands we already discussed.
Let me run a specific model now: ollama run phi. I already executed this model, so it won't download now, but if you run it for the first time the model is downloaded first and then executed. Now I can send a message; let me copy this. This may take some time based on your system capacity. Now you can see the model's response to your prompt, and you can send different prompts to the model. If you go to the Docker container logs, you can see the API request (the POST request) to the api/chat endpoint. To verify the server is running, you can go to localhost with the same port number we referred to earlier, 11434; this displays "Ollama is running". In the session you can type slash question mark to see more available commands, like /set to set session variables that will be used while invoking the LLM. There is also /show: if I execute /show info, you can see the family (phi2), the parameter size (3B), and the quantization level (q4_0). Like that, you can execute different commands; you can load a model or save the model, and typing /bye ends the session.

Let us now see how to use the Python library to perform LLM operations. As we
discussed earlier, as a first step install the ollama package. Import the ollama package; then we execute the generate command, ollama.generate, with the model name (phi) and the prompt "Cite 20 famous people", and we print the response from the result. You can also execute different commands like chat or completion. Let me go and execute this; now you can see the model response. Let
us now see how to use the JavaScript Ollama library to invoke the LLM operations. Ensure the ollama npm package is installed (npm install ollama), then import ollama from 'ollama'. We have some chat config: the model is phi and the prompt is "Cite 20 famous people", and we invoke the generate method, ollama generate, passing the model name and the prompt; then we just print the response. Let me now go and execute this; now you can see the response from the model.
Let us now see how to directly invoke the REST API. If you see here: curl, then the URL localhost:11434/api/generate (you can use the different APIs), then passing the data: the model (phi), the prompt (again, I'm using the same prompt), and stream equal to false. You can also pass some options if required; in the second command the options include the temperature and the maximum context size, again with stream false. Let me now go and execute this curl command. This is fine; now you can see the LLM response.
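The /api/chat endpoint described earlier can be invoked the same way; the difference from /api/generate is that it takes a list of role-tagged messages instead of a single prompt. A sketch, again assuming a local server with phi pulled:

```python
import json
import urllib.request

def build_chat_payload(messages: list, model: str = "phi") -> dict:
    """JSON body for /api/chat; each message needs a role
    ("system", "user", or "assistant") and its content."""
    return {"model": model, "messages": messages, "stream": False}

def chat(messages: list, model: str = "phi") -> str:
    body = json.dumps(build_chat_payload(messages, model)).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/chat", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    print(chat([
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": "Cite 20 famous people"},
    ]))
```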
Let me now install the Ollama Windows preview package. You can download the package (ollamasetup.exe) from here. Once it is installed, you can execute ollama serve to start the Ollama server. Now the server is started, so you can go to another command prompt, where you can execute ollama run with any model you need, and also execute any of the other commands we discussed. This is fine; now type any message. I'm going to type the same prompt again. Whatever options we earlier executed through the Docker container, you can execute through this Windows server as well. Now you can find the LLM response.

Let us now see how to use some of the other frameworks, like LangChain and LlamaIndex.
LangChain simplifies working with language models in Python and JavaScript by providing a library of common steps and concepts. I'm using LangChain along with Ollama to implement a RAG pipeline: it embeds my website content into a vector store and runs the LLM locally, and the LLM model returns the response based on the user query and the context stored in the vector store. To execute this Python script you should install some packages: langchain and langchain-community. Import the required packages from langchain-community and langchain. I am using the web-based loader to load the content from a remote web page, then the text splitter and the Ollama embedding model (model equal to llama2). Then I'm using the Chroma vector store to vectorize the documents and store them in the vector store, and the LangChain retrieval QA chain with the required model (Ollama) and the vector store as the retriever. Then execute the query; the question here is "what is connection timeout agent", because this blog is about the connection timeout in Apache Sling. Then you can print the response from the LLM model result.

Let me now execute the Python script ollama_Langchain.py. This will take time, because there is a lot of execution that needs to happen: the embedding, then the vector store, then the model needs to be executed. This error says the model llama2 is not on my local machine, so we should pull that model; for that, execute ollama pull llama2. Now the model is pulled to the local machine and you can execute the Python script, so let me do that. This will take time based on your system capacity; I already executed this, so let me show the output. If you see here, the LLM provides the response based on the user prompt and the context provided by the blog data.
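The pipeline just described can be sketched roughly as follows. The URL and query are placeholders, and the import paths match LangChain 0.1.x (with langchain, langchain-community, and chromadb installed); newer releases may have moved some of these names. The small chunk_text helper only illustrates what the splitter does:

```python
def chunk_text(text: str, size: int, overlap: int) -> list:
    """Naive fixed-size chunking with overlap, to illustrate the idea;
    LangChain's splitter is smarter about sentence boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

if __name__ == "__main__":
    from langchain_community.document_loaders import WebBaseLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_community.llms import Ollama
    from langchain.chains import RetrievalQA

    # 1. Load the web page and split it into overlapping chunks.
    docs = WebBaseLoader("https://example.com/blog-post").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=512, chunk_overlap=64).split_documents(docs)

    # 2. Embed the chunks with a local Ollama model; store them in Chroma.
    store = Chroma.from_documents(chunks, OllamaEmbeddings(model="llama2"))

    # 3. The QA chain retrieves relevant chunks as context for the LLM.
    chain = RetrievalQA.from_chain_type(
        llm=Ollama(model="llama2"), retriever=store.as_retriever())
    print(chain.invoke({"query": "What is connection timeout agent?"}))
```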
Let us now see how to use LlamaIndex. LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. I have a simple one-page document that talks about Adobe Experience Manager, and I'm going to use this document as context to create the response from the LLM model. This is the Python script. As a first step, install all the required packages: llama-index, llama-index-llms-ollama, and llama-index-embeddings-huggingface; these are all the packages I installed. Then import the required packages. Here I am using the llama2 model (Ollama, model equal to llama2), and you can specify the request timeout; I'm specifying a higher value because my system capacity is not that great, so it will obviously take time, and you can set the maximum time based on your system capacity. Now I'm reading the PDF documents with the simple directory reader. I'm setting the context: Settings.llm is the LLM defined here, and for the embed model I'm using the Hugging Face model BGE small en v1.5, with a chunk size of 512 and a chunk overlap of 64, which you can change based on your use cases. Then the index: vector store index from documents, with the documents I read from that folder. Then the query engine: index dot as query engine. I'm sending the query "what is AEM"; the response to this content is available in my PDF document. Then the response: query engine dot query, executing the query and printing the response. The execution will take time, so I will show you the response received from my earlier execution of python ollama_llamaindex.py. This is the response; it is based on the data provided from the PDF document ("what is Adobe Experience Manager") and it explains the details.
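The steps above can be condensed into a short sketch. It assumes llama-index 0.10-style packages (llama-index, llama-index-llms-ollama, llama-index-embeddings-huggingface) and a documents/ folder holding the PDF; the approx_num_chunks helper just shows how the chunk settings relate to the number of chunks produced:

```python
import math

def approx_num_chunks(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Rough count of chunks from fixed-size splitting with overlap."""
    if total_chars <= chunk_size:
        return 1
    return math.ceil((total_chars - overlap) / (chunk_size - overlap))

if __name__ == "__main__":
    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.llms.ollama import Ollama
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    # Local LLM; a generous timeout helps on low-end machines.
    Settings.llm = Ollama(model="llama2", request_timeout=300.0)
    # Local embedding model pulled from Hugging Face.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 64

    # Read the PDF(s), build the vector index, and query it.
    docs = SimpleDirectoryReader("documents").load_data()
    index = VectorStoreIndex.from_documents(docs)
    print(index.as_query_engine().query("What is AEM?"))
```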
Let us now see how to use Open WebUI, a user-friendly UI to execute LLM operations on Ollama models. You can run it using a Docker container: docker run with the port number; you can use this command to execute the UI. I already installed the container, so I'm going to start it. Okay, now the Open WebUI container is running; ensure Ollama is also running. Now access the UI at localhost:3000. You should sign up for an account; this is a local administrator account. I already completed that, so I'm going to sign in. Now I am signed in. If you see here, "Select a model": you can see the different models available on your local system, and if you need any additional model you can run ollama run with the model name, as we did earlier. Now I select Phi; this is fine, and I can go and execute the prompt. Now you can see the LLM response.
Additionally, you can enable some of the settings. If you click here, you can see General, where you can specify the system prompt, notifications, and even advanced parameters. You can specify Connections: if you see here, the Ollama API URL, and you can even specify an OpenAI API key and base URL. Then Models, where you can perform some of the model operations: you can enter the model name or tag and download the model, or select a model to delete it. Under Interface you can perform some changes, and Audio, Images, and Chats let you specify some configurations, along with the local account discussed earlier. You can also use the various tools and frameworks along with Ollama, as we saw.

Ollama is a fantastic application that facilitates the quick execution of LLM models on your local machine. It offers a library of quantized models that can be directly run on your local machine. Additionally, Ollama provides APIs that can be seamlessly integrated with various applications. Moreover, it supports interaction with different libraries, such as Python and JS, as well as integration with various tools and frameworks like LangChain and LlamaIndex. That's all for today; thank you all for watching the video, and see you in the next video.