Okay, let's get started. Hello everyone, welcome to ACES: Introducing Intel PVC GPUs. Before we start, I want to make sure that everyone can hear me. Tanishq, can you confirm that you can hear me and that you can see my screen? "Yeah, I can see your screen and I can also hear you." Okay, thank you.
Hello everyone, welcome to ACES: Introducing Intel PVC GPUs. My name is Zhenhua He, and I am the instructor for today's short course. Akhil and Druva also worked on these course materials and prepared the presentation slides, and Richard will be the instructor for today's LAMMPS molecular dynamics section; he will demo how to use LAMMPS
on our ACES computing cluster using Intel PVC GPUs. Here is the outline for today's short course. First, we will introduce the Intel PVCs on ACES: we will briefly cover the Intel PVC architecture and the ACES computing platform at Texas A&M High Performance Research Computing. Second, we will demonstrate how to run models from different frameworks, such as PyTorch and TensorFlow, with the PVC GPUs on the ACES system. Here we will be using the Intel AI models; we will run, for example, ResNet-50 on ACES with the PVC GPUs, and afterwards you can open the output file and see the performance of the GPUs, for example the image throughput. The third part is about how to convert a PyTorch model to run on PVC: we will show you the conversion steps, and then you will do some hands-on exercises and run your code on a PVC GPU; we have prepared hands-on exercises for you. We will show you how to do the same for TensorFlow as well. The last section is about LAMMPS on PVC: Richard Lawrence will demonstrate how to run molecular dynamics simulations in the LAMMPS framework with the PVC GPUs on the ACES system. The first lab is about introducing Intel
PVC GPUs on ACES. Here is a picture of the Intel Data Center GPU Max Series PCIe card; the series includes different GPUs, such as the Max 1100 and the Max 1550, and on our ACES computing platform we have the Intel Max 1100 GPU. First I would like to introduce our ACES computing cluster. ACES stands for Accelerating Computing for Emerging Sciences, a project funded by NSF. The first mission of this project is to offer an accelerator testbed for numerical simulations and AI/ML workloads. On the ACES cluster we have various types of accelerators, such as Intel PVC GPUs, Graphcore IPUs (IPU stands for Intelligence Processing Unit), NVIDIA GPUs, and FPGAs, for example Intel FPGAs and Bittware FPGAs. We welcome researchers from different research fields to use these accelerators to speed up their workflows. The second mission is to provide
consulting, technical guidance, and training to researchers. We have been hosting workshops, tutorials, and short courses like this one, so researchers can learn through hands-on exercises to use the novel accelerators on our ACES computing platform. The third mission is to collaborate on computation- and data-enabled research. We have been collaborating with different HPC centers, such as the San Diego Supercomputer Center, the Texas Tech Advanced Computing Center, and the University of Florida, on benchmarking projects, and we have also published papers on our results. Here is a table of the accelerators on the ACES computing platform; you can check the quantities of the different accelerators on the ACES Computing
cluster. We have 32 Graphcore IPUs: 16 of them are Colossus IPUs and 16 are Bow IPUs (Bow is a slightly more advanced IPU than Colossus, so we have both). We also have Intel FPGAs, Bittware FPGAs, NextSilicon coprocessors, and NEC Vector Engines. In addition, we have Intel Optane SSDs: if you have an application that requires a large amount of memory, these Optane SSDs can be addressed as memory with the MemVerge Memory Machine. So if you have any such application that requires large memory, we would like you to test our Intel Optane SSDs; we have 18 terabytes, 48 of these in total. We also have NVIDIA H100 GPUs, 30 of them, as well as NVIDIA A30 GPUs. Today we are using the Intel PVC GPUs on the software development platform called
Intel Arctic Sound (ATS-P); we have 22 Intel PVC GPUs there. This is the Intel Max 1100 GPU on our ACES computing cluster. It has one tile per stack per card, with 56 Xe HPC cores and 448 execution units, so you can calculate that it has eight execution units per core. The TDP (thermal design power) is 300 watts, it is a PCIe Gen 5 x16 card, and it has 48 GB of HBM2e memory with a memory bandwidth of 1.2 terabytes per second. The peak performance is 22 teraflops at FP64 precision. Those are some of the specs of the GPU; if you are interested in knowing more, there is more information on the Intel website about the Intel Max GPU series, including the 1550. Next we would like
to talk about the Intel oneAPI toolkits. oneAPI is a collection of development tools from Intel that lets users develop high-performance applications across multiple architectures, such as Intel CPUs, GPUs, and FPGAs. There are also add-on domain-specific toolkits, for example the Intel oneAPI tools for HPC, the Intel oneAPI tools for IoT, and the Intel oneAPI rendering toolkit. We will not use those today, but if you have any interest in them, we encourage you to look at the Intel documentation. Today we are going to use the Intel AI Analytics Toolkit, and we will run some machine
learning models with it. To make it easier for our users to use the ACES cluster and its accelerators, we have created some shared directories on ACES. For example, we have saved some datasets there: an ImageNet dataset for PyTorch and one for TensorFlow, all preprocessed, so you can use them directly with your model. If you want to train your model with the ImageNet dataset, you can use them as-is, and we also have the TensorFlow ImageNet dataset in TFRecord format. As for the models, we have downloaded the Intel AI models so you can try out the different models. We have also provided containers in the shared directory: the Intel Deep Learning container, which has been converted to a Singularity image so you can use it on our cluster if you are interested. Next I would like to introduce some of the
resources we provide. The first one is our web page, Texas A&M High Performance Research Computing. If you click on this link... now I am at the TAMU HPRC web page. Can you still see my web browser? Give me a thumbs up if you can see my Google Chrome browser. Okay, thank you, Scott. You can see we have a lot of tabs here, and today we are going to use the tab to access ACES via the ACCESS portal. You can also check the status of our computing clusters; Faster and Grace are under maintenance today, so you can see all their numbers are zero. The next item is the ACES quick start guide: if you click on it, we have the ACES quick start guide here, and this is our knowledge base for the different computing clusters. You can see we also have FASTER, Grace, and Terra, so
if you click on the user guide, we have the ACES user guide here, where you can learn about the different accelerators, for example the Intel PVC GPUs. It has an introduction to the Intel PVC GPUs: on our ACES cluster we have six PVC Slurm nodes, each with four Intel PVC GPUs, and it explains how to access these GPUs both interactively and through Slurm batch jobs. You can also monitor PVC GPU utilization with commands such as sysmon and xpumcli; I will introduce these in my presentation and demos as well. Next we will be using
the ACES portal to access our ACES computing cluster; I think most of us here are ACCESS users. Here is the link to the ACCESS documentation, which tells you how to apply for an ACCESS ID, and here is the HPRC YouTube channel, where we have posted 94 videos about our short courses. You can see, for example, Introduction to Containers: Charliecloud, A Hierarchical Module System, Introduction to Julia, and many others. If you are interested in knowing more about TAMU HPRC and our short courses, you can subscribe to our YouTube channel; you are very welcome to watch our videos and learn from our resources. This is our helpdesk email; you can send us an email if you have any questions. Next is how to access the ACES computing cluster. We will be using the ACES portal, the web-based user interface for the ACES cluster. It is based
on Open OnDemand, an advanced web-based graphical interface framework for HPC users; this is the Open OnDemand portal. Next you authenticate via CILogon: you choose ACCESS CI (XSEDE), provide your ACCESS username and password, and click Log On. After that you will need to complete Duo two-factor authentication; for example, I will have to approve it on my phone. That is the process to get onto our ACES cluster. Then we click on the tab Clusters > ACES Shell Access to get a shell, and you will see the login node name in the prompt. Let's do it together; I will show it now. Please follow along, so that you will be able to work on the hands-on exercises and then run some demos with our Intel PVCs. Hover your mouse over the Portal tab and click on ACES Portal (ACCESS). This will redirect us, and we will select the identity provider ACCESS CI (XSEDE), and I'll
just click Log On. I have saved my ACCESS username and password; you may need to type yours. After that I click Log In, and here I will need to do the Duo two-factor authentication; I will approve it on my phone. Now we are on the ACES OnDemand portal, and you can see the different tabs here. Before I continue, I would like to know if everyone is on the ACES OnDemand portal now. Okay, Elizabeth, did you encounter any error, or do you need more time? "I needed time; I have to get to this site. I'm using another system so that I can be looking at it. I want to know which website..." Can you repeat your question? "Is it Texas A&M High Performance Research Computing? That is what I want to know: which one to use on my end. From here, which one to use?" We use the ACES portal. Okay, once you click on the ACES portal, as I did before, you will need to choose ACCESS CI (XSEDE), then click Log On, then provide your ACCESS username and password, go through the Duo two-factor authentication, and click Log In. It should bring you to the ACES OnDemand portal.
Now I would like to introduce the features of the ACES OnDemand portal. For example, if you click on the Files tab, you can see my home directory and my scratch directory; if I click on scratch, you can see the files I have, and we will be downloading these. I would also like to show you how to use the file editor on the portal: when you see a file, for example here, you can use Edit. We will do this a little later as well, because when we edit, I think it will be easier for you to edit in this file editor; if you prefer to use vim, of course you can use vim as well. The next tab is Jobs, where we have Active Jobs and the Job Composer: when you submit a job to the ACES cluster, you can check the status of the job, whether it is completed, pending, or running. Under Clusters, we will be
able to start the ACES shell, and we also have different interactive apps here: VNC (including a NextSilicon VNC), Jupyter Notebook and JupyterLab, RStudio, and TensorBoard. We also have Affinity Groups; you are welcome to look at our ACES group and our HPC group. There is also a Dashboard, which shows you what is going on, for example your disk usage and limits. I think it is under maintenance, so maybe it does not show the disk-usage numbers, but you can try out these dashboards; I think this one is probably the new one. After that, please click on Clusters and then ACES Shell Access, and we will be on a login node of the ACES computing cluster. I am on login node one; I think we have several login nodes. The Clusters drop-down menu, please; yes, that's the one. You should click on ACES Shell Access; you click on this, and it will direct you to
here. Now I am on login node one, so we are all on a login node of the ACES cluster; this is a success. Welcome to the ACES login node. Next, from the login node we can check the status of the PVC Slurm nodes: we can view the PVC nodes and the number of GPUs. You can use these commands; if you have the course presentation slides, you can copy and paste the commands directly into the terminal. For example, here I will run pestat; -p selects the partition and -G shows the number of GPUs. Now you can see we have six PVC Slurm nodes in the pvc partition, each with four PVC GPUs, and their state is reserved: they have been reserved for this short course, so today we are going to use these PVC GPUs. Next we will need to copy the training materials to your personal directory. First, run cd $SCRATCH to navigate to your personal scratch directory. We have saved the files for this short course at /scratch/training/aces_pvc_course, so you will use the command cp -r /scratch/training/aces_pvc_course $SCRATCH to copy the training materials to your personal directory, and then you can enter your local copy. I have already downloaded this into my scratch directory, so I will give you about two minutes to copy the training materials to your personal scratch directory; it should look like this.
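The commands from this section can be collected into a short shell session (a sketch for the ACES cluster; the /scratch/training path follows the slides, so adjust it if your copy of the materials lives elsewhere):

```shell
# Check the PVC Slurm nodes and their GPU counts (-p selects the partition, -G shows GPUs)
pestat -p pvc -G

# Navigate to your personal scratch directory
cd $SCRATCH

# Copy the shared training materials into your own scratch space
cp -r /scratch/training/aces_pvc_course $SCRATCH

# Enter your local copy
cd $SCRATCH/aces_pvc_course
```

These only work on the ACES login nodes, where pestat and the shared training directory are available.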
The second lab is about using PVCs on the ACES computing cluster. These are mostly demos, but you are encouraged to follow along and run the models, in the PyTorch and TensorFlow frameworks, with the Intel PVC GPUs. After running the models you will get an output file that shows the performance of the Intel PVC GPUs as well. First we need to set up the environment; we will start with the PyTorch models. We have actually prepared a job file where the environment setup is already included, but we will go through it line by line so you know how the environment was set up. First we run module purge to clean your environment, and then we module load intel/AIKit, version 2023.2.0; I think this is the latest version, but I need to check. This is the Intel AI Toolkit, which has been installed as a module on our ACES cluster, so you can simply do module load intel/AIKit and select a version; I think we have another version as well. Then we module load the Intel toolchain, and we will be using version 2023.07. Next we set an environment variable called ENV_NAME; for this demo
we will be using PyTorch, so we will clone the environment from the AI Toolkit into your own directory. We set the variable to a name like aikit-pt-gpu-clone, meaning a clone of the PyTorch GPU environment, and if that environment does not already exist, the script clones it from the AI Toolkit: conda create -n $ENV_NAME --clone aikit-pt-gpu. This clones the PyTorch environment from the AI Toolkit, and then we can source activate the environment. This is all in the pt_demo.slurm job file; if you have copied the training materials to your personal scratch directory, it should be in the pytorch folder. We will go there and look at it together. Okay, I'll go back to the terminal, and if you
run ls and cd pytorch, you can see the job file here. Has everyone... can you open the file in the file editor pane instead of vim? Yes, that is also okay; if you use vim, you can open the file from here, and if you want to use our file editor, that is okay too. I think at the beginning you should enter the Files tab and click on scratch/user and your username. If you click on that, I believe you will see the aces_pvc_course directory you copied, and we have three subfolders, one of which is pytorch. If you click on pytorch you will see some files; ignore this one, because I ran this job this morning, so you can see I have some output files here. But I want you to open this Slurm job file: you can click on this button, and then you can see Edit; click on Edit and you will be able to open it from here. Oh, sorry. The first part is about the job specifications: you
are requesting resources, and you will see the job name. The requested wall time is one hour; we request one task, with eight CPUs per task; we will use one node; the memory is 100 gigabytes; and the output file will be pt_demo followed by your job ID. With gres we request a GPU: we want one PVC GPU, from the pvc partition, and today, as I said, we will be using the reservation called training. So those are the resources we requested. The next part is the environment setup I explained to you; then we activate the environment, cd to the ResNet-50 training folder of the Intel AI models, and run the training script. The script is prepared by Intel, so we can just run it and see the performance of the Intel PVC GPUs. Okay, I will come back here; do you have any questions so far? Can you see the file in the file editor on the Open OnDemand portal? Next we will submit the job to the ACES cluster.
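Putting the pieces together, the pt_demo.slurm job file described above looks roughly like this (a sketch reconstructed from the walkthrough, not the literal course file; the module versions, the clone guard, and the training-script path are illustrative, so check the file in the course materials for the exact lines):

```shell
#!/bin/bash
#SBATCH --job-name=pt_demo           # job name
#SBATCH --time=01:00:00              # one hour of wall time
#SBATCH --ntasks=1                   # one task
#SBATCH --cpus-per-task=8            # eight CPUs per task
#SBATCH --nodes=1                    # one node
#SBATCH --mem=100G                   # 100 GB of memory
#SBATCH --output=pt_demo.%j          # output file: pt_demo.<job ID>
#SBATCH --gres=gpu:pvc:1             # one Intel PVC GPU
#SBATCH --partition=pvc              # the pvc partition
#SBATCH --reservation=training       # today's course reservation

# Environment setup, as walked through above
module purge
module load intel/AIKit/2023.2.0
module load intel/2023.07            # the Intel toolchain; check 'module avail intel' for the exact name

# Clone the AI Kit PyTorch GPU environment if it does not exist yet, then activate it
ENV_NAME=aikit-pt-gpu-clone
conda env list | grep -q "$ENV_NAME" || conda create -y -n "$ENV_NAME" --clone aikit-pt-gpu
source activate "$ENV_NAME"

# Run Intel's ResNet-50 training script (the exact path and script name are in the course job file)
cd <path-to-intel-ai-models>/resnet50/training
bash <training-script>.sh
```

The angle-bracket placeholders stand for paths given in the course materials; this sketch is only meant to show how the pieces fit together in one file.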
sbatch pt_demo.slurm submits the job to the ACES cluster. Of course, you can use pestat -p pvc -G, and we can watch it; I will use -n 5 to refresh every five seconds. Okay, we can see some students are on the PVCs now; let's see. Yes, we can see some users using the PVC GPUs, that's great. And I think my job is done; I can use squeue -u u.zh (this is my NetID), and you can see I have no jobs running, so I think my job has finished. The job ID is 29466, so we can go back to the file editor; I'll just refresh this page. If I open the output file, you can see it tells you that you are using
the bfloat16 precision for training ResNet-50, and that you are training on a generated dataset: because I did not specify a dataset directory, it does not use the real ImageNet data. Of course, if you want to train the model on the real dataset, you can specify the path to it. The ResNet-50 training run here is very small, because I only let it run the first 20 iterations, so it is very fast; of course you can let it run more epochs if you really care about getting a well-trained model, but for demo purposes I just let it run about 20 iterations. You can also see the training performance: the batch size we use is 256, and the throughput is 1230 images per second. So this is the output file. If you want to run more epochs, you can do that after the class: in the job file here (let me go back to the Slurm file), you can set the path to the dataset, specify the precision that you want to run the model with, and set the batch size as well; I just used some default parameters for demo purposes. That's great, some students are using the Intel PVC GPUs
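The tunable parameters just mentioned are typically passed as environment variables in the job file before the training script runs; a sketch (the names DATASET_DIR, PRECISION, and BATCH_SIZE follow Intel's reference-model conventions, but check the course job file for the exact variable names):

```shell
# Optional overrides in pt_demo.slurm before the training script is invoked
export DATASET_DIR=$SCRATCH/imagenet   # path to the real ImageNet data; omit to use generated data
export PRECISION=bf16                  # e.g. bf16 or fp32
export BATCH_SIZE=256                  # the default used in the demo
```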
That's a great start. Let me go back to my presentation slides. Actually, I know some researchers prefer using a Python virtual environment; this is just an alternative, for reference. Please do not type these commands into your terminal now, because we will not use the Python virtual environment today; you can try this after the class. I have tried all of these commands already, so they should work. I would like to point out that when you use the AI Toolkit, torch and the Intel Extension for PyTorch have already been installed for you, so you do not have to worry about them. But if you want to install them yourself, you have to be careful about the versions: here, for example, torch is 1.13, so you have to use the matching version of the Intel Extension for PyTorch, which is also 1.13. And when you want to do distributed training, you will also need to install another library, the oneCCL (Intel oneAPI Collective Communications Library) bindings for PyTorch; again you need to find a version that matches the other packages. These are all available on the Intel website, where you can find the matching versions of the different packages, if you want to try the Python virtual environment. But today we don't have time to work on this. So: we changed to the pytorch directory and submitted the job file already; that's great. Okay, next is about TensorFlow, but it's 51 minutes in now,
so I want to take a short break. How about we take a seven-minute break and come back at 11? Let's get back at 11, and we will continue from the TensorFlow models.
Hello everyone, welcome back. We will continue with the TensorFlow ResNet-50 model demo. First things first, we also need to set up the environment for the TensorFlow models, and similarly we use the Intel AI Kit and just clone the environment. Why do we want to clone the environment? Because then you have the freedom to install other packages into it for your own project; for example, you might need tqdm or TensorBoard, which are not included in aikit-tf-gpu, but once you clone the environment, you can source activate it and install the modules you need for your own project. For today's short course we don't need anything more, but it is useful for the future. This is all in tf_demo.slurm, so we change directory to the tensorflow folder and then sbatch tf_demo.slurm. Let's see; my terminal is still active. I need to change to tensorflow, and you can see the tf_demo.slurm. So we use the
o we use the
command sbatch tf_demo.slurm and you can see your job ID here 29487 this is my job ID and
also I can for example watch squeue I will try --me yes Richard is working thank you so
you can see my job is running ac026 PVC node in the slurm so it's working and also we can
watch pestat -p PVC -G so we have some users using these PVCs so it's running I guess this tensorflow
demo is a little longer it takes about five minutes so if you have any questions you can
let us know okay we ha
ve been using this user is using ac086 please give a you can either
thumb down if you have a problems running the jobs so while it is running I would like
to go back to my slides and see what is the next okay the next one is about converting
your pytorch code to run on PVC so this is the next one but I still want to see how many
students are on the PVC now we have a few more okay while it's running I will continue
the third lab pytorch on PVC so if you have a pytorch model pytorch code how
can you convert it and make it work on the Intel PVC GPUs? First we would like to introduce the Intel Extension for PyTorch, a Python package that extends PyTorch models to run on Intel platforms such as the Intel PVC GPUs. The first step is to add the following import statement at the beginning of your script: import intel_extension_for_pytorch as ipex. That is the first thing we have to incorporate to make it work on Intel PVC GPUs; very easy, right? The second step is to move your model and criterion to the xpu device, which is also easy: once you have created your model and criterion, you just do model.to("xpu") and criterion.to("xpu"). The third step is to apply the ipex.optimize function to the model and optimizer objects: ipex.optimize takes your model, your optimizer, and here a data type of bfloat16. It also supports fp32; we try bfloat16 here, and after the class you can try fp32 as well, but the data type has to be consistent with what we use in the fifth step, where we will be using bfloat16. Also, bfloat16 gives a better throughput compared to fp32, so we choose bfloat16 here. The fourth step is to move the data and target to xpu, very similar to other GPU code: in the training loop, when you unpack your train data loader, you do data.to("xpu") and target.to("xpu"). These are very simple changes. As I said, in the fifth step we use automatic mixed precision with the bfloat16 data type: we use the context manager torch.xpu.amp.autocast, which accepts the arguments enabled and dtype; to be consistent with the previous step, the dtype is torch.bfloat16. Those are all the steps: just a few lines of change in your code, and it will be able to run on the Intel PVCs. Here I have prepared
an exercise file called cifar10_pvc_todo.py, and I have listed these steps in the to-do file, so you can use the file editor of the Open OnDemand portal to add all these changes to the file. Read through the code and familiarize yourself with how to change the PyTorch code to PVC code. After completing all the to-dos in the cifar10_pvc_todo file, you can modify the Slurm file and submit the job; you will just have to change the Python command, for example changing the file name to cifar10_pvc_todo.py. Okay, let me show you. I will go back to my pytorch folder and then to the exercises folder; here you can see the to-do file, cifar10_pvc_todo. Of course you can use our file editor; I'll just close some of these, too many open. This is for PyTorch: click on the exercises folder, and we have this to-do file. You can click on this and then Edit.
You can see Step One: import the package Intel Extension for PyTorch and give it the alias ipex; write your code below. You will have to replace the FIXME with the correct code, as shown in the slides; that is the first step. After that you can read through the code, and in Step Two you move the model and the criterion to xpu: you will need to do model = model.to("xpu") and likewise for the criterion, again replacing the FIXME with the correct code. These are the steps to change it to PVC code. If I go back, you can see we have a solution, but I do not encourage you to look at the solution now; I want you to first complete the exercises by yourself and then submit the job to our computing cluster. We also have another file called cifar10_cpu, which does not have the model-to-xpu lines, or rather I just commented all of those lines out, so the PyTorch model will run on a CPU. It will take a long time, so I do not encourage you to run that file during this class; if you are interested, you can run it after the class. It takes several hours, while on the PVC it takes just a few minutes; we just want you to feel the speedup from using the Intel PVC GPUs. Now I will go back to this slide, and I will give you 10 to 15 minutes to work on the exercises and complete the to-do steps in this cifar10_pvc_todo file.
s cifar10_pvc_todo
after that you can submit the job I have prepared the solution I always want to use the vim but
I think I want to advertise our file editor here so you can see this is the solution
and the code changes are like you import the Intel extension for pytorch and also you
move the model and the criterion to xpu and we use the ipex optimize and you take the model
Optimizer and a data type and for here we in the training loop we move the data and target to
xpu and also we use th
is torch xpu amp AutoCast manager and then to wrap it so this is just a
few steps so you're welcome to try it on your own code okay I guess we will continue tensorflow
on PVC. TensorFlow on PVC also has a corresponding package, called the Intel Extension for TensorFlow; for PyTorch we had the Intel Extension for PyTorch. It enables your TensorFlow model to run on xpu devices such as GPUs, CPUs, etc. You can check the version with itex.__version__ (double underscores) and print it. The good thing is that if you install the Intel Extension for TensorFlow in your environment, the default device will be the Intel GPU: for example, if you have an Intel PVC and this package is installed in your environment, it will automatically use the Intel PVC GPU, and you do not have to make any code changes. That is great news, right? If you use TensorFlow, you just need to install the Intel Extension for TensorFlow into your environment, and then without any code changes it will pick up the Intel PVC GPU and run your model with it. So this is great news for the TensorFlow users. Here is a very simple exercise: you do not have to change anything in the code; you just need to modify the Slurm job file tf_cifar10_pvc.slurm and submit your job, and you can see it runs on the PVC already. This is a screenshot of the job file; you can uncomment these two lines to let it run on the PVCs.
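As a quick sanity check of this behavior, a small script like the following can be used (a sketch; it assumes TensorFlow and the Intel Extension for TensorFlow are installed, and the XPU device only appears on a machine with an Intel GPU):

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex

# Print the extension version (note the double underscores)
print(itex.__version__)

# With the extension installed, Intel GPUs appear as XPU devices,
# and TensorFlow places ops on them by default with no code changes
print(tf.config.list_physical_devices("XPU"))
```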
So let me go there and try it. Let me see what the change is: cd tensorflow, and I guess it is in the exercises folder. We have the Slurm file here; I'll open it by name so you can see it. Please uncomment these two lines, yes, here, and then submit the file. I have a job ID, 29502. I'll just watch squeue --me; it is pending now: requested node not available. Let me see what happened there. Okay, it is because all the nodes have been reserved: if you do not specify the reservation, your job will not be able to find a node. So let's add this sbatch directive, --reservation=training; let's add this line and try again, because otherwise it cannot find a node it can use. Now I have the new job here, and because I specified the reservation, it can find a node; it is using ac086, and it is running. I will check; you can also check with pestat -p pvc -G. Oh okay, we have some users using these PVCs, that's great news, and I myself am using ac086; seemingly I am all alone on this node, so I need somebody to get on this node as well. Okay, this is very easy, and I would
like to actually look at the file. You can see that we have imported the Intel Extension for TensorFlow, but the code does not actually use itex at all, so I think if you have the Intel Extension for TensorFlow installed, you may not even need to put this import line in. Let me try; I am very curious about this: if I do not use these two lines, will it still land on the PVC nodes, or will it give me an error? First, let me sbatch the file; okay, 29507. I will watch squeue --me; it is also running on ac086. How can I check if it is really using the PVC GPUs? We want you to always follow good practice: when you run this, you may want to use the VNC for checking the PVC usage. I will introduce that a little later, but we can see it is landing on the PVC node. Do you have any questions so far about TensorFlow? TensorFlow is the easy one, right? You do not have to make any code changes. Okay, I think I will move on; we have
some PVC monitoring tools. I just introduced pestat -p pvc -G to show the number of GPUs; you can use it to check the availability of the PVC GPUs. We can also monitor PVC system activity with sysmon, which tells you the memory being used on the PVC and other information as well. Another one is the Intel XPU Manager: with the xpumcli command you can use stats -d and give it a device index, for example 0, and it will show the memory used and the GPU utilization. Oh, and if you just run sysmon now it fails, because you are on the login node, not a PVC node, so the sysmon error is expected; only once you land on a PVC node can you use sysmon.
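As a quick reference, the three monitoring commands just mentioned, with flag spellings as given in the talk (run the last two on a PVC node, not the login node):

```shell
pestat -p pvc -G    # per-node status for the pvc partition, including GPU availability
sysmon              # PVC memory in use plus other per-GPU information
xpumcli stats -d 0  # Intel XPU Manager: utilization and memory for device index 0
```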
Here is how we can directly land on a PVC node: we can use VNC. You can start a VNC session; I introduced the interactive apps earlier. Click on VNC and choose the node type, for example Intel GPU Max, and the number of GPUs; here I choose one, but if you have distributed training you can choose more, for example two, three or four. Then set the number of hours, the number of cores and the total memory, and after a short wait you will see a blue button called Launch VNC; clicking it brings you here. In the VNC session we can run some jobs. I don't know why I covered my username in the screenshot, but that's okay. Here I change directory to the PyTorch exercise folder, module load the Intel AI toolkit and the Intel toolchain, and activate our cloned environment called aikit-pt-gpu-clone. Then on this node I run the cifar10_pvc solution file, and after that you can use watch sysmon to see the memory usage; it shows that your process is attached to this GPU. This is good practice for checking the PVC memory usage, but if you want to check the power and temperature of the GPU, you probably want to run xpumcli stats -d with the device index. For today I will just show this; I tested it yesterday. You can also use your VNC session to check the PVC usage. I think that's all for the AI/ML part, in PyTorch and TensorFlow.
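The interactive VNC demo above, replayed as a rough command sketch; the environment name is the course's own, while the path, module names and script name are paraphrased from the talk and may differ on your system:

```shell
cd ACES_PVC_course/pytorch          # the PyTorch exercise folder (path illustrative)
module load ...                     # the Intel AI toolkit and Intel toolchain modules
source activate aikit-pt-gpu-clone  # the cloned conda environment from the course
python cifar10_pvc_solution.py &    # run the solution script on this PVC node
watch sysmon                        # shows GPU memory use and the attached process
```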
Next, I would like to invite my colleague Richard Lawrence to show some demonstrations, and maybe exercises, on running LAMMPS, the molecular dynamics code, on PVC GPUs. Richard, are you there? Hi, this is Richard; I'm a user support specialist at Texas A&M, and I will be talking to you about running LAMMPS on PVC GPUs. I hope you can see my screen; it should look pretty much the same as the other one. So, what is LAMMPS? Has anyone heard of LAMMPS before? LAMMPS is a molecular dynamics simulation engine, and it has a modular back end for GPU acceleration. There are two main LAMMPS packages that serve that purpose. The GPU package uses OpenCL or CUDA to communicate with GPUs, and it divides work between the GPU and the CPUs. There's also the Kokkos package, which uses a different programming model entirely and supports many more types of accelerators, including different types of CPUs.
The default strategy when you use a GPU with the Kokkos package is to put every calculation on the GPU. I've used both of these and they both work well. Today we will be focusing on the GPU package, which divides work between the GPU and the CPUs; for other purposes the Kokkos package may be better, I just don't have an example prepared. What's different between the two is primarily a slight change of syntax when you launch your executable, depending on which package you're using to communicate with the GPUs; mostly your input files are not going to vary very much, since LAMMPS is pretty good about behaving as though this is a modular back
end. On ACES we provide a build of LAMMPS that's compatible with the Intel GPUs, and I have put it in a module: the LAMMPS module with the version named 3Aug2023-intel-2023.07. The first phrase, 3Aug2023, refers to the release of LAMMPS itself, from that date, and the second phrase, intel-2023.07, refers to the version of the Intel compilers used to compile the LAMMPS code into executables. Because the module system on ACES is hierarchical, you must load your toolchain before you load your application, so the syntax is module load intel/2023.07, then module load LAMMPS/3Aug2023-intel-2023.07. However, because there are only a few things available on ACES that are specific to this toolchain, it's actually good enough to just say module load intel/2023.07 LAMMPS; there's only one reasonable option the module system can consider, so it picks the right one, which saves you a few keystrokes. Where is this installed? When you load the module it sets an environment variable, the HPRC root for LAMMPS, which points to the directory where this build is installed, so we can take a look and see where it is; you could consider this a small exercise if you wish: module load intel/2023.07 LAMMPS. Sorry, it's hard to see on my screen; I'm trying to type an underscore, and it's kind of invisible.
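To recap the loading steps as commands; the module and variable name spellings follow the talk, so check module avail on ACES for the exact forms:

```shell
module load intel/2023.07
module load LAMMPS/3Aug2023-intel-2023.07
# or, since only one LAMMPS matches this toolchain:
module load intel/2023.07 LAMMPS
echo "$HPRC_ROOT_LAMMPS"   # install prefix set by the module (variable name per the talk)
```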
Anyway, this particular build of LAMMPS provides an executable named lmp_oneapi. If you recall, oneAPI is Intel's name for the collection of compilers meant to be used across multiple types of devices. It is common practice in the LAMMPS community to name your executable after the type of devices it is meant to be used with, so for example you might see lmp_mpi or lmp_cuda; these are common. So Intel has followed that paradigm and decided that this executable will be named lmp_oneapi; other than the name, it functions identically to all the other LAMMPS executables you may have seen. Now, because this build of LAMMPS distributes work between the CPUs and the GPUs, we will be using MPI as the layer for communication among the CPUs. I've discovered that this MPI doesn't play so well with srun; it might be a bug, or maybe I just don't understand how srun works, but in order to get the correct performance we're going to be using sbatch: sbatch and the file name, and somewhere in that file there's your mpirun command. The syntax for mpirun is -np for the number of processes, which can be any number. For the MPI thread configuration we have some suggested environment variables here; I don't know what all of them do, because I'm not actually an MPI expert. I got them from our Intel representative, who suggested them, so we're going to try them for this demo; it is not known to me whether there is a better combination for your specific use
case. One final new thing is the compute aggregation layer. This is a tool Intel released, a helper for when you are running applications that use both MPI and OpenCL; the LAMMPS GPU package we're using today is one such application. Naively, if you mpirun a bunch of processes, then every process can send OpenCL calls to any of the GPUs, and that can result in communication bottlenecks or other inefficiencies. So CAL, the compute aggregation layer, works as a middleman between your MPI processes and your OpenCL devices: it collects incoming OpenCL calls from the MPI processes and hands them out to the OpenCL devices more intelligently, to reduce the communication bottlenecks that could occur from the MPI processes not respecting each other very well, so to speak. Now, the utility of this of course depends entirely on how much work Intel puts into making sure it makes good choices; I'm not really an expert on that, I just know we're going to use it today because it is essentially recommended by Intel. So let's give it a try. To use calrun, you just prepend your mpirun with the calrun executable; that sets up an environment that intercepts those calls from the MPI processes and redistributes them to the GPUs intelligently. The calrun executable is provided with the LAMMPS module, so you do not need to load
any additional modules; it is already in that same installation area. In order to benchmark LAMMPS we have some test files that were provided by Intel; these are Intel's variants of common molecular-system input files for LAMMPS. I know what some of them are. For example, on the bottom left we have LJ, which stands for Lennard-Jones, basically the billiard-balls model: the atoms can bounce off each other when they are in close contact, and that's it. On the top right we have rhodo, which is short for rhodopsin, a protein solvated in a lipid bilayer with ions; it's just about as complicated a molecular system as you can design. So we run the full gamut here, from very simple to very complex, and you can try any of these out. The default one is LC; it's actually one I don't know much about, but we can mix and match for fun, so try any that you think are relevant to you. The test files are found in the HPRC root for LAMMPS, under apps/TEST, so
let's navigate to that directory and see what's there. Yep, those files are there. I'll actually return to the previous directory; I was in my home directory, so I guess that's not very helpful. Okay, so I have provided a demo Slurm batch file, which should be in the directory you already copied: earlier we made a copy of the ACES PVC course, and one of the three directories in it is named lammps, and in that lammps directory there's just this one file, lammps_demo.slurm. So let's take a look at what's in this file. I have my ACES file navigator open here and I have navigated to the ACES PVC course lammps directory, and in it you can see the lammps_demo.slurm file, so I'll just click the View button to open it in a tab so we can read it and go over what's in here. These top few lines are normal-looking Slurm parameters: things like we want to be on one node, we need some memory, we need some time. In this demo I've picked one GPU and 16 cores, so as your exercise you may choose to vary
these. Recall I said that the LAMMPS GPU package will distribute work among the GPUs and the CPUs, so as you vary these numbers you will see varying performance. Next, here we print out some fun facts about the node your job lands on, which I find handy. Here's where we load those modules: module load intel/2023.07, module load LAMMPS/3Aug2023-intel-2023.07. I create a variable for the output, because LAMMPS does produce a log, so we'll just have a log directory and create it to make sure it exists. Here are the MPI communication settings I previously told you about, the ones suggested by Intel; I don't have a reason to change these, so I've just put them here, but if you're an MPI expert you might choose to play with them and see what happens. Next we need to navigate to the directory where those input files are located in order to use them; note this cd happens inside the running job, not on your command line, where you'll simply be located in the folder containing the Slurm file. Then I've created a variable named IN, which I just set to one of the input files located in that TEST directory. The default I picked for you is LC. Perhaps we might be interested in what LC is; I don't know, it says it's biaxial ellipsoid mesogens in an isotropic phase. I know what the word phase means; the rest is a mystery to me. What were
we doing? We were reading this file. Okay, so let's read these LAMMPS arguments. If you're not familiar with molecular dynamics, I can give you a real quick rundown on what these options do. The first one is obvious: -in just says which input file we will read. -v sets a variable; the variable's name is capital N, and the Intel input files all have this variable defined, which turns on and off the Newton setting of molecular dynamics. In molecular dynamics, say you're computing the forces on 100 atoms: first you compute the force on atom number one due to atom number 99, and at a later time you compute the force on atom number 99 due to atom number one. If you're in serial, or in a shared-memory environment, you might notice those are actually the same magnitude, so if you record the magnitude the first time you calculate it, then the second time you need it you can just read it out of memory and save time. But we are not in a shared-memory environment; we are distributing work across multiple devices, including GPUs and CPUs, so reading the result of the force calculation from the previous time it was done isn't actually faster, it's slower. It is faster to simply recompute the force a second time. So we turn off Newton's third law here: we're not going to assume that the force on A due to B is the same as the force on B due to A, because
it's faster to just recalculate. Next, -pk is which package we're using to access the GPUs; we're using the GPU package, not the Kokkos package. If you were using the Kokkos package, this would be -pk kk for Kokkos. Then there's an additional argument when you use the GPU package, which is how many GPUs. Because we previously defined how many GPUs up here with the gres parameter, instead of writing that number down again I'm going to read it out of Slurm using the variable SLURM_GPUS_ON_NODE; that way, if you decide to change the number up here, LAMMPS will get the updated number of GPUs assigned to the job. Then we have the flag -sf gpu, which means we're going to be assigning force calculations to the GPU; I believe that's what it means: some amount of work will be assigned to the GPU, and the other calculations will be done on the CPUs. LAMMPS will output two different ways: either to a log file or to the screen. We don't need both, so we're turning off one of them, and we set the other to an output file in the output directory we previously made. And for the special case of rhodopsin there's an additional variable named d, which has to do with diffusion; we're just making sure it's set to zero, and I don't even know what it does, sorry about
that. So the last part of the job puts all of that together: calrun, then mpirun with the number of processes equal to the number of tasks we requested up here, so that every core gets assigned one MPI task, then lmp_oneapi, and then all those arguments we saved. That's what this file does.
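Putting the walkthrough together, here is a condensed sketch of what lammps_demo.slurm contains; the gres spelling, input-file name and log path are illustrative, and the Intel-suggested MPI environment variables are omitted because the talk did not enumerate them:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16            # CPU cores; the GPU package splits work between CPUs and GPU
#SBATCH --gres=gpu:pvc:1       # PVC GPUs, 1 to 4 per node (spec illustrative)

module load intel/2023.07 LAMMPS
mkdir -p "$SLURM_SUBMIT_DIR/logs"     # LAMMPS writes a log; keep them in one place
# (the Intel-suggested MPI tuning variables were exported here in the real file)

cd "$HPRC_ROOT_LAMMPS/apps/TEST"      # Intel benchmark inputs (path per the talk)
IN=in.intel.lc                        # default system: LC (file name illustrative)

# -v N off : turn off Newton's third law via the N variable in the Intel inputs
# -pk gpu <n> -sf gpu : GPU package with the GPU count read back from Slurm
# -screen none -log ... : log to a file instead of the screen
calrun mpirun -np "$SLURM_NTASKS" lmp_oneapi \
    -in "$IN" -v N off \
    -pk gpu "$SLURM_GPUS_ON_NODE" -sf gpu \
    -screen none -log "$SLURM_SUBMIT_DIR/logs/lammps_$SLURM_JOB_ID.log"
```

Submit it with sbatch --reservation=training lammps_demo.slurm if you are using the course reservation.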
Are there any questions about what's in this file, or would you like some time to edit it? Otherwise, the next thing we're going to do is submit the file to the batch system and see what it does. Because I did not put the reservation flag in the batch file, if you wish to use the reservation you will provide it on the command line instead: sbatch --reservation=training, then the name of the Slurm file. Question: how much faster is it with Newton's third law off? I don't know for every case whether it's much faster; that's one of those things where you would need to do the experiment to find out. It's off because, at least in theory, it's better off when you're using GPUs, but whether it's actually better you would have to try and find out. In my experience it depends on what kind of system: some systems don't have a lot of force calculations, for example Lennard-Jones, the billiard-balls model, where there are very few, so saving time on a handful of force calculations doesn't really matter. But then you have many-body models like EAM, the embedded-atom model, where there are a lot of force calculations, so the Newton setting matters more for those cases. And no, I don't know all of them; I'm not really a chemist, I just know
how to push the button that makes it go. How many GPUs per node, and are they all in use? We can double-check: pestat -p pvc -G. We have several nodes reserved today, and each one has four PVCs, so you can request from one to four PVCs for this experiment; the default in the file was one, but you can change that. Can LAMMPS use different GPUs on the same node for distinct MPI ranks? In theory yes; however, because we're using calrun as the integration layer between MPI and the GPUs, we are not really in control of that, and instead we're deferring to Intel's expertise on the best use of MPI with their GPUs. If you didn't use calrun, you could manually specify how your MPI ranks get assigned to GPUs. In the case of, for example, the Kokkos package, where all of the work is moved onto the GPUs, it is better to have just one MPI rank per GPU, so they're not competing with each other for access; for the GPU package, where work is divided among the GPUs and the CPUs, it's more complicated, and in general I don't know the answer. Someone is pinging me; oh, that's unrelated. What were we about to do? I
forgot already. Thank you, Carl, for your good questions; I'm sorry I don't know. Are the 16 ranks in parallel? Yes, mpirun launches the 16 processes together; we can explore that as we submit some jobs. Do we have time? I think we do, so we'll explore that, Carl, after we submit some jobs. I just need to go to my directory where I have these files. Does everyone have a directory like this, with this file? So I'm going to do sbatch --reservation=training lammps_demo.slurm; there is an underscore in the file name, it's just not showing on my screen for some reason, and I don't know why. That's weird. So I submitted batch job 29518, I have been assigned node ac026, and my job is running. Let me check the processes on ac026. Oh, looks like I'm too late; yep, the job is already done. If you want to catch it in action, we'll have to make it run longer. So let's take a look at the output. We get two outputs here: a lammps_demo file ending in the job ID number, which just has a handful of statements explaining what was run. I always like to print out these statements, because if I come back later and forget what I was doing, it can be very hard to figure out. So here we have a record that this job ran on ac026, we had 16 processes, and the log file
went here, so we could take a look at that log file. If you know how to read LAMMPS log files: this particular problem statement had 32,000 atoms, and the 16 MPI processes were divided into a 4x2x2 grid. Then, because there's a replicate statement, the system is replicated 16 times, up to about 500,000 atoms; that's standard practice for LAMMPS benchmarking. We can scroll down a bit and see that one GPU device was initialized; it happens to be device number zero. If we scroll down a bit more, you can see that the 16 MPI processes are all able to see device zero. Scrolling down some more, we can see the performance; oh sorry, this was the 10 warm-up steps. Scrolling further, for the 840 steps after the 10 warm-up steps, the performance was 41 million atom-timesteps per second, a standard unit of performance for LAMMPS benchmarks. We can also see that this run spent most of its time doing pair calculations and "other", which probably has to do with waiting for communication between the devices. The total wall time was only 19 seconds, which is why I didn't catch it live. We can try again and I'll try to catch it: sbatch, squeue, same node, there we go. So if I hit it with ps while it's running, you can see there are many, many copies of lmp_oneapi; that's because mpirun launches many copies of the executable, and it's the job of MPI and CAL to make sure all these processes are talking to each other correctly.
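A side note on the performance unit: "atom-timesteps per second" is just atoms times steps divided by loop time, so with illustrative round numbers close to this run (about 500,000 atoms, 840 steps, roughly 10 seconds of loop time) you get a figure in the same ballpark as the 41 million reported:

```shell
atoms=500000; steps=840; seconds=10   # illustrative numbers, not the exact log values
echo $(( atoms * steps / seconds ))   # prints 42000000, i.e. 42 million atom-timesteps/s
```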
I believe that means we have enough time to try some experiments. So, is there another combination you would like to try? Perhaps a different resource count, more GPUs; perhaps a different molecular system, one besides LC. I don't have a specific exercise, except that we can make some changes to this demo file and see how the result differs. Is there an optimization command we could test? What were the choices again? Let's review them. Oops. If we look here at our lammps_demo.slurm, suppose I go to edit the file: the first few parameters request resources, so one option is to change what we request. We could request a different number of PVC GPUs, any number from one to four; I don't have a multi-node example set up, so it will have to be a single node. We could also change the number of tasks; I think definitely up to 64 is fine. I don't remember how many cores are on the node, is it 96 per node, does anybody know? But I've tested up to 64 and it worked fine. And the other option is to pick a different input file: here we picked LC, but we could pick one of the others instead: airebo, dpd, eam, lj, rhodo, snap, sw, tersoff, water. Can you specify which GPU to use, instead
of letting LAMMPS pick automatically? Because the GPUs are allocated by Slurm, it's not a good idea to try to manually pick a GPU: if Slurm gives you a different GPU than the one you hardcoded into your file, your process won't be able to launch at all. Generally we let LAMMPS pick the GPUs. But to illustrate what I mean, why don't we just change this number from one to something else; for example, from one to three. If I click Save here and launch that, it's going to try to use three GPUs, and we can see how the output differs. Which choice has the most atoms? I don't know; if I had to guess, I would say they all have about the same number. The standard practice is to pick a system and then replicate it up to a specific size: for crystals that have only one atom per unit cell you replicate it many times, but for something like rhodopsin you don't replicate it at all; instead you add water molecules until you reach the target atom count. That's standard practice in the LAMMPS community, but I don't know for a fact how many atoms are in each of these files;
I didn't explore all of them. So if we look in the logs, we can see there's now a log file for the job ending in 23. Exploring it, we can see what's different: same number of atoms, but now it says it found three devices, numbered 0, 1 and 2. It did not find device number 3, because Slurm arbitrarily picked devices 0, 1 and 2 for this job. Next we can see that it initialized devices 0 to 2 on core zero; that means all three of the GPUs are visible to core zero, and all three are visible to core one, and so on. If this were all there was to the story, you would have a huge communication bottleneck as every core tries to talk to every GPU, so hopefully something stops that from crashing your system, and I believe that something is calrun: calrun intercepts all the OpenCL calls from the cores and redistributes them among the GPUs intelligently, so we don't have that bottleneck. We can see that the performance increased from 41 to 49 million atom-timesteps per second, so with three times as many GPUs it took a little less time. The communication time increased dramatically: it used to be mostly pair and "other", and now the communication and modify times are more significant. What that's telling me is that the additional work done by the extra GPUs was partly negated by the additional communication needed to get those GPUs to work with each other. These are the kinds of things you need to be aware of when you have many choices of ways to divide work among devices. The wall time was overall less, but not by a lot. With nodes=1, are you limited to ntasks equal to the number of cores? ntasks is just one way to ask Slurm for cores; you can also ask for CPUs per task, or for all the cores on the machine; ntasks is just my favorite way to request cores.
There is more than one way in Slurm to specify the resources you need, and usually they're equivalent. LAMMPS, for example, doesn't care about the distinction between a task, a process or a thread; it will just launch processes and assign them to cores as best it can. There are other applications where the distinction between a task and a thread actually matters, but not LAMMPS specifically. What if you request ntasks equal to the number of cores plus one? I don't know what happens if you request contradictory resource requirements from Slurm; I'm guessing it has some kind of internal tiebreaker and just ignores the parameter it doesn't understand. Also, you have to be careful, because some Slurm parameters describe the node you want to land on and some describe the resources you need: you could say, I want to be on a node that has 96 cores but I only want to use two of them; that's something you can do in Slurm. So you can't just assume that all of the parameters get you what you need; that's why I try to stick with just ntasks, because I understand what it does, so I don't have to worry about Slurm jobs that don't work properly. Could we increase the nodes or tasks and see the performance boost? You and Shen would like to know what happens if we vary the number of tasks. Would you like to suggest a different number than 16, perhaps double to 32? Shall we double to 32 and see what happens? Do you want me to put the number of GPUs back to one, or leave it at three? GPUs back to one, okay, so
I clicked Save, and the button changes to a darker color after it's saved, and then we can submit that demo again. It is running on ac026; all my jobs are landing on ac026 today. Okay, it is done, so now we can investigate the log for the job ending in 25 and see how it's different. You can see that it initialized more processes, and the number of atom-timesteps per second actually decreased: it used to be 41 million and now it's 31. There are a couple of possible explanations. One is that there weren't enough atoms in this benchmark to keep 32 cores busy: if you initialize more cores but don't give them any work to do, you're essentially just wasting time initializing cores, so in order to put 32 cores to work maybe we needed more atoms. Another possible explanation is that maybe the GPU is getting hot because I used it repeatedly; you can see I keep landing on device zero on node ac026, so it's possible the node is just too warm now, but I haven't really gone into detail about measuring the temperatures. Carl asks: could you elaborate more on why the comms overhead
is so much greater with three versus one GPUs? I'm not a LAMMPS expert, I just know how to push the button to make it go, but my experience with LAMMPS is that you want to give every device that's going to do calculations enough atoms to keep it busy calculating for a while, because communication between devices tends to be the slowest part of a simulation. If you give a CPU or a GPU like 10 atoms and then immediately say, hey, give me the results back, you're spending too much time communicating and you're going to bottleneck your simulation on communication; you want to give each device like a million atoms so they can all do their thing for a while, and that dilutes the effect of communication as the bottleneck of your system. Are the GPUs loaded sequentially? I do not know. I do know that the atoms are divided among the GPUs according to an approximately spatial partition: LAMMPS divides the space of your simulation into chunks and assigns a bunch of neighboring atoms to the same device, so if atoms need to talk to atoms in a different partition of the simulation, then the two GPUs in charge of those atoms need to communicate with each other. Communication between GPUs is not as fast as a GPU doing its own work, so the more you partition your system, the greater the surface area between the partitions and the more communication has to occur. So with the same number of atoms, adding more GPUs increases the average amount of communication that needs to occur, and if the GPUs aren't doing enough work to make that added communication worth it, you don't see any performance benefit. That's why it's important to know the size of your system: so you know each device is allocated an adequate number of atoms to keep it busy, and you don't have to think as much about the communication bottleneck. That's not specific to Intel or GPUs; it's really just a LAMMPS thing, and in other simulation frameworks there are other considerations. Thank you for your good questions; you've pretty much reached the end of my knowledge
about LAMMPS. I'm not a chemist, so if you have more questions about LAMMPS, it might be better to go to the LAMMPS community and ask them what they think. With no more questions, that brings us to the end of our course. I would like to share our acknowledgement slide: this work was supported by the National Science Foundation. Here are some awards: obviously the ACES award, which paid for the accelerator testbed we've been using, but also some other NSF awards. We have the SWEETER award, SouthWest Expertise in Expanding Training, Education and Research; SWEETER pays for some of the development work for these educational materials, and things like user support and travel to group meetings to collaborate on these user trainings when they're in person; SWEETER does those kinds of things within Texas and a few neighboring states specifically. And then we have the NSF award for FASTER, which is, let's say, our flagship production cluster that focuses on composable nodes, and a lot of what we know about running code like AI and LAMMPS we learned while doing those tasks on the FASTER cluster. So when I say things like, in my experience you need to assign enough atoms to each device to keep it busy, that experience comes from the time I spent working with the FASTER cluster, which means that award did in some way contribute to this training. We have an Intel representative, Doney Aruki from Intel, who provided us with the software build capabilities for the AI and LAMMPS technologies we explored today. And I would also like to thank the staff and students at Texas A&M HPRC, without whom training sessions like this wouldn't be possible; it takes a lot of people to keep a cluster running and make sure everybody has access to it in time to do your exercises. I think that's the last slide.
No, it's not the last slide. Here's our HPRC help email, help@hprc.tamu.edu; that's how you contact us if you have questions. If your job's not running, please do tell us the basic information, like what cluster you're running on and your username, and we would like to help you solve your problems. Okay, that really is the last slide in the slide deck. I guess I should ask my colleague Zhenhua if he wants to add anything else before we dismiss the class. No, thank you, Richard. I really want to thank everybody who attended our short course, Introducing Intel PVC GPUs. We also welcome collaborations on things like benchmarking our accelerators, such as the Intel PVCs; we have various types of accelerators, and if you want to speed up your science with any type of accelerator available on the ACES cluster, you're welcome to contact us; we are here to help you, and it helps us as well. Thank you for joining today's short course; thank you very much.