ACES: Introducing Intel PVC GPUs

Texas A&M HPRC

Okay, let's get started. Hello everyone, and welcome to "ACES: Introducing Intel PVC GPUs." Before we start I want to make sure everyone can hear me. Tanishq, can you confirm that you can hear me and see my screen? Yes, I can see your screen and I can also hear you. Okay, thank you.

My name is Zhenhua He and I am the instructor for today's short course. Akhil and Druva also worked on the course materials and prepared the presentation slides, and Richard Lawrence will be the instructor for today's LAMMPS molecular dynamics section; he will demo how to use LAMMPS on our ACES computing cluster with the Intel PVC GPUs.

Here is the outline for today's short course. First, we will introduce the Intel PVCs on ACES: a brief look at the PVC architecture and at the ACES computing platform at Texas A&M High Performance Research Computing. Second, we will demonstrate how to run models from different frameworks, such as PyTorch and TensorFlow, with the PVC GPUs on the ACES system. We will use the Intel AI models and run, for example, ResNet-50 on ACES with the PVC GPUs; afterwards you can open the output file and see the performance of the GPUs, for example the image throughput. The third part is about how to convert a PyTorch model to run on PVC: we will show you the conversion steps, and then you will do some hands-on exercises and run your code on a PVC GPU; we have prepared hands-on exercises for you.
Similarly, we will show you how to do that for TensorFlow as well. The last section is LAMMPS on PVC: Richard Lawrence will demonstrate how to run molecular dynamics simulations in the LAMMPS framework with the PVC GPUs on the ACES system.

So the first lab is about introducing Intel PVC GPUs on ACES. Here is a picture of the Intel Data Center GPU Max Series PCIe card; there are different GPUs in the series, such as the 1100 and the 1550, and on our ACES computing platform we have the Intel Max 1100 GPU.

First I would like to introduce our ACES computing cluster. ACES stands for Accelerating Computing for Emerging Sciences, and the project is funded by NSF. The first part of its mission is to offer an accelerator testbed for numerical simulations and AI/ML workloads. On the ACES cluster we have various types of accelerators, such as Intel PVC GPUs, Graphcore IPUs (IPU stands for Intelligence Processing Unit), NVIDIA GPUs, and FPGAs, for example Intel FPGAs and Bittware FPGAs. We welcome researchers from different fields to use these accelerators to speed up their workflows. The second part of the mission is to provide consulting, technical guidance, and training to researchers: we have been hosting workshops, tutorials, and short courses like this one, so researchers can learn through hands-on exercises how to use the novel accelerators on our ACES computing platform.
The third part of the mission is to collaborate on computation- and data-enabled research. We have been collaborating with different HPC centers, such as the San Diego Supercomputing Center, the Texas Tech Advanced Computing Center, and the University of Florida, on benchmarking projects, and we have published some papers on our results.

Here is a table of the accelerators on the ACES platform, where you can check the quantities of each. We have 32 Graphcore IPUs: 16 of them are Colossus IPUs and 16 are Bow IPUs (Bow is a slightly more advanced IPU than Colossus, so we have both). We also have Intel FPGAs, Bittware FPGAs, NextSilicon coprocessors, and NEC Vector Engines. We also have Intel Optane SSDs: if you have an application that requires a large amount of memory, these Optane SSDs can be addressed as memory with the MemVerge Memory Machine, and we have 48 of them, so if you have applications that need large memory we would like you to test them. We also have NVIDIA H100 GPUs (we have 30 of these) as well as NVIDIA A30 GPUs. Today we are using the Intel PVC GPUs; this is the software development platform called Intel Arctic Sound (ATS-P), and we have 22 Intel PVC GPUs.

This is the Intel Max GPU 1100 on our ACES cluster. It has one tile per stack per card, 56 Xe HPC cores, and 448 execution units, so you can calculate that it has eight execution units per core. The TDP (thermal design power) is 300 watts, it is a PCIe Gen 5 x16 card, and it has 48 GB of HBM2e memory with a memory bandwidth of 1.2 terabytes per second. The peak performance is 22 teraflops at FP64 precision. Those are some of the specs for this GPU; if you are interested in knowing more, there is more information on the Intel website about the Intel Max GPU series, including the 1550, so you can check that.
Next, we would like to talk about the Intel oneAPI toolkits. This is a collection of development tools from Intel that lets users develop high-performance applications across multiple architectures, such as Intel CPUs, GPUs, and FPGAs. There are also add-on domain-specific toolkits, for example the Intel oneAPI tools for HPC, the Intel oneAPI tools for IoT, and the Intel oneAPI rendering toolkit. We will not use those domain-specific toolkits today, but if you are interested we encourage you to look at the Intel documentation so you can make use of them. Today we are going to use the Intel AI Analytics Toolkit, and we will run some machine learning models with it.

To make it easier for our users to use the ACES cluster and its accelerators, we have created some shared directories on ACES. For example, we have saved preprocessed ImageNet image datasets for PyTorch and TensorFlow, so you can use them directly with your model: if you want to train on the ImageNet dataset you can use these copies, and for TensorFlow we also have the ImageNet dataset in TFRecord format. For the models, we have downloaded the Intel AI models so you can try out the different models, and we also provide containers in the shared directory: the Intel Deep Learning container has been converted to a Singularity image, so you can use it on our computing cluster if you are interested.

Next I would like to introduce some of the resources we provide. The first one is our web page, Texas A&M High Performance Research Computing; if you click on this link you will get to it. I am now at the TAMU HPRC web page. Can you still see my web browser? Give me a thumbs up if you can see my Google Chrome browser. Okay, thank you, Scott.
You can see we have a lot of tabs here. Today we are going to use the tab for accessing ACES through the ACCESS portal, and you can also check the status of our computing clusters: Faster and Grace are under maintenance today, so you can see all their numbers are zero. The next item is the ACES quick start guide; if you click on it you get the quick start guide, and this is our knowledge base for the different computing clusters (we also have FASTER, Grace, and Terra). If you click on the user guide, we have the ACES user guide, where you can learn about the different accelerators, for example the Intel PVC GPUs. It has an introduction to the Intel PVC GPUs on our ACES cluster: we have six Slurm PVC nodes, each with four Intel PVC GPUs, and it explains how to access these GPUs interactively or through Slurm, and how to monitor PVC GPU utilization with commands such as sysmon and xpumcli. I will introduce these in my presentation and demos as well.

Next, we will be using the ACES portal to access the ACES cluster. I think most of us are ACCESS users; here is the link to the ACCESS documentation that tells you how to apply for an ACCESS ID. And here is the HPRC YouTube channel, where we have posted 94 videos about our short courses, for example Introduction to Containers (Charliecloud), the hierarchical module system, Introduction to Julia, and many others.
If you are interested in knowing more about TAMU HPRC and our short courses, you can subscribe to our YouTube channel; you are very welcome to watch our videos and learn from our resources. This is our helpdesk email, and you can send us an email if you have any questions.

Next is how to access the ACES cluster. We will be using the ACES portal, which is the web-based user interface for the ACES cluster. It is based on Open OnDemand, an advanced web-based graphical interface framework for HPC users. This is the Open OnDemand portal. You then authenticate via CILogon: you choose ACCESS CI (XSEDE) as the identity provider, provide your ACCESS username and password, and click log in. After that you will need to complete the Duo two-factor authentication; for example, I have to approve it on my phone. That is the process for getting onto the ACES cluster. Then you click on the Clusters tab and choose ACES Shell Access to get a shell, and you will see the login node name in the prompt.

Let's do it together. I will show it now; please follow along, so you will be able to work on the hands-on exercises and run the demos with the Intel PVCs. Hover your mouse over the Portal tab and click on ACES Portal (ACCESS); this will redirect you to the login page.
We select the identity provider ACCESS CI (XSEDE) and click Log On. I have saved my ACCESS username and password; you may need to type yours. After that I click Log In, and here I need to complete the Duo two-factor authentication, which I approve on my phone. Now we are on the ACES OnDemand portal, and you can see the different tabs here. Before I continue, I would like to know whether everyone is on the ACES OnDemand portal now. Elizabeth, did you encounter any error, or do you need more time? I needed more time; I have to get to this site, and I am using another system so that I can look at it. I want to know which website to use on my end; can you repeat? Is it Texas A&M High Performance Research Computing? We use the ACES portal. Once you click on the ACES portal, as I did before, you choose ACCESS CI (XSEDE), click Log On, provide your ACCESS username and password, go through the Duo two-factor authentication, and click Log In; it should bring you to the ACES OnDemand portal.

Now I would like to introduce the features of the ACES OnDemand portal. For example, if you click on the Files tab, you can see my home directory and my scratch directory; if I click on one of them, you can see the files I have (we will be working with some of these). I would like to show you how to use the file editor on the portal.
When you see a file here, you can open it with Edit; we will do this a little later, because I think the file editor will make it easier for you to edit the files. If you prefer vim, of course you can use vim as well. The next tab is Jobs, where we have Active Jobs and the Job Composer: when you submit a job to the ACES cluster you can check its status there, whether it is completed, pending, or running. Under Clusters we can start an ACES shell. We also have different Interactive Apps: the VNC, a NextSilicon VNC, Jupyter Notebook and JupyterLab, RStudio, and TensorBoard. We also have Affinity Groups (you are welcome to look at our ACES group and our HPC group) and a Dashboard, which shows things like your disk usage and limits; I think it is under maintenance, so maybe it is not showing the numbers right now, but you can try out these dashboards.

After that, please click on Clusters and then on ACES Shell Access, and you will land on a login node of the ACES cluster; I am on login node one, and I think we have several login nodes. It is the Clusters drop-down menu, yes, that's the one, and you should click on the entry for ACES Shell Access.
So now I am on login node one, and we are all on a login node of the ACES cluster; welcome to the ACES login node. From the login node we can check the status of the PVC Slurm nodes and the number of GPUs. You can use these commands; if you have the course presentation slides you can copy and paste them directly into the terminal. For example, pestat -p pvc -G: the -p option means partition and -G shows the number of GPUs. You can see we have six PVC Slurm nodes in the pvc partition, each node has four PVC GPUs, and their state is "reserved"; they have been reserved for this short course, so today we are going to use these PVC GPUs.

Next we need to copy the training materials to your personal directory. First run cd $SCRATCH to navigate to your personal scratch directory. We have saved the files for this short course at /scratch/training/aces_pvc_course, so you can use cp -r /scratch/training/aces_pvc_course $SCRATCH to copy the training materials into your personal scratch directory, and then cd into your local copy. I have already downloaded this into my scratch directory, so I will give you about two minutes to copy the training materials.
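Putting those steps together, the copy looks roughly like this (I am writing the shared path with a leading slash, which is my reading of the spoken path):

    cd $SCRATCH                                    # go to your personal scratch directory
    cp -r /scratch/training/aces_pvc_course $SCRATCH   # copy the course materials
    cd $SCRATCH/aces_pvc_course                    # enter your local copy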
The second lab is about using the PVCs on the ACES cluster. These are mostly demos, but you are encouraged to follow along and run the models, PyTorch and TensorFlow models, with the Intel PVC GPUs; after running the models you will get an output file that shows the performance of the Intel PVC GPUs.

First we need to set up the environment. We will start with a PyTorch model, so we set up the environment for PyTorch models. We have actually prepared a job file in which the environment setup is already done, but let's go through it line by line so you know how it was set up. First we run module purge to clean your environment, then module load intel/AIKit; we use version 2023.2.0, which I believe is the latest one installed (I need to check), and the Intel AI Toolkit is installed as a module on ACES, so you can simply module load it and select a version (I think we have another version as well). Then we module load the Intel toolchain, version 2023.07. Next we set an environment variable with the environment name; for this demo we are using PyTorch, so we name the cloned environment aikit-pt-gpu-clone. The job file checks whether that environment already exists, and if it does not, it clones it from the AI Toolkit with conda create -n $ENV_NAME --clone aikit-pt-gpu; this gives you your own copy of the PyTorch environment from the AI Toolkit. Then we can source activate the environment. All of this is in the pt_demo.slurm job file; if you have copied the training materials to your personal scratch directory, it is in the pytorch folder, and we will go there and look at it together.
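As a quick sketch, the environment-setup portion does something along these lines (the exact module names and the existence check are my reconstruction from the description above, not a copy of pt_demo.slurm):

    module purge                                   # clean your environment
    module load intel/AIKit/2023.2.0               # Intel AI Analytics Toolkit module
    module load intel/2023.07                      # Intel toolchain
    ENV_NAME=aikit-pt-gpu-clone                    # name for your personal clone
    # clone the AI Kit PyTorch GPU environment only if the clone does not exist yet
    conda env list | grep -q "$ENV_NAME" || conda create -y -n "$ENV_NAME" --clone aikit-pt-gpu
    source activate "$ENV_NAME"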
Back in the terminal: if you cd into the pytorch directory and list the files, you can see the job file. Can everyone open the file? You can open it in the file editor instead of vim; vim is also fine, of course, and if you want to use the portal's file editor, that works too. From the Files tab, click on your scratch directory and your username; I believe you have copied aces_pvc_course there, and it has three subfolders, one of them being pytorch. If you click on pytorch you will see some files. Ignore the output files, because I ran this job this morning, but I want you to open the pt_demo.slurm job file: click the button next to it, choose Edit, and you will be able to open it from there.

The first part is the job specification, the resources you are requesting: you will see the job name, a requested wall time of one hour, one task with eight CPUs per task, one node, 100 gigabytes of memory, and an output file named pt_demo followed by your job ID. With --gres we request a GPU (we want one PVC GPU) from the pvc partition, and today, as I said, we will use the reservation called training. Those are the resources we request. The next part is the environment setup I just explained; then the script activates the environment, changes into the ResNet-50 training folder of the Intel AI models, and runs the training script prepared by Intel, so we can simply run it and see the performance of the Intel PVC GPUs.
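For reference, the resource-request part of the job file looks roughly like this; the exact directive spellings (for example the gres string and the output-file pattern) are my reconstruction of what I just described, so check the real pt_demo.slurm for the precise values:

    #!/bin/bash
    #SBATCH --job-name=pt_demo            # job name
    #SBATCH --time=01:00:00               # one hour of wall time
    #SBATCH --nodes=1                     # one node
    #SBATCH --ntasks=1                    # one task
    #SBATCH --cpus-per-task=8             # eight CPUs for that task
    #SBATCH --mem=100G                    # 100 GB of memory
    #SBATCH --output=pt_demo.%j           # output file named after the job ID
    #SBATCH --gres=gpu:pvc:1              # one Intel PVC GPU
    #SBATCH --partition=pvc               # the PVC partition
    #SBATCH --reservation=training        # today's short-course reservation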
Okay, I will come back here. Do you have any questions so far? Can you see the file in the file editor on the Open OnDemand portal?

Next we will submit a job to the ACES cluster: run sbatch pt_demo.slurm to submit the job. You can also run pestat -p pvc -G; I will wrap it in watch -n 5 so it refreshes every five seconds. We can see that some students are on the PVCs now, which is great. I think my job is already done; I can check with squeue -u u.zh (that is my NetID), and you can see I have no jobs running, so my job has finished. The job ID was 29466, so we can go back to the file editor; I will refresh the page and open the output file. It tells you that you are training ResNet-50 with bfloat16 precision, and because I did not specify a dataset directory it is not training on the real ImageNet dataset; of course, if you want to train the model on real data you can specify the path to the real dataset. The ResNet-50 training run here is very small, because I only let it run the first 20 iterations, so it finishes quickly; you can let it run for more epochs if you care about getting a well-trained model, but for demo purposes I just ran about 20 iterations. You can also see the training performance: we used a batch size of 256, and the throughput is 1230 images per second. That is the output file. If you want to run longer, you can do that after class: in the job file you can set the path to the dataset, specify the precision you want to run the model with, and set the batch size.
I just used the default parameters here, since this is only a demo. It's great that some students are using the Intel PVC GPUs; that's a good start.

Let me go back to my presentation slides. I know some researchers prefer to use a Python virtual environment instead; this is included just as an alternative, for reference. Please do not type these commands into your terminal now, because we will not go through the Python virtual environment today, but you can try it after class. I have already tried all of these commands, so they should work. One thing I would like to point out: if you use the AI Toolkit, torch and the Intel Extension for PyTorch are already installed for you, so you do not have to worry about any of this. But if you want to install them yourself, you have to pay attention to the versions. In this example torch is 1.13, so you have to use the matching 1.13 version of the Intel Extension for PyTorch. If you want to do distributed training, you also need to install another library, the Intel oneAPI Collective Communications Library (oneCCL) bindings for PyTorch, and again you need a version that matches the other packages. These are all available on the Intel website, where you can find the matching versions of the different packages.
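Just as an illustration of the version matching (the exact pins, wheel names, and any extra index URL should be taken from Intel's documentation; these numbers are only meant to show that the three packages have to agree):

    python -m venv pvc-env && source pvc-env/bin/activate
    pip install torch==1.13.1                            # base PyTorch
    pip install intel_extension_for_pytorch==1.13.*      # must match the torch version
    pip install oneccl_bind_pt==1.13.*                   # oneCCL bindings, needed for distributed training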
You can try the Python virtual environment on your own, but today we don't have time to work on it. So far we have changed to the pytorch directory and submitted the job file; that's great. Next is TensorFlow, but it is 51 minutes past the hour, so let's take a short break of about seven minutes and come back at 11, and we will continue with the TensorFlow models.

Hello everyone, welcome back. We will continue with the TensorFlow ResNet-50 model demo. First things first, we also need to set up an environment for TensorFlow models, and similarly we use the Intel AI Kit and clone the environment. Why do we clone the environment? Because it gives you the freedom to install other packages into the environment for your own project; for example, you might need tqdm or TensorBoard, which are not included in aikit-tf-gpu, so once you clone the environment and source activate it, you can install whatever modules you need for your own project. For today's short course we don't need much more, but it is useful for the future. This is all in tf_demo.slurm, so we can change to the tensorflow directory and sbatch tf_demo.slurm.
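The TensorFlow setup follows the same pattern as the PyTorch one; here is a sketch, assuming the clone is named aikit-tf-gpu-clone (the course only names the base environment, aikit-tf-gpu):

    module purge
    module load intel/AIKit/2023.2.0 intel/2023.07
    ENV_NAME=aikit-tf-gpu-clone
    conda env list | grep -q "$ENV_NAME" || conda create -y -n "$ENV_NAME" --clone aikit-tf-gpu
    source activate "$ENV_NAME"
    pip install tqdm tensorboard          # optional extras that are not in the base environment
    cd $SCRATCH/aces_pvc_course/tensorflow
    sbatch tf_demo.slurm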
Let's do it: my terminal is still active, so I change into the tensorflow directory, where you can see tf_demo.slurm, and I run sbatch tf_demo.slurm. You can see the job ID (29487 is mine), and I can watch it with watch squeue --me. Yes, Richard, it's working; thank you. You can see my job is running on the ac026 PVC node in Slurm, so it's working. We can also watch pestat -p pvc -G, and we have some users on these PVCs now. This TensorFlow demo takes a little longer, about five minutes, so if you have any questions you can ask now. We have a user on ac086; please give a thumbs down if you are having problems running the jobs.

While it is running, I would like to go back to my slides and see what is next. The next topic is converting your own PyTorch code to run on PVC, but first I still want to see how many students are on the PVCs now; we have a few more. Okay, while the job is running I will continue with the third lab, PyTorch on PVC.
If you have a PyTorch model, PyTorch code, how can you convert it to make it work on the Intel PVC GPUs? First we would like to introduce the Intel Extension for PyTorch (ipex). It is a Python package that extends PyTorch models to run on Intel platforms such as the Intel PVC GPUs. The first step is to add the following import statement at the beginning of your script: import intel_extension_for_pytorch as ipex. That is the first thing we have to do to make the code work on Intel PVC GPUs; very easy, right? The second step is to move your model and criterion to the XPU, which is also easy: once you have created your model and criterion, you just call model.to("xpu") and criterion.to("xpu"). The third step is to apply the ipex.optimize function to the model and optimizer objects: ipex.optimize takes your model and your optimizer, and here we use the bfloat16 data type; it also supports fp32, which you can try after class. We choose bfloat16 because in the fifth step we will use bfloat16 as well, so they have to be consistent, and because bfloat16 gives better throughput than fp32. The fourth step is to move the data and targets to the XPU, very much like in other GPU code: in the training loop, when you unpack your train data loader, you call data.to("xpu") and target.to("xpu"). These are very simple changes. As I said, in the fifth step we use automatic mixed precision with the bfloat16 data type: we use the context manager torch.xpu.amp.autocast, which accepts an enabled argument and a data type, and the data type must be consistent with the previous step, torch.bfloat16. Those are all the steps: with just a few lines of change, your code will be able to run on the Intel PVCs.
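Here is a minimal, self-contained sketch of those five steps; the tiny model and the random batch are only stand-ins for your own network and data loader:

    import torch
    import torch.nn as nn
    import intel_extension_for_pytorch as ipex                  # step 1: import the extension

    # stand-in model and data, just so the sketch runs on its own
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to("xpu")   # step 2: model to XPU
    criterion = nn.CrossEntropyLoss().to("xpu")                                  # step 2: criterion to XPU
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # step 3: optimize the model and optimizer for bfloat16 (fp32 is also supported)
    model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

    data = torch.randn(256, 3, 32, 32).to("xpu")                # step 4: data to XPU
    target = torch.randint(0, 10, (256,)).to("xpu")             # step 4: target to XPU

    optimizer.zero_grad()
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):   # step 5: auto mixed precision
        loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()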
I have prepared an exercise file called cifar10_pvc_todo.py, and I have listed these steps as TODOs in the file. You can use the file editor on the Open OnDemand portal to make the changes, read through the code, and familiarize yourself with how to convert PyTorch code into PVC code. After completing all the TODOs in cifar10_pvc_todo.py, you can modify the Slurm file and submit the job; you just have to change the python command so that the file name is cifar10_pvc_todo.py.

Let me show you. I go back to my pytorch folder and into the exercises folder; you can see the TODO file, cifar10_pvc_todo.py, there. Of course you can use our file editor: in the Files tab, go to the pytorch directory, click on the exercises folder, click on the TODO file, and choose Edit. You can see step one: import the Intel Extension for PyTorch package and give it the alias ipex, "write your code below", and there is a FIXME that you have to replace with the correct code as shown in the slides. That is the first step. After that, read through the code; in the second step you move the model and the criterion to the XPU, model = model.to("xpu") and likewise for the criterion, again replacing the FIXME with the correct code. Those are the steps for changing it into PVC code. If I go back, you can see we also have a solution file, but I encourage you not to look at the solution yet; complete the exercise yourself first and then submit it to the cluster. There is also another file called cifar10_cpu, which does not have the model-to-XPU changes (I simply commented out all those lines), so that PyTorch model runs on the CPU. It takes a long time, several hours, so I don't encourage you to run it during class; if you are interested you can run it afterwards. On the PVC it takes just a few minutes, and we want you to feel the speedup from using the Intel PVC GPUs.

I will go back to this slide and give you about 10 to 15 minutes to complete the TODO steps in cifar10_pvc_todo.py; after that you can submit the job. I have prepared the solution. I always want to use vim, but I also want to advertise our file editor, so here is the solution in the editor, and you can see the code changes: you import the Intel Extension for PyTorch, move the model and the criterion to the XPU, call ipex.optimize with the model, the optimizer, and the data type, move the data and targets to the XPU in the training loop, and wrap the training step with the torch.xpu.amp.autocast context manager. It is just a few steps, and you are welcome to try it on your own code.
Okay, let's continue with TensorFlow on PVC. TensorFlow also has a corresponding package, the Intel Extension for TensorFlow, just as we have the Intel Extension for PyTorch for PyTorch. It enables your TensorFlow model to run on an XPU device such as a GPU or CPU. You can check the version with the usual double-underscore __version__ attribute and print it. The nice thing is that if you install the Intel Extension for TensorFlow in your environment, the default device becomes the Intel GPU: if you have an Intel PVC and this package is installed, your model will automatically use the Intel PVC GPU and you do not have to make any code changes. That is great news for TensorFlow users: you just install the Intel Extension for TensorFlow into your environment, make no code changes, and it will pick up the Intel PVC GPU and run your model on it.

So here is a very simple exercise. You do not have to change any Python code; you only need to modify the Slurm job file tf_cifar10_pvc.slurm and submit your job, and you will see it runs on the PVC. This is a screenshot of the job file; you can uncomment these two lines to let it run on the PVCs.
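A minimal sketch of what that looks like in Python (the tiny Keras model is just an illustration; the key point is that importing the extension is enough for TensorFlow to default to the Intel GPU):

    import tensorflow as tf
    import intel_extension_for_tensorflow as itex      # installing/importing this is the only change

    print("itex version:", itex.__version__)            # check the extension version
    print(tf.config.list_physical_devices("XPU"))       # the PVC should show up as an XPU device

    # an unchanged Keras model now runs on the PVC GPU by default
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")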
Let me go there and try it. The file to change is in the tensorflow directory, in exercises, and we have the Slurm file there. Please uncomment these two lines (yes, here) and then submit the file. I have job ID 29502, and I'll watch squeue --me. It is pending: "requested node not available". Let me see what happened. It's because all the nodes have been reserved, so if you do not specify the reservation, the job cannot find a node to use. Let's add the sbatch directive #SBATCH --reservation=training to the file and try again. Now I have a new job, and because I specified the reservation it can find a node; it is running on ac086. It's just one line. You can also check pestat -p pvc -G; we have some users on these PVCs, which is great, and I am on ac086, seemingly alone on that node, so I need somebody to join me there.

Okay, this one is very easy, but I would like to actually look at the file. You can see that we import the Intel Extension for TensorFlow, but the code does not actually call itex anywhere. So I think that if you have the Intel Extension for TensorFlow installed, you don't even have to put that import in your code. Let me try; I'm curious whether, if I remove these two lines, it will still land on the PVC node or give me an error. Let me sbatch it again: job 29507, and watch squeue --me shows it is also running on ac086. How can I check whether it is really using the PVC GPUs? We want you to follow good practice: when you run something like this, you can use the VNC to check the PVC usage. I will introduce that a little later, but we can see that the job does land on the PVC node.
Do you have any questions so far about TensorFlow? TensorFlow is the easy one, right; you don't have to make any code changes. Okay, I will move on. We have some PVC monitoring tools. I already introduced pestat -p pvc -G to show the number of GPUs; you can use it to check the availability of the PVC GPUs. We can also monitor PVC system activity with sysmon, which tells you the memory in use on the PVC and other information. Another option is the Intel XPU Manager command-line tool, xpumcli: you can run xpumcli stats -d with a device index, for example 0, and it will show the memory used and the GPU utilization. Note that if you run sysmon right now it will give an error, because you are on the login node, not on a PVC node; that is expected, and you can only use sysmon once you are on a PVC node.
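To summarize the monitoring commands mentioned above:

    pestat -p pvc -G          # from a login node: PVC nodes, their state, and GPU counts
    watch sysmon              # from a PVC node: memory in use on the PVCs and attached processes
    xpumcli stats -d 0        # from a PVC node: memory used and GPU utilization for device 0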
Here is how we can land directly on a PVC node: we can use the VNC. I introduced the Interactive Apps earlier; click on VNC, choose the node type, for example Intel GPU Max, and the number of GPUs. Here I choose one, but if you do distributed training you can choose more, for example two, three, or four. Then set the number of hours, the number of cores, and the total memory. After a short wait you will see a blue button labeled Launch VNC, and clicking it brings you into the VNC session. In the VNC session we can run some jobs (I don't know why I covered my username in the screenshot, but that's okay). Here I change directory to the PyTorch exercises folder, module load the Intel AI Toolkit and the Intel toolchain, and activate our cloned environment, aikit-pt-gpu-clone. On this node I run the cifar10_pvc solution file, and after that I use watch sysmon to see the memory usage; it shows that your process is attached to this GPU. This is good practice for checking the PVC memory usage. If you want to check the power and temperature of the GPU, you probably want to run xpumcli stats -d with the device index instead, but for today I will just show this. I tested this yesterday, and you can also use our VNC sessions to check the PVC usage.
That is all for the AI/ML part in PyTorch and TensorFlow. Next I would like to invite my colleague Richard Lawrence to demonstrate, and maybe give some exercises on, running LAMMPS, a molecular dynamics simulation, on the PVC GPUs. Richard, are you there?

Hi, this is Richard. I am a user support specialist at Texas A&M, and I will be talking to you about running LAMMPS on PVC GPUs. I hope you can see my screen; it should look pretty much the same as the other one. So, what is LAMMPS? Has anyone heard of LAMMPS before? LAMMPS is a molecular dynamics simulation engine, and it has a modular back end for GPU acceleration. There are two main LAMMPS packages that serve that purpose. The GPU package uses OpenCL or CUDA to communicate with GPUs, and it divides work between the GPU and the CPUs. There is also the KOKKOS package, which uses a different kind of declaration language entirely and supports many more types of accelerators, including different types of CPUs; the default strategy with the KOKKOS package on a GPU is to put every calculation on the GPU. I have used both of these, and they both work well. Today we will focus on the GPU package, which divides work among the GPU and the CPUs; for other purposes the KOKKOS package may be better, I just don't have an example prepared. The main difference between the two is a slight change of syntax when you launch the executable, depending on which package you use to communicate with the GPUs; your input files will mostly not vary much, because LAMMPS is pretty good about behaving as though this is truly a modular back end.
On ACES we provide a build of LAMMPS that is compatible with the Intel GPUs, and I have put it in a module: the LAMMPS module with the version name 3August2023-intel-2023.07. The first part, 3August2023, refers to the LAMMPS release itself, a release of LAMMPS from that date, and the second part, intel-2023.07, refers to the version of the Intel compilers that were used to compile the LAMMPS code into executables. Because the module system on ACES is hierarchical, you must load the toolchain before you load the application, so the syntax is module load intel/2023.07 followed by module load LAMMPS/3August2023-intel-2023.07. However, because only a few things on ACES are specific to this toolchain, it is actually good enough to say module load intel/2023.07 LAMMPS: there is only one reasonable option for the module system to consider, so it picks the right one, and that saves you a few keystrokes. Where is this installed? When you load the module, it sets an environment variable (the "HPRC root" variable for LAMMPS) that points to the directory where this build is installed, so we can take a look and see where it is. You could consider this a small exercise if you wish: module load intel/2023.07 LAMMPS.
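So the small exercise is just the module loads below; the exact module version string is whatever "module spider LAMMPS" reports on ACES:

    module load intel/2023.07                             # load the toolchain first (hierarchical modules)
    module load LAMMPS/3August2023-intel-2023.07          # then the LAMMPS build for that toolchain
    # or, equivalently, let the module system pick the only matching build:
    module load intel/2023.07 LAMMPS
    module show LAMMPS                                    # shows the root variable pointing at the install directory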
(Sorry, it's hard to see on my screen; I'm trying to type an underscore but it's kind of invisible, I don't know what happened.) This particular build of LAMMPS provides an executable named lmp_oneapi. If you recall, oneAPI is Intel's name for the collection of compilers that are meant to be used across multiple types of devices. It is common practice in the LAMMPS community to name your executable after the type of device it is meant to be used with; for example, you might see lmp_mpi or lmp_cuda, and these are common. This build follows that paradigm, and the executable is named lmp_oneapi; other than the name, it functions identically to all the other LAMMPS executables you may have seen.

Now, how do we run it? I said that this build of LAMMPS distributes work between the CPUs and the GPUs, which means we will be using MPI as the layer for communicating among the CPUs. I have discovered that this MPI does not play so well with srun (it might be a bug, or maybe I just don't understand how srun works), so in order to get the correct performance we are going to use sbatch: sbatch followed by the file name, and somewhere in that file there is your mpirun command. The syntax for mpirun is -np for the number of processes, where the value is whatever number of processes you choose. For the MPI thread configuration, we have some suggested environment variables here; I don't know what all of them do, because I'm not actually an MPI expert, I got them from our Intel representative, who suggested them.
We are going to try them for this demo; I don't know whether there is a better combination for your specific use case.

One final new thing is called the compute aggregation layer. This is a tool that Intel released as a helper for when you run applications that use both MPI and OpenCL, and the LAMMPS GPU package we are using today is exactly such an application. Naively, if you mpirun a bunch of processes, then every process can send OpenCL calls to any of the GPUs, which can result in communication bottlenecks or other inefficiencies. The compute aggregation layer (CAL) works as a middleman between your MPI processes and your OpenCL devices: it collects the incoming OpenCL calls from the MPI processes and hands them out to the OpenCL devices more intelligently, to reduce the communication bottlenecks that could occur because the MPI processes do not respect each other very well, so to speak. Now, how useful this is of course depends entirely on how much work Intel has put into making sure it makes good choices; I'm not an expert on that, I just know we are going to use it today because it is essentially recommended by Intel. So let's give it a try. To use it, you just prepend your mpirun command with the calrun executable; that sets up an environment so that it intercepts the OpenCL calls from the MPI processes and redistributes them to the GPUs intelligently.
The calrun executable is provided with the LAMMPS module, so you do not need to load any additional modules; it is already in the same installation area.

To benchmark LAMMPS we have some test files that were provided by Intel: these are Intel's variants of common molecular-system input files for LAMMPS. I know what some of them are. For example, on the bottom left we have LJ, which stands for Lennard-Jones, basically the billiard-balls model: the atoms just bounce off each other when they come into close contact, and that's it. On the top right we have rhodo, which is short for rhodopsin, a protein solvated in a bilipid layer with ions, just about as complicated a molecular system as you can design. So we run the full gamut here, from very simple to very complex, and you can try any of these. The default one is LC, actually one I don't know much about, but we can mix and match for fun, so try whichever of these you think are relevant to you. The test files are found under the LAMMPS installation root (the directory that environment variable points to), in apps/TEST, so let's navigate to that directory and see what's there. Yes, those files are there. Actually, let me return to where I was; I was in my home directory, so I guess that's not very helpful.

I have provided a demo Slurm batch file, which should be in the directory you already copied: earlier we made a copy of the aces_pvc_course directory, and one of its three subdirectories is named lammps.
In that lammps directory there is just one file, lammps_demo.slurm, so let's take a look at what's in it. I have my ACES file navigator open and have navigated to the aces_pvc_course/lammps directory, where you can see the lammps_demo.slurm file; I'll click the View button to open it in a tab so we can read it and go over what's in it.

The top few lines are normal-looking Slurm parameters: we want to be on one node, we need some memory, we need some time, and in this demo I have picked one GPU and 16 cores. As your exercise you may choose to vary these; recall that the LAMMPS GPU package distributes work among the GPUs and the CPUs, so as you vary these numbers you will see varying performance. Next it prints out some fun facts about the node your job lands on, which I find handy. Here is where we load those modules: module load intel/2023.07 and module load LAMMPS/3August2023-intel-2023.07. I create a variable for the output, because LAMMPS does produce a log, so we have a log directory and create it to make sure it exists. Here are the MPI communication settings I mentioned, the ones suggested by Intel; I don't have a reason to change them, so I've just put them here, but if you're an MPI expert you might choose to play with them and see what happens. Next we need to navigate to the directory where the input files are located in order to use them; note that this cd happens inside the running job, not on your command line, where you will simply be in the folder that contains the Slurm file. Then I create a variable named "in", which I set to one of the input files located in that TEST directory. The default I picked for you is LC. Perhaps we might be interested in what LC is: it says "biaxial ellipsoid mesogens in isotropic phase". I know what the word "phase" means; the rest of that is a mystery to me.
What were we doing? We were reading this file. Okay, let's read the LAMMPS arguments; if you're not familiar with molecular dynamics, I can give you a quick rundown of what these options do. The first option is obvious: -in just says which input file we will read. -v sets a variable; the variable's name is capital N, and the Intel input files all define this variable, which turns the Newton setting of molecular dynamics on or off. In molecular dynamics, say you are computing the forces on 100 atoms: first you compute the force on atom number one due to atom number 99, and at a later time you compute the force on atom number 99 due to atom number one. If you are running in serial, or in a shared-memory environment, you might notice those have the same magnitude, so if you record the magnitude the first time you calculate it, then the second time you need it you can just read it out of memory and save time. But we are not in a shared-memory environment; we are distributing work across multiple devices, including GPUs and CPUs, so reading the result of the force calculation from the previous time it was done isn't actually faster, it is actually slower. It is faster to simply recompute the force a second time. So we turn off Newton's third law here: we say we are not going to assume that the force on A from B is the same as the force on B from A, because it is faster to just recalculate.

Next, -pk says which package we are using to access the GPUs: we are using the GPU package, not the KOKKOS package (if you were using the KOKKOS package, this would be "kk" for KOKKOS). When you use the GPU package there is an additional argument, the number of GPUs. Because we already defined the number of GPUs at the top with the gres parameter, instead of writing that number down again I read it out of Slurm using the variable SLURM_GPUS_ON_NODE; that way, if you decide to change the number at the top, LAMMPS will get the updated number of GPUs assigned to the job. Then we have the flag -sf gpu, which, I believe, means that force calculations will be assigned to the GPU: some amount of work goes to the GPU and the other calculations are done on the CPUs. LAMMPS will output in two different ways, to a log file or to the screen. We don't need both, so we turn one of them off and point the other at an output file in the output directory we made earlier. And for the special case of rhodopsin there is an additional variable named D, which has to do with diffusion; we just make sure it is set to zero, and I don't even know what it does, sorry about that.

The last part of the job puts all of that together: we run calrun, then mpirun with the number of processes equal to the number of tasks we requested at the top, so every core gets assigned one MPI task, then lmp_oneapi and all of the arguments we saved. That's what this file does. Are there any questions about what's in this file, or would you like some time to edit it? Otherwise, the next thing we are going to do is submit the file to the batch system and see what it does.
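As a condensed, illustrative sketch of how the job file puts those pieces together (the variable names and the exact input-file path are not copied from the real lammps_demo.slurm, and the flag spellings are my reconstruction of the description above):

    IN=<path to the LC input file in the TEST directory>
    ARGS="-in $IN -v N off -pk gpu $SLURM_GPUS_ON_NODE -sf gpu -log logs/run.$SLURM_JOB_ID -screen none"
    calrun mpirun -np $SLURM_NTASKS lmp_oneapi $ARGS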
Because I did not put the reservation flag in the batch file, if you wish to use the reservation you will provide the flag on the command line instead: sbatch --reservation=training followed by the name of the Slurm file.

Question: how much faster is it with Newton's third law off? I don't know, for every case, whether it is much faster or not; that's one of those things where you would need to do the experiment to find out. It is off here because, at least in theory, it is better to have it off when you are using GPUs, but whether it is actually better you would have to try and find out. In my experience it depends on what kind of system you have. Some systems don't have a lot of force calculations; for example Lennard-Jones, the billiard-balls model, has very few force calculations, so saving time on a handful of them doesn't really matter. But then you have models like EAM, the embedded-atom model, where there are a lot of force calculations, so Newton's third law matters more for those cases. And no, I don't know all of them; I'm not really a chemist, I just know how to push the button that makes it go.
Question: how many GPUs are there per node, and are they all in use? We can double-check with pestat -p pvc -G: we have several nodes reserved today, and each one has four PVCs, so you can request anywhere from one to four PVCs for this experiment. The default in the file was one, but you can change that.

Question: can LAMMPS use different GPUs on the same node for distinct MPI ranks? In theory, yes. However, because we are using calrun as the integration layer between MPI and the GPUs, we are not really in control of that; we are deferring to Intel's expertise on the best use of MPI with their GPUs. If you didn't use calrun, you could manually specify how your MPI ranks are assigned to GPUs. In the case of, for example, the KOKKOS package, where all of the work is moved onto the GPUs, it is better to have just one MPI rank per GPU so they are not competing with each other for access to the GPUs; for the GPU package, where work is divided among the GPUs and the CPUs, it is more complicated, and in general I don't know the answer. (Someone is pinging me; oh, that's unrelated. What were we about to do? I forgot already.) Thank you, Carl, for your good questions; I'm sorry I don't know. Yes, the 16 ranks run in parallel: mpirun launches the 16 processes together, and we can explore that as we submit some jobs. Do we have time? I think we do, so we'll explore that, Carl, after we submit some jobs.
some jobs so I just need  to go to my directory where I have these files does everyone have a directory  like this with this file so I'm going to do sbatch --reservation equals training  lammps demo.slurm there is an underscore in the file name just for some reason it's  not showing on my screen and I don't know why yeah I don't know that's weird so I submitted batch job 29518 so I have been assigned to node AC026 and my  job is running so ac026 psux oh looks like I'm too late yep the job is al
ready done if you want to  catch it in action we'll have to make it run for longer so let's take a look at the output so we  get two outputs here we have a lammps demo. job ID number which just has a handful of statements  explaining what was run I always like to print out these statements because if I come back  later and I forget what I was doing it can be very hard to figure out what I was doing so here  we have a record now that this job ran on ac026 and we had 16 processes and the log file
went  here so we could take a look at that log file if you know how to read lammps log files this  particular problem statement had 32,000 atoms and the the 16 MPI processes were divided into a  4x2x2 grid of MPI processes then because there's a replicate statement the system is replicated 16  times up to 500,000 atoms that's standard practice for lammps benchmarking we can scroll down a bit  and you can see that here there is one GPU device that was initialized it happens to be the device  numb
er zero if we scroll down a bit more you can see that the 16 MPI processes are all able to  see device zero if we scroll down some more we can see that the performance over oh sorry this  was the warm-up 10 steps if we scroll down some more you can see that for the 840 steps after the  warm-up 10 steps the performance was 41 million atom time steps per second that's a a standard  unit of performance for lammps benchmarks we can see that this one spent most of its time doing  pair calculations an
d other which probably has to do with waiting for communication between the  devices the total wall time was only 19 seconds which is why I had I didn't catch it live we  can try again and I'll try to catch it live sbatch squeue same node there we go so if I hit  it with ps while it's running you can see that there are many many copies of lmp oneAPI that's  because the MPI run launches many copies of the executable and it's the job of MPI and Cal to  make sure that all these processes are talkin
I believe that means we have enough time to try some experiments. Is there another combination you would like to try, perhaps a different resource count, more GPUs, or a different molecular system besides LC? I don't have a specific exercise, except that we can make some changes to this demo file and see how the result differs. Is there an optimization command we could test? What were the choices again? Let's review the choices. If we look here at our lammps_demo.slurm and I go to edit the file, the first few parameters request resources, so one option is to change what resources we request: we could request a different number of PVC GPUs, any number from one to four (I don't have a multi-node example set up, so it will have to be a single node). We could also change the number of tasks; I think up to 64 is definitely fine. I don't remember how many cores are on the node; is it 96 cores per node, does anybody know? But I've tested up to 64 and it worked fine. The other option is to pick a different input file: here we picked LC, but we could pick one of the others instead, airebo, dpd, eam, lj, rhodo, snap, sw, tersoff, or water.

Question: can you specify which GPU to use instead of letting LAMMPS pick automatically? Because the GPUs are allocated by Slurm, it is not a good idea to try to pick a GPU manually: if Slurm gives you a different GPU than the one you hardcoded into your file, your process won't be able to launch at all. Generally we let LAMMPS pick the GPUs. But to illustrate what I mean, why don't we change this number from one to something else, for example from one to three. If I click Save, then when I launch it, it is going to try to use three GPUs, and we can see how the output differs.

Question: which choice has the most atoms? I don't know which choice has the most atoms; if I had to guess, I would say they all have about the same number. The standard practice is to pick a system and then replicate it up to a specific size: for small crystals that only have about one atom per unit cell, you replicate it many times (up to something like 500,000 atoms), but for something like rhodopsin you don't replicate it at all; instead you add water molecules until you get to the target atom count. That's standard practice in the LAMMPS community, but I don't know for a fact how many atoms are in each of these files; I didn't explore all of them.
If we look in the logs, we can see there is now a log file for the job ending in 23, and if we explore it we can see what's different. Same number of atoms, but now it says that it found three devices, numbered 0, 1, and 2; it did not find device number three, because Slurm arbitrarily picked devices 0, 1, and 2 for this job and not device 3. Next we can see that it initialized devices 0 through 2 on core zero, which means all three of the GPUs are visible to core zero, all three of the GPUs are visible to core one, and so on. If that were all there was to the story, you would have a huge communication bottleneck as every core tries to talk to every GPU, so hopefully something is going to stop that from crashing your system; and I believe that is what calrun does: calrun intercepts all of the OpenCL calls from the cores and redistributes them among the GPUs intelligently, so we don't have that bottleneck. We can see that the performance increased from 41 to 49 million atom-timesteps per second, so with three times as many GPUs it took a little bit less time. The communication time increased dramatically: it used to be mostly "pair" and "other", and now the communication and modify times are much more significant. What that's telling me is that the additional work done by the additional GPUs was largely negated by the additional communication time needed to get those GPUs to work with each other.
These are the kinds of things you need to be aware of when you have lots of choices of ways to divide work among devices. The wall time was lower overall, but not by a lot. With nodes equal to one, you're limited to tasks equal to cores. Tasks-equals-cores is just one way to ask Slurm for cores: you can also ask for CPUs per task, or you can ask for all the cores on the machine; the number of tasks is just my favorite way to request cores. There is more than one way in Slurm to specify the resources you need, and usually they're equivalent. LAMMPS, for example, doesn't care about the distinction between a task, a process, or a thread; it will just launch processes and assign them to cores as best it can. There are other applications where the distinction between a task and a thread actually matters, but not for LAMMPS specifically. What if you request ntasks equal to the number of cores plus one? I don't know what happens if you request contradictory resource requirements from Slurm; I'm guessing it has some kind of internal tiebreaker and just ignores the parameter it doesn't understand. Also, you have to be careful, because some of the Slurm parameters describe the node you want to land on and some of the Slurm parameters describe the resources that you need. So you could say, I want to be on a node that has 96 cores but I only want to use two of them; that's something you can do in Slurm. So you can't just assume that all of the parameters get you what you need. That's why I try to stick with just using ntasks, because I understand what it does, so I don't have to worry about Slurm jobs that don't work properly.
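To make that concrete, here are a few standard ways of asking Slurm for the same 32 cores on one node; these are generic Slurm directives, and whether they end up behaving identically depends on how the cluster is configured, so treat this as a sketch rather than ACES policy:

    # Option 1 (what the demo uses): 32 tasks (MPI ranks), one core each
    #SBATCH --nodes=1
    #SBATCH --ntasks=32

    # Option 2: one task with 32 CPUs attached to it (more natural for threaded codes)
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=32

    # Option 3: claim the whole node, however many cores it has
    #SBATCH --nodes=1
    #SBATCH --exclusive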
You could also increase the nodes and tasks and see what the performance boost would be. You and Shen would like to know what happens if we vary the number of tasks; would you like to suggest a different number of tasks than 16, perhaps double it to 32? Shall we double to 32 and see what happens? Do you want me to put the number of GPUs back to one or leave it at three? GPUs back to one, okay. So I clicked save; the button changes to a darker color after it's saved, and then we can submit that demo again. It is running on AC026; all my jobs are landing on AC026 today. Okay, and it is done, so now we can investigate the log for the job that ends in 25 and see how it's different.
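As an aside, if you want to confirm from the command line which node and which GPUs a job was given, something along these lines should work; field names vary a little between Slurm versions, so this is a sketch:

    squeue -u $USER                                           # the NODELIST column shows where each job landed
    scontrol show job <jobid> | grep -i -e nodelist -e tres   # allocated node and GPU (gres/TRES) details
    sacct -j <jobid> --format=JobID,NodeList,Elapsed,State    # the same information after the job has finished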
You can see that it initialized more processes, and the number of atom-timesteps per second actually decreased: it used to be 41 million atom-timesteps per second and now it's 31. There are a couple of possible explanations. One is that there weren't enough atoms in this benchmark to keep 32 cores busy; if you initialize more cores but don't give them any work to do, you're essentially just wasting time initializing cores, so to put 32 cores to work maybe we needed more atoms. Another possible explanation is that maybe the GPU is getting hot because I've used it repeatedly; you can see I keep landing on device zero on node AC026, so it's possible that the node is just too warm now, but I haven't really gone into detail about measuring the temperatures. Carl asks: could you elaborate more on why the communication overhead is so much greater with three GPUs versus one? I'm not a LAMMPS expert, I just know how to push the button to make it go, but my experience with LAMMPS is that you want to give every device that's going to do calculations enough atoms that it stays busy doing calculations for a while, because communication between devices tends to be the slowest part of simulations. If you give a CPU or a GPU only ten atoms and then immediately say, hey, give me the results back, you're spending too much time communicating and you're going to bottleneck your simulation on communication. You want to give each device something like a million atoms so they can all do their own thing for a while, and that dilutes the effect of communication as the bottleneck of your system.
Are the GPUs loaded sequentially? I do not know. I do know that the atoms are divided among the GPUs according to an approximately spatial partition: LAMMPS divides the space of your simulation into chunks and assigns a bunch of neighboring atoms to the same device. So if the atoms need to talk to atoms that are in a different partition of the simulation, then the two GPUs in charge of those atoms need to communicate with each other, and communication between GPUs is not as fast as a GPU doing its own work. The more you partition your system, the greater the surface area between the partitions, and the more communication has to occur. So with the same number of atoms, when you add more GPUs you're increasing the average amount of communication that needs to occur, and if the GPUs aren't doing enough work to make it worth that added communication, then you don't see any performance benefit. That's why it's important to know the size of your system, so that you know each device is getting allocated an adequate number of atoms to keep it busy and you don't have to think as much about that communication bottleneck. That's not specific to Intel or to GPUs; that's really just a LAMMPS thing. In other simulation frameworks there are other considerations.
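A rough back-of-the-envelope way to see why this happens (a generic scaling argument, not a LAMMPS-specific formula): if N atoms are split spatially across P devices in a roughly cubic decomposition, each device's computation grows with the volume of its partition while its communication grows with the partition's surface area, so

    \frac{\text{communication per device}}{\text{compute per device}} \;\propto\; \frac{(N/P)^{2/3}}{N/P} \;=\; \left(\frac{P}{N}\right)^{1/3}

For a fixed number of atoms, adding devices therefore raises the relative cost of communication, which is exactly why each device needs enough atoms to stay busy.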
Thank you for your good questions; you've pretty much reached the end of my knowledge about LAMMPS. I'm not a chemist, so if you have more questions about LAMMPS it might be better to go to the LAMMPS community and ask them what they think. With no more questions, that brings us to the end of our course. I would like to share with you our acknowledgement slide. This work was supported by the National Science Foundation, so here are some awards: obviously the ACES award, which paid for the accelerator testbed that we've been using, but we also have some other NSF awards. We have the SWEETER award, SouthWest Expertise in Expanding, Training, Education and Research. SWEETER pays for some of the development work on the educational materials and things like user support, and if people need to travel to group meetings to collaborate on these user trainings in person, SWEETER covers those kinds of things within Texas and a few neighboring states specifically. And then we have the NSF award for FASTER. FASTER is our flagship, let's say flagship; it is a production cluster that focuses on composable nodes, and a lot of what we know about running code like AI and LAMMPS we learned from doing those tasks on the FASTER cluster. So when I say things like, in my experience you need to assign enough atoms to each device to keep it busy, that experience comes from the time I spent working with the FASTER cluster, which means that award did in some way contribute to this training. We have an Intel representative, Doney Aruki from Intel, who provided us with the software build capabilities for the AI and LAMMPS technologies that we explored today. I would also like to thank the staff and students at Texas A&M HPRC, without whom training sessions like this wouldn't be possible; it takes a lot of people to keep a cluster running and make sure everybody has access to it in time to do your exercises. I think that's the last slide... no, it's not the last slide. So here's our HPRC helpdesk email, help@hprc.tamu.edu; that's how you contact us if you have questions or if your job's not running. Please do tell us the basic information, like what cluster you're running on and your username, and we would like to help you solve your problems. Okay, that really is the last slide in the slide deck. I guess I should ask my colleague Zhenhua if he wanted to add anything else before we dismiss the class. No, thank you, Richard. I really want to thank everybody who has attended our short course, Introducing Intel PVC GPUs. We also welcome collaborations on things like benchmarking our accelerators, such as the Intel PVCs, and we have various types of accelerators, so if you want to speed up your science with any type of the accelerators available on the ACES cluster, you're welcome to contact us; we are here to help you, and that helps us as well. Thank you, thank you for joining today's short course. Thank you very much.
