Okay, let's get started. Hello everyone, welcome to ACES: Introducing Intel PVC GPUs. Before we start, I want to make sure that everyone can hear me. Tanishq, can you confirm that you can hear me and that you can see my screen? "Yeah, I can see your screen and I can also hear you." Okay, thank you.
Hello everyone, welcome to ACES: Introducing Intel PVC GPUs. My name is Zhenhua He, and I am the instructor for today's short course. Akhil and Druva also worked on these course materials and prepared the presentation slides, and Richard will be the instructor for today's LAMMPS molecular dynamics section; he will demo how to use LAMMPS
on our ACES computing cluster using Intel PVC GPUs. Here is the outline for today's short course. First, we will introduce the Intel PVCs on ACES: we will briefly cover the Intel PVC architecture and the ACES computing platform at Texas A&M High Performance Research Computing. Second, we will demonstrate how to run models from different frameworks, such as PyTorch and TensorFlow, with the PVC GPUs on the ACES system. Here we will be using the Intel AI models; we will run, for example, ResNet-50 on ACES with the PVC GPUs, and afterwards you can open the output file and see the performance of the GPUs, for example the image throughput. The third part is about how to convert a PyTorch model to run on PVC: we will show you the conversion steps, and then you will do some hands-on exercises and run your code on a PVC GPU; we have prepared hands-on exercises for you. We will show you how to do the same for TensorFlow as well. The last section is about LAMMPS on PVC: Richard Lawrence will demonstrate how to run molecular dynamics simulations in the LAMMPS framework with the PVC GPUs on the ACES system. The first lab is about introducing Intel
PVC GPUs on ACES. Here is a picture of the Intel Data Center GPU Max Series PCIe card; the series includes different GPUs, such as the Max 1100 and the Max 1550, and on our ACES computing platform we have the Intel Max 1100 GPU. First I would like to introduce our ACES computing cluster. ACES stands for Accelerating Computing for Emerging Sciences, a project funded by NSF. The first mission of this project is to offer an accelerator testbed for numerical simulations and AI/ML workloads. On the ACES cluster we have various types of accelerators, such as Intel PVC GPUs, Graphcore IPUs (IPU stands for Intelligence Processing Unit), NVIDIA GPUs, and FPGAs, for example Intel FPGAs and Bittware FPGAs. We welcome researchers from different research fields to use these accelerators to speed up their workflows. The second mission is to provide
consulting, technical guidance, and training to researchers. We have been hosting workshops, tutorials, and short courses like this one, so researchers can learn through hands-on exercises to use the novel accelerators on our ACES computing platform. The third mission is to collaborate on computation- and data-enabled research. We have been collaborating with different HPC centers, such as the San Diego Supercomputer Center, the Texas Tech Advanced Computing Center, and the University of Florida, on benchmarking projects, and we have also published papers on our results. Here is a table of the accelerators on the ACES computing platform; you can check the quantities of the different accelerators on the ACES Computing
cluster. We have 32 Graphcore IPUs: 16 of them are Colossus IPUs and 16 are Bow IPUs (Bow is a slightly more advanced IPU than Colossus, so we have both). We also have Intel FPGAs, Bittware FPGAs, NextSilicon coprocessors, and NEC Vector Engines. In addition, we have Intel Optane SSDs: if you have an application that requires a large amount of memory, these Optane SSDs can be addressed as memory with the MemVerge Memory Machine. So if you have any such application that requires large memory, we would like you to test our Intel Optane SSDs; we have 18 terabytes, 48 of these in total. We also have NVIDIA H100 GPUs, 30 of them, as well as NVIDIA A30 GPUs. Today we are using the Intel PVC GPUs on the software development platform called
Intel Arctic Sound (ATS-P); we have 22 Intel PVC GPUs there. This is the Intel Max 1100 GPU on our ACES computing cluster. It has one tile per stack per card, with 56 Xe HPC cores and 448 execution units, so you can calculate that it has eight execution units per core. The TDP (thermal design power) is 300 watts, it is a PCIe Gen 5 x16 card, and it has 48 GB of HBM2e memory with a memory bandwidth of 1.2 terabytes per second. The peak performance is 22 teraflops at FP64 precision. Those are some of the specs of the GPU; if you are interested in knowing more, there is more information on the Intel website about the Intel Max GPU series, including the 1550. Next we would like
to talk about the Intel oneAPI toolkits. oneAPI is a collection of development tools from Intel that lets users develop high-performance applications across multiple architectures, such as Intel CPUs, GPUs, and FPGAs. There are also add-on domain-specific toolkits, for example the Intel oneAPI tools for HPC, the Intel oneAPI tools for IoT, and the Intel oneAPI rendering toolkit. We will not use those today, but if you have any interest in them, we encourage you to look at the Intel documentation. Today we are going to use the Intel AI Analytics Toolkit, and we will run some machine
learning models with it. To make it easier for our users to use the ACES cluster and its accelerators, we have created some shared directories on ACES. For example, we have saved some datasets there: an ImageNet dataset for PyTorch and one for TensorFlow, all preprocessed, so you can use them directly with your model. If you want to train your model with the ImageNet dataset, you can use them as-is, and we also have the TensorFlow ImageNet dataset in TFRecord format. As for the models, we have downloaded the Intel AI models so you can try out the different models. We have also provided containers in the shared directory: the Intel Deep Learning container, which has been converted to a Singularity image so you can use it on our cluster if you are interested. Next I would like to introduce some of the
resources we provide. The first one is our web page, Texas A&M High Performance Research Computing. If you click on this link... now I am at the TAMU HPRC web page. Can you still see my web browser? Give me a thumbs up if you can see my Google Chrome browser. Okay, thank you, Scott. You can see we have a lot of tabs here, and today we are going to use the tab to access ACES via the ACCESS portal. You can also check the status of our computing clusters; Faster and Grace are under maintenance today, so you can see all their numbers are zero. The next item is the ACES quick start guide: if you click on it, we have the ACES quick start guide here, and this is our knowledge base for the different computing clusters. You can see we also have FASTER, Grace, and Terra, so
if you click on the user guide, we have the ACES user guide here, where you can learn about the different accelerators, for example the Intel PVC GPUs. It has an introduction to the Intel PVC GPUs: on our ACES cluster we have six PVC Slurm nodes, each with four Intel PVC GPUs, and it explains how to access these GPUs both interactively and through Slurm batch jobs. You can also monitor PVC GPU utilization with commands such as sysmon and xpumcli; I will introduce these in my presentation and demos as well. Next we will be using
the ACES portal to access our ACES computing cluster; I think most of us here are ACCESS users. Here is the link to the ACCESS documentation, which tells you how to apply for an ACCESS ID, and here is the HPRC YouTube channel, where we have posted 94 videos about our short courses. You can see, for example, Introduction to Containers: Charliecloud, A Hierarchical Module System, Introduction to Julia, and many others. If you are interested in knowing more about TAMU HPRC and our short courses, you can subscribe to our YouTube channel; you are very welcome to watch our videos and learn from our resources. This is our helpdesk email; you can send us an email if you have any questions. Next is how to access the ACES computing cluster. We will be using the ACES portal, the web-based user interface for the ACES cluster. It is based
on Open OnDemand, an advanced web-based graphical interface framework for HPC users; this is the Open OnDemand portal. Next you authenticate via CILogon: you choose ACCESS CI (XSEDE), provide your ACCESS username and password, and click Log On. After that you will need to complete Duo two-factor authentication; for example, I will have to approve it on my phone. That is the process to get onto our ACES cluster. Then we click on the tab Clusters > ACES Shell Access to get a shell, and you will see the login node name in the prompt. Let's do it together; I will show it now. Please follow along, so that you will be able to work on the hands-on exercises and then run some demos with our Intel PVCs. Hover your mouse over the Portal tab and click on ACES Portal (ACCESS). This will redirect us, and we will select the identity provider ACCESS CI (XSEDE), and I'll
just click Log On. I have saved my ACCESS username and password; you may need to type yours. After that I click Log In, and here I will need to do the Duo two-factor authentication; I will approve it on my phone. Now we are on the ACES OnDemand portal, and you can see the different tabs here. Before I continue, I would like to know if everyone is on the ACES OnDemand portal now. Okay, Elizabeth, did you encounter any error, or do you need more time? "I needed time; I have to get to this site. I'm using another system so that I can be looking at it. I want to know which website..." Can you repeat your question? "Is it Texas A&M High Performance Research Computing? That is what I want to know: which one to use on my end. From here, which one to use?" We use the ACES portal. Okay, once you click on the ACES portal, as I did before, you will need to choose ACCESS CI (XSEDE), then click Log On, then provide your ACCESS username and password, go through the Duo two-factor authentication, and click Log In. It should bring you to the ACES OnDemand portal.
Now I would like to introduce the features of the ACES OnDemand portal. For example, if you click on the Files tab, you can see my home directory and my scratch directory; if I click on scratch, you can see the files I have, and we will be downloading these. I would also like to show you how to use the file editor on the portal: when you see a file, for example here, you can use Edit. We will do this a little later as well, because when we edit, I think it will be easier for you to edit in this file editor; if you prefer to use vim, of course you can use vim as well. The next tab is Jobs, where we have Active Jobs and the Job Composer: when you submit a job to the ACES cluster, you can check the status of the job, whether it is completed, pending, or running. Under Clusters, we will be
able to start the ACES shell, and we also have different interactive apps here: VNC (including a NextSilicon VNC), Jupyter Notebook and JupyterLab, RStudio, and TensorBoard. We also have Affinity Groups; you are welcome to look at our ACES group and our HPC group. There is also a Dashboard, which shows you what is going on, for example your disk usage and limits. I think it is under maintenance, so maybe it does not show the disk-usage numbers, but you can try out these dashboards; I think this one is probably the new one. After that, please click on Clusters and then ACES Shell Access, and we will be on a login node of the ACES computing cluster. I am on login node one; I think we have several login nodes. The Clusters drop-down menu, please; yes, that's the one. You should click on ACES Shell Access; you click on this, and it will direct you to
here. Now I am on login node one, so we are all on a login node of the ACES cluster; this is a success. Welcome to the ACES login node. Next, from the login node we can check the status of the PVC Slurm nodes: we can view the PVC nodes and the number of GPUs. You can use these commands; if you have the course presentation slides, you can copy and paste the commands directly into the terminal. For example, here I will run pestat; -p selects the partition and -G shows the number of GPUs. Now you can see we have six PVC Slurm nodes in the pvc partition, each with four PVC GPUs, and their state is reserved: they have been reserved for this short course, so today we are going to use these PVC GPUs. Next we will need to copy the training materials to your personal directory. First, run cd $SCRATCH to navigate to your personal scratch directory. We have saved the files for this short course at /scratch/training/aces_pvc_course, so you will use the command cp -r /scratch/training/aces_pvc_course $SCRATCH to copy the training materials to your personal directory, and then you can enter your local copy. I have already downloaded this into my scratch directory, so I will give you about two minutes to copy the training materials to your personal scratch directory; it should look like this.
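The commands from this section can be collected into a short shell session (a sketch for the ACES cluster; the /scratch/training path follows the slides, so adjust it if your copy of the materials lives elsewhere):

```shell
# Check the PVC Slurm nodes and their GPU counts (-p selects the partition, -G shows GPUs)
pestat -p pvc -G

# Navigate to your personal scratch directory
cd $SCRATCH

# Copy the shared training materials into your own scratch space
cp -r /scratch/training/aces_pvc_course $SCRATCH

# Enter your local copy
cd $SCRATCH/aces_pvc_course
```

These only work on the ACES login nodes, where pestat and the shared training directory are available.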
The second lab is about using PVCs on the ACES computing cluster. These are mostly demos, but you are encouraged to follow along and run the models, in the PyTorch and TensorFlow frameworks, with the Intel PVC GPUs. After running the models you will get an output file that shows the performance of the Intel PVC GPUs as well. First we need to set up the environment; we will start with the PyTorch models. We have actually prepared a job file where the environment setup is already included, but we will go through it line by line so you know how the environment was set up. First we run module purge to clean your environment, and then we module load intel/AIKit, version 2023.2.0; I think this is the latest version, but I need to check. This is the Intel AI Toolkit, which has been installed as a module on our ACES cluster, so you can simply do module load intel/AIKit and select a version; I think we have another version as well. Then we module load the Intel toolchain, and we will be using version 2023.07. Next we set an environment variable called ENV_NAME; for this demo
we will be using PyTorch, so we will clone the environment from the AI Toolkit into your own directory. We set the variable to a name like aikit-pt-gpu-clone, meaning a clone of the PyTorch GPU environment, and if that environment does not already exist, the script clones it from the AI Toolkit: conda create -n $ENV_NAME --clone aikit-pt-gpu. This clones the PyTorch environment from the AI Toolkit, and then we can source activate the environment. This is all in the pt_demo.slurm job file; if you have copied the training materials to your personal scratch directory, it should be in the pytorch folder. We will go there and look at it together. Okay, I'll go back to the terminal, and if you
run ls and cd pytorch, you can see the job file here. Has everyone... can you open the file in the file editor pane instead of vim? Yes, that is also okay; if you use vim, you can open the file from here, and if you want to use our file editor, that is okay too. I think at the beginning you should enter the Files tab and click on scratch/user and your username. If you click on that, I believe you will see the aces_pvc_course directory you copied, and we have three subfolders, one of which is pytorch. If you click on pytorch you will see some files; ignore this one, because I ran this job this morning, so you can see I have some output files here. But I want you to open this Slurm job file: you can click on this button, and then you can see Edit; click on Edit and you will be able to open it from here. Oh, sorry. The first part is about the job specifications: you
are requesting resources, and you will see the job name. The requested wall time is one hour; we request one task, with eight CPUs per task; we will use one node; the memory is 100 gigabytes; and the output file will be pt_demo followed by your job ID. With gres we request a GPU: we want one PVC GPU, from the pvc partition, and today, as I said, we will be using the reservation called training. So those are the resources we requested. The next part is the environment setup I explained to you; then we activate the environment, cd to the ResNet-50 training folder of the Intel AI models, and run the training script. The script is prepared by Intel, so we can just run it and see the performance of the Intel PVC GPUs. Okay, I will come back here; do you have any questions so far? Can you see the file in the file editor on the Open OnDemand portal? Next we will submit the job to the ACES cluster.
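Putting the pieces together, the pt_demo.slurm job file described above looks roughly like this (a sketch reconstructed from the walkthrough, not the literal course file; the module versions, the clone guard, and the training-script path are illustrative, so check the file in the course materials for the exact lines):

```shell
#!/bin/bash
#SBATCH --job-name=pt_demo           # job name
#SBATCH --time=01:00:00              # one hour of wall time
#SBATCH --ntasks=1                   # one task
#SBATCH --cpus-per-task=8            # eight CPUs per task
#SBATCH --nodes=1                    # one node
#SBATCH --mem=100G                   # 100 GB of memory
#SBATCH --output=pt_demo.%j          # output file: pt_demo.<job ID>
#SBATCH --gres=gpu:pvc:1             # one Intel PVC GPU
#SBATCH --partition=pvc              # the pvc partition
#SBATCH --reservation=training       # today's course reservation

# Environment setup, as walked through above
module purge
module load intel/AIKit/2023.2.0
module load intel/2023.07            # the Intel toolchain; check 'module avail intel' for the exact name

# Clone the AI Kit PyTorch GPU environment if it does not exist yet, then activate it
ENV_NAME=aikit-pt-gpu-clone
conda env list | grep -q "$ENV_NAME" || conda create -y -n "$ENV_NAME" --clone aikit-pt-gpu
source activate "$ENV_NAME"

# Run Intel's ResNet-50 training script (the exact path and script name are in the course job file)
cd <path-to-intel-ai-models>/resnet50/training
bash <training-script>.sh
```

The angle-bracket placeholders stand for paths given in the course materials; this sketch is only meant to show how the pieces fit together in one file.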
sbatch pt_demo.slurm submits the job to the ACES cluster. Of course, you can use pestat -p pvc -G, and we can watch it; I will use -n 5 to refresh every five seconds. Okay, we can see some students are on the PVCs now; let's see. Yes, we can see some users using the PVC GPUs, that's great. And I think my job is done; I can use squeue -u u.zh (this is my NetID), and you can see I have no jobs running, so I think my job has finished. The job ID is 29466, so we can go back to the file editor; I'll just refresh this page. If I open the output file, you can see it tells you that you are using
the bfloat16 precision for training ResNet-50, and that you are training on a generated dataset: because I did not specify a dataset directory, it does not use the real ImageNet data. Of course, if you want to train the model on the real dataset, you can specify the path to it. The ResNet-50 training run here is very small, because I only let it run the first 20 iterations, so it is very fast; of course you can let it run more epochs if you really care about getting a well-trained model, but for demo purposes I just let it run about 20 iterations. You can also see the training performance: the batch size we use is 256, and the throughput is 1230 images per second. So this is the output file. If you want to run more epochs, you can do that after the class: in the job file here (let me go back to the Slurm file), you can set the path to the dataset, specify the precision that you want to run the model with, and set the batch size as well; I just used some default parameters for demo purposes. That's great, some students are using the Intel PVC GPUs
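The tunable parameters just mentioned are typically passed as environment variables in the job file before the training script runs; a sketch (the names DATASET_DIR, PRECISION, and BATCH_SIZE follow Intel's reference-model conventions, but check the course job file for the exact variable names):

```shell
# Optional overrides in pt_demo.slurm before the training script is invoked
export DATASET_DIR=$SCRATCH/imagenet   # path to the real ImageNet data; omit to use generated data
export PRECISION=bf16                  # e.g. bf16 or fp32
export BATCH_SIZE=256                  # the default used in the demo
```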
That's a great start. Let me go back to my presentation slides. Actually, I know some researchers prefer using a Python virtual environment; this is just an alternative, for reference. Please do not type these commands into your terminal now, because we will not use the Python virtual environment today; you can try this after the class. I have tried all of these commands already, so they should work. I would like to point out that when you use the AI Toolkit, torch and the Intel Extension for PyTorch have already been installed for you, so you do not have to worry about them. But if you want to install them yourself, you have to be careful about the versions: here, for example, torch is 1.13, so you have to use the matching version of the Intel Extension for PyTorch, which is also 1.13. And when you want to do distributed training, you will also need to install another library, the oneCCL (Intel oneAPI Collective Communications Library) bindings for PyTorch; again you need to find a version that matches the other packages. These are all available on the Intel website, where you can find the matching versions of the different packages, if you want to try the Python virtual environment. But today we don't have time to work on this. So: we changed to the pytorch directory and submitted the job file already; that's great. Okay, next is about TensorFlow, but it's 51 minutes in now,
so I want to take a short break. How about we take a seven-minute break and come back at 11? Let's get back at 11, and we will continue from the TensorFlow models.
Hello everyone, welcome back. We will continue with the TensorFlow ResNet-50 model demo. First things first, we also need to set up the environment for the TensorFlow models, and similarly we use the Intel AI Kit and just clone the environment. Why do we want to clone the environment? Because then you have the freedom to install other packages into it for your own project; for example, you might need tqdm or TensorBoard, which are not included in aikit-tf-gpu, but once you clone the environment, you can source activate it and install the modules you need for your own project. For today's short course we don't need anything more, but it is useful for the future. This is all in tf_demo.slurm, so we change directory to the tensorflow folder and then sbatch tf_demo.slurm. Let's see; my terminal is still active. I need to change to tensorflow, and you can see the tf_demo.slurm. So we use the
o we use the
command sbatch tf_demo.slurm and you can see your job ID here 29487 this is my job ID and
also I can for example watch squeue I will try --me yes Richard is working thank you so
you can see my job is running ac026 PVC node in the slurm so it's working and also we can
watch pestat -p PVC -G so we have some users using these PVCs so it's running I guess this tensorflow
demo is a little longer it takes about five minutes so if you have any questions you can
let us know okay we ha
ve been using this user is using ac086 please give a you can either
thumb down if you have a problems running the jobs so while it is running I would like
to go back to my slides and see what is the next okay the next one is about converting
your pytorch code to run on PVC so this is the next one but I still want to see how many
students are on the PVC now we have a few more okay while it's running I will continue
the third lab pytorch on PVC so if you have a pytorch model pytorch code how
can you convert it and make it work on the Intel PVC GPUs? First we would like to introduce the Intel Extension for PyTorch, a Python package that extends PyTorch models to run on Intel platforms such as the Intel PVC GPUs. The first step is to add the following import statement at the beginning of your script: import intel_extension_for_pytorch as ipex. That is the first thing we have to incorporate to make it work on Intel PVC GPUs; very easy, right? The second step is to move your model and criterion to the xpu device, which is also easy: once you have created your model and criterion, you just do model.to("xpu") and criterion.to("xpu"). The third step is to apply the ipex.optimize function to the model and optimizer objects: ipex.optimize takes your model, your optimizer, and here a data type of bfloat16. It also supports fp32; we try bfloat16 here, and after the class you can try fp32 as well, but the data type has to be consistent with what we use in the fifth step, where we will be using bfloat16. Also, bfloat16 gives a better throughput compared to fp32, so we choose bfloat16 here. The fourth step is to move the data and target to xpu, very similar to other GPU code: in the training loop, when you unpack your train data loader, you do data.to("xpu") and target.to("xpu"). These are very simple changes. As I said, in the fifth step we use automatic mixed precision with the bfloat16 data type: we use the context manager torch.xpu.amp.autocast, which accepts the arguments enabled and dtype; to be consistent with the previous step, the dtype is torch.bfloat16. Those are all the steps: just a few lines of change in your code, and it will be able to run on the Intel PVCs. Here I have prepared
an exercise file called cifar10_pvc_todo.py, and I have listed these steps in the to-do file, so you can use the file editor of the Open OnDemand portal to add all these changes to the file. Read through the code and familiarize yourself with how to change the PyTorch code to PVC code. After completing all the to-dos in the cifar10_pvc_todo file, you can modify the Slurm file and submit the job; you will just have to change the Python command, for example changing the file name to cifar10_pvc_todo.py. Okay, let me show you. I will go back to my pytorch folder and then to the exercises folder; here you can see the to-do file, cifar10_pvc_todo. Of course you can use our file editor; I'll just close some of these, too many open. This is for PyTorch: click on the exercises folder, and we have this to-do file. You can click on this and then Edit.
You can see Step One: import the package Intel Extension for PyTorch and give it the alias ipex; write your code below. You will have to replace the FIXME with the correct code, as shown in the slides; that is the first step. After that you can read through the code, and in Step Two you move the model and the criterion to xpu: you will need to do model = model.to("xpu") and likewise for the criterion, again replacing the FIXME with the correct code. These are the steps to change it to PVC code. If I go back, you can see we have a solution, but I do not encourage you to look at the solution now; I want you to first complete the exercises by yourself and then submit the job to our computing cluster. We also have another file called cifar10_cpu, which does not have the model-to-xpu lines, or rather I just commented all of those lines out, so the PyTorch model will run on a CPU. It will take a long time, so I do not encourage you to run that file during this class; if you are interested, you can run it after the class. It takes several hours, while on the PVC it takes just a few minutes; we just want you to feel the speedup from using the Intel PVC GPUs. Now I will go back to this slide, and I will give you 10 to 15 minutes to work on the exercises and complete the to-do steps in this cifar10_pvc_todo file.
s cifar10_pvc_todo
after that you can submit the job I have prepared the solution I always want to use the vim but
I think I want to advertise our file editor here so you can see this is the solution
and the code changes are like you import the Intel extension for pytorch and also you
move the model and the criterion to xpu and we use the ipex optimize and you take the model
Optimizer and a data type and for here we in the training loop we move the data and target to
xpu and also we use th
is torch xpu amp AutoCast manager and then to wrap it so this is just a
few steps so you're welcome to try it on your own code okay I guess we will continue tensorflow
on PVC. TensorFlow on PVC also has a corresponding package, called the Intel Extension for TensorFlow; for PyTorch we had the Intel Extension for PyTorch. It enables your TensorFlow model to run on xpu devices such as GPUs, CPUs, etc. You can check the version with itex.__version__ (double underscores) and print it. The good thing is that if you install the Intel Extension for TensorFlow in your environment, the default device will be the Intel GPU: for example, if you have an Intel PVC and this package is installed in your environment, it will automatically use the Intel PVC GPU, and you do not have to make any code changes. That is great news, right? If you use TensorFlow, you just need to install the Intel Extension for TensorFlow into your environment, and then without any code changes it will pick up the Intel PVC GPU and run your model with it. So this is great news for the TensorFlow users. Here is a very simple exercise: you do not have to change anything in the code; you just need to modify the Slurm job file tf_cifar10_pvc.slurm and submit your job, and you can see it runs on the PVC already. This is a screenshot of the job file; you can uncomment these two lines to let it run on the PVCs.
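As a quick sanity check of this behavior, a small script like the following can be used (a sketch; it assumes TensorFlow and the Intel Extension for TensorFlow are installed, and the XPU device only appears on a machine with an Intel GPU):

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex

# Print the extension version (note the double underscores)
print(itex.__version__)

# With the extension installed, Intel GPUs appear as XPU devices,
# and TensorFlow places ops on them by default with no code changes
print(tf.config.list_physical_devices("XPU"))
```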
So let me go there and try it. Let me see what the change is: cd tensorflow, and I guess it is in the exercises folder. We have the Slurm file here; I'll open it by name so you can see it. Please uncomment these two lines, yes, here, and then submit the file. I have a job ID, 29502. I'll just watch squeue --me; it is pending now: requested node not available. Let me see what happened there. Okay, it is because all the nodes have been reserved: if you do not specify the reservation, your job will not be able to find a node. So let's add this sbatch directive, --reservation=training; let's add this line and try again, because otherwise it cannot find a node it can use. Now I have the new job here, and because I specified the reservation, it can find a node; it is using ac086, and it is running. I will check; you can also check with pestat -p pvc -G. Oh okay, we have some users using these PVCs, that's great news, and I myself am using ac086; seemingly I am all alone on this node, so I need somebody to get on this node as well. Okay, this is very easy, and I would
like to actually look at the file. You can see that we have imported the Intel Extension for TensorFlow, but the code does not actually use itex at all, so I think if you have the Intel Extension for TensorFlow installed, you may not even need to put this import line in. Let me try; I am very curious about this: if I do not use these two lines, will it still land on the PVC nodes, or will it give me an error? First, let me sbatch the file; okay, 29507. I will watch squeue --me; it is also running on ac086. How can I check if it is really using the PVC GPUs? We want you to always follow good practice: when you run this, you may want to use the VNC for checking the PVC usage. I will introduce that a little later, but we can see it is landing on the PVC node. Do you have any questions so far about TensorFlow? TensorFlow is the easy one, right? You do not have to make any code changes. Okay, I think I will move on; we have
some PVC monitoring tools. I just introduced pestat -p pvc -G to show the number of GPUs; you can use it to check the availability of the PVC GPUs. We can also monitor PVC system activity with sysmon, which tells you the memory being used on the PVC and other information as well. Another one is the Intel XPU Manager: with the xpumcli command you can use stats -d and give it a device index, for example 0, and it will show the memory used and the GPU utilization. Oh, and if you just run sysmon now it fails, because you are on the login node, not a PVC node, so the sysmon error is expected; only once you land on a PVC node can you use sysmon.
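As a quick reference, the three monitoring commands just mentioned, with flag spellings as given in the talk (run the last two on a PVC node, not the login node):

```shell
pestat -p pvc -G    # per-node status for the pvc partition, including GPU availability
sysmon              # PVC memory in use plus other per-GPU information
xpumcli stats -d 0  # Intel XPU Manager: utilization and memory for device index 0
```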
Here is how we can directly land on a PVC node: we can use VNC. You can start a VNC session; I introduced the interactive apps earlier. Click on VNC and choose the node type, for example Intel GPU Max, and the number of GPUs; here I choose one, but if you have distributed training you can choose more, for example two, three or four. Then set the number of hours, the number of cores and the total memory, and after a short wait you will see a blue button called Launch VNC; clicking it brings you here. In the VNC session we can run some jobs. I don't know why I covered my username in the screenshot, but that's okay. Here I change directory to the PyTorch exercise folder, module load the Intel AI toolkit and the Intel toolchain, and activate our cloned environment called aikit-pt-gpu-clone. Then on this node I run the cifar10_pvc solution file, and after that you can use watch sysmon to see the memory usage; it shows that your process is attached to this GPU. This is good practice for checking the PVC memory usage, but if you want to check the power and temperature of the GPU, you probably want to run xpumcli stats -d with the device index. For today I will just show this; I tested it yesterday. You can also use your VNC session to check the PVC usage. I think that's all for the AI/ML part, in PyTorch and TensorFlow.
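The interactive VNC demo above, replayed as a rough command sketch; the environment name is the course's own, while the path, module names and script name are paraphrased from the talk and may differ on your system:

```shell
cd ACES_PVC_course/pytorch          # the PyTorch exercise folder (path illustrative)
module load ...                     # the Intel AI toolkit and Intel toolchain modules
source activate aikit-pt-gpu-clone  # the cloned conda environment from the course
python cifar10_pvc_solution.py &    # run the solution script on this PVC node
watch sysmon                        # shows GPU memory use and the attached process
```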
Next, I would like to invite my colleague Richard Lawrence to show some demonstrations, and maybe exercises, on running LAMMPS, the molecular dynamics code, on PVC GPUs. Richard, are you there? Hi, this is Richard; I'm a user support specialist at Texas A&M, and I will be talking to you about running LAMMPS on PVC GPUs. I hope you can see my screen; it should look pretty much the same as the other one. So, what is LAMMPS? Has anyone heard of LAMMPS before? LAMMPS is a molecular dynamics simulation engine, and it has a modular back end for GPU acceleration. There are two main LAMMPS packages that serve that purpose. The GPU package uses OpenCL or CUDA to communicate with GPUs, and it divides work between the GPU and the CPUs. There's also the Kokkos package, which uses a different programming model entirely and supports many more types of accelerators, including different types of CPUs.
The default strategy when you use a GPU with the Kokkos package is to put every calculation on the GPU. I've used both of these and they both work well. Today we will be focusing on the GPU package, which divides work between the GPU and the CPUs; for other purposes the Kokkos package may be better, I just don't have an example prepared. What's different between the two is primarily a slight change of syntax when you launch your executable, depending on which package you're using to communicate with the GPUs; mostly your input files are not going to vary very much, since LAMMPS is pretty good about behaving as though this is a modular back
end. On ACES we provide a build of LAMMPS that's compatible with the Intel GPUs, and I have put it in a module: the LAMMPS module with the version named 3Aug2023-intel-2023.07. The first phrase, 3Aug2023, refers to the release of LAMMPS itself, from that date, and the second phrase, intel-2023.07, refers to the version of the Intel compilers used to compile the LAMMPS code into executables. Because the module system on ACES is hierarchical, you must load your toolchain before you load your application, so the syntax is module load intel/2023.07, then module load LAMMPS/3Aug2023-intel-2023.07. However, because there are only a few things available on ACES that are specific to this toolchain, it's actually good enough to just say module load intel/2023.07 LAMMPS; there's only one reasonable option the module system can consider, so it picks the right one, which saves you a few keystrokes. Where is this installed? When you load the module it sets an environment variable, the HPRC root for LAMMPS, which points to the directory where this build is installed, so we can take a look and see where it is; you could consider this a small exercise if you wish: module load intel/2023.07 LAMMPS. Sorry, it's hard to see on my screen; I'm trying to type an underscore, and it's kind of invisible.
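To recap the loading steps as commands; the module and variable name spellings follow the talk, so check module avail on ACES for the exact forms:

```shell
module load intel/2023.07
module load LAMMPS/3Aug2023-intel-2023.07
# or, since only one LAMMPS matches this toolchain:
module load intel/2023.07 LAMMPS
echo "$HPRC_ROOT_LAMMPS"   # install prefix set by the module (variable name per the talk)
```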
Anyway, this particular build of LAMMPS provides an executable named lmp_oneapi. If you recall, oneAPI is Intel's name for the collection of compilers meant to be used across multiple types of devices. It is common practice in the LAMMPS community to name your executable after the type of devices it is meant to be used with, so for example you might see lmp_mpi or lmp_cuda; these are common. So Intel has followed that paradigm and decided that this executable will be named lmp_oneapi; other than the name, it functions identically to all the other LAMMPS executables you may have seen. Now, because this build of LAMMPS distributes work between the CPUs and the GPUs, we will be using MPI as the layer for communication among the CPUs. I've discovered that this MPI doesn't play so well with srun; it might be a bug, or maybe I just don't understand how srun works, but in order to get the correct performance we're going to be using sbatch: sbatch and the file name, and somewhere in that file there's your mpirun command. The syntax for mpirun is -np for the number of processes, which can be any number. For the MPI thread configuration we have some suggested environment variables here; I don't know what all of them do, because I'm not actually an MPI expert. I got them from our Intel representative, who suggested them, so we're going to try them for this demo; it is not known to me whether there is a better combination for your specific use
case. One final new thing is the compute aggregation layer. This is a tool Intel released, a helper for when you are running applications that use both MPI and OpenCL; the LAMMPS GPU package we're using today is one such application. Naively, if you mpirun a bunch of processes, then every process can send OpenCL calls to any of the GPUs, and that can result in communication bottlenecks or other inefficiencies. So CAL, the compute aggregation layer, works as a middleman between your MPI processes and your OpenCL devices: it collects incoming OpenCL calls from the MPI processes and hands them out to the OpenCL devices more intelligently, to reduce the communication bottlenecks that could occur from the MPI processes not respecting each other very well, so to speak. Now, the utility of this of course depends entirely on how much work Intel puts into making sure it makes good choices; I'm not really an expert on that, I just know we're going to use it today because it is essentially recommended by Intel. So let's give it a try. To use calrun, you just prepend your mpirun with the calrun executable; that sets up an environment that intercepts those calls from the MPI processes and redistributes them to the GPUs intelligently. The calrun executable is provided with the LAMMPS module, so you do not need to load
any additional modules; it is already in that same installation area. In order to benchmark LAMMPS we have some test files that were provided by Intel; these are Intel's variants of common molecular-system input files for LAMMPS. I know what some of them are. For example, on the bottom left we have LJ, which stands for Lennard-Jones, basically the billiard-balls model: the atoms can bounce off each other when they are in close contact, and that's it. On the top right we have rhodo, which is short for rhodopsin, a protein solvated in a lipid bilayer with ions; it's just about as complicated a molecular system as you can design. So we run the full gamut here, from very simple to very complex, and you can try any of these out. The default one is LC; it's actually one I don't know much about, but we can mix and match for fun, so try any that you think are relevant to you. The test files are found in the HPRC root for LAMMPS, under apps/TEST, so
let's navigate to that directory and see what's there. Yep, those files are there. I'll actually return to the previous directory; I was in my home directory, so I guess that's not very helpful. Okay, so I have provided a demo Slurm batch file, which should be in the directory you already copied: earlier we made a copy of the ACES PVC course, and one of the three directories in it is named lammps, and in that lammps directory there's just this one file, lammps_demo.slurm. So let's take a look at what's in this file. I have my ACES file navigator open here and I have navigated to the ACES PVC course lammps directory, and in it you can see the lammps_demo.slurm file, so I'll just click the View button to open it in a tab so we can read it and go over what's in here. These top few lines are normal-looking Slurm parameters: things like we want to be on one node, we need some memory, we need some time. In this demo I've picked one GPU and 16 cores, so as your exercise you may choose to vary
these. Recall I said that the LAMMPS GPU package will distribute work among the GPUs and the CPUs, so as you vary these numbers you will see varying performance. Next, here we print out some fun facts about the node your job lands on, which I find handy. Here's where we load those modules: module load intel/2023.07, module load LAMMPS/3Aug2023-intel-2023.07. I create a variable for the output, because LAMMPS does produce a log, so we'll just have a log directory and create it to make sure it exists. Here are the MPI communication settings I previously told you about, the ones suggested by Intel; I don't have a reason to change these, so I've just put them here, but if you're an MPI expert you might choose to play with them and see what happens. Next we need to navigate to the directory where those input files are located in order to use them; note this cd happens inside the running job, not on your command line, where you'll simply be located in the folder containing the Slurm file. Then I've created a variable named IN, which I just set to one of the input files located in that TEST directory. The default I picked for you is LC. Perhaps we might be interested in what LC is; I don't know, it says it's biaxial ellipsoid mesogens in an isotropic phase. I know what the word phase means; the rest is a mystery to me. What were
we doing? We were reading this file. Okay, so let's read these LAMMPS arguments. If you're not familiar with molecular dynamics, I can give you a real quick rundown on what these options do. The first one is obvious: -in just says which input file we will read. -v sets a variable; the variable's name is capital N, and the Intel input files all have this variable defined, which turns on and off the Newton setting of molecular dynamics. In molecular dynamics, say you're computing the forces on 100 atoms: first you compute the force on atom number one due to atom number 99, and at a later time you compute the force on atom number 99 due to atom number one. If you're in serial, or in a shared-memory environment, you might notice those are actually the same magnitude, so if you record the magnitude the first time you calculate it, then the second time you need it you can just read it out of memory and save time. But we are not in a shared-memory environment; we are distributing work across multiple devices, including GPUs and CPUs, so reading the result of the force calculation from the previous time it was done isn't actually faster, it's slower. It is faster to simply recompute the force a second time. So we turn off Newton's third law here: we're not going to assume that the force on A due to B is the same as the force on B due to A, because
it's faster to just recalculate. Next, -pk is which package we're using to access the GPUs; we're using the GPU package, not the Kokkos package. If you were using the Kokkos package, this would be -pk kk for Kokkos. Then there's an additional argument when you use the GPU package, which is how many GPUs. Because we previously defined how many GPUs up here with the gres parameter, instead of writing that number down again I'm going to read it out of Slurm using the variable SLURM_GPUS_ON_NODE; that way, if you decide to change the number up here, LAMMPS will get the updated number of GPUs assigned to the job. Then we have the flag -sf gpu, which means we're going to be assigning force calculations to the GPU; I believe that's what it means: some amount of work will be assigned to the GPU, and the other calculations will be done on the CPUs. LAMMPS will output two different ways: either to a log file or to the screen. We don't need both, so we're turning off one of them, and we set the other to an output file in the output directory we previously made. And for the special case of rhodopsin there's an additional variable named d, which has to do with diffusion; we're just making sure it's set to zero, and I don't even know what it does, sorry about
that. So the last part of the job puts all of that together: calrun, then mpirun with the number of processes equal to the number of tasks we requested up here, so that every core gets assigned one MPI task, then lmp_oneapi, and then all those arguments we saved. That's what this file does.
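Putting the walkthrough together, here is a condensed sketch of what lammps_demo.slurm contains; the gres spelling, input-file name and log path are illustrative, and the Intel-suggested MPI environment variables are omitted because the talk did not enumerate them:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16            # CPU cores; the GPU package splits work between CPUs and GPU
#SBATCH --gres=gpu:pvc:1       # PVC GPUs, 1 to 4 per node (spec illustrative)

module load intel/2023.07 LAMMPS
mkdir -p "$SLURM_SUBMIT_DIR/logs"     # LAMMPS writes a log; keep them in one place
# (the Intel-suggested MPI tuning variables were exported here in the real file)

cd "$HPRC_ROOT_LAMMPS/apps/TEST"      # Intel benchmark inputs (path per the talk)
IN=in.intel.lc                        # default system: LC (file name illustrative)

# -v N off : turn off Newton's third law via the N variable in the Intel inputs
# -pk gpu <n> -sf gpu : GPU package with the GPU count read back from Slurm
# -screen none -log ... : log to a file instead of the screen
calrun mpirun -np "$SLURM_NTASKS" lmp_oneapi \
    -in "$IN" -v N off \
    -pk gpu "$SLURM_GPUS_ON_NODE" -sf gpu \
    -screen none -log "$SLURM_SUBMIT_DIR/logs/lammps_$SLURM_JOB_ID.log"
```

Submit it with sbatch --reservation=training lammps_demo.slurm if you are using the course reservation.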
Are there any questions about what's in this file, or would you like some time to edit it? Otherwise, the next thing we're going to do is submit the file to the batch system and see what it does. Because I did not put the reservation flag in the batch file, if you wish to use the reservation you will provide it on the command line instead: sbatch --reservation=training, then the name of the Slurm file. Question: how much faster is it with Newton's third law off? I don't know for every case whether it's much faster; that's one of those things where you would need to do the experiment to find out. It's off because, at least in theory, it's better off when you're using GPUs, but whether it's actually better you would have to try and find out. In my experience it depends on what kind of system: some systems don't have a lot of force calculations, for example Lennard-Jones, the billiard-balls model, where there are very few, so saving time on a handful of force calculations doesn't really matter. But then you have many-body models like EAM, the embedded-atom model, where there are a lot of force calculations, so the Newton setting matters more for those cases. And no, I don't know all of them; I'm not really a chemist, I just know
how to push the button that makes it go. How many GPUs per node, and are they all in use? We can double-check: pestat -p pvc -G. We have several nodes reserved today, and each one has four PVCs, so you can request from one to four PVCs for this experiment; the default in the file was one, but you can change that. Can LAMMPS use different GPUs on the same node for distinct MPI ranks? In theory yes; however, because we're using calrun as the integration layer between MPI and the GPUs, we are not really in control of that, and instead we're deferring to Intel's expertise on the best use of MPI with their GPUs. If you didn't use calrun, you could manually specify how your MPI ranks get assigned to GPUs. In the case of, for example, the Kokkos package, where all of the work is moved onto the GPUs, it is better to have just one MPI rank per GPU, so they're not competing with each other for access; for the GPU package, where work is divided among the GPUs and the CPUs, it's more complicated, and in general I don't know the answer. Someone is pinging me; oh, that's unrelated. What were we about to do? I
forgot already. Thank you, Carl, for your good questions; I'm sorry I don't know. Are the 16 ranks in parallel? Yes, mpirun launches the 16 processes together; we can explore that as we submit some jobs. Do we have time? I think we do, so we'll explore that, Carl, after we submit some jobs. I just need to go to my directory where I have these files. Does everyone have a directory like this, with this file? So I'm going to do sbatch --reservation=training lammps_demo.slurm; there is an underscore in the file name, it's just not showing on my screen for some reason, and I don't know why. That's weird. So I submitted batch job 29518, I have been assigned node ac026, and my job is running. Let me check the processes on ac026. Oh, looks like I'm too late; yep, the job is already done. If you want to catch it in action, we'll have to make it run longer. So let's take a look at the output. We get two outputs here: a lammps_demo file ending in the job ID number, which just has a handful of statements explaining what was run. I always like to print out these statements, because if I come back later and forget what I was doing, it can be very hard to figure out. So here we have a record that this job ran on ac026, we had 16 processes, and the log file
went here, so we could take a look at that log file. If you know how to read LAMMPS log files: this particular problem statement had 32,000 atoms, and the 16 MPI processes were divided into a 4x2x2 grid. Then, because there's a replicate statement, the system is replicated 16 times, up to about 500,000 atoms; that's standard practice for LAMMPS benchmarking. We can scroll down a bit and see that one GPU device was initialized; it happens to be device number zero. If we scroll down a bit more, you can see that the 16 MPI processes are all able to see device zero. Scrolling down some more, we can see the performance; oh sorry, this was the 10 warm-up steps. Scrolling further, for the 840 steps after the 10 warm-up steps, the performance was 41 million atom-timesteps per second, a standard unit of performance for LAMMPS benchmarks. We can also see that this run spent most of its time doing pair calculations and "other", which probably has to do with waiting for communication between the devices. The total wall time was only 19 seconds, which is why I didn't catch it live. We can try again and I'll try to catch it: sbatch, squeue, same node, there we go. So if I hit it with ps while it's running, you can see there are many, many copies of lmp_oneapi; that's because mpirun launches many copies of the executable, and it's the job of MPI and CAL to make sure all these processes are talking to each other correctly.
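A side note on the performance unit: "atom-timesteps per second" is just atoms times steps divided by loop time, so with illustrative round numbers close to this run (about 500,000 atoms, 840 steps, roughly 10 seconds of loop time) you get a figure in the same ballpark as the 41 million reported:

```shell
atoms=500000; steps=840; seconds=10   # illustrative numbers, not the exact log values
echo $(( atoms * steps / seconds ))   # prints 42000000, i.e. 42 million atom-timesteps/s
```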
I believe that means we have enough time to try some experiments. So, is there another combination you would like to try? Perhaps a different resource count, more GPUs; perhaps a different molecular system, one besides LC. I don't have a specific exercise, except that we can make some changes to this demo file and see how the result differs. Is there an optimization command we could test? What were the choices again? Let's review them. Oops. If we look here at our lammps_demo.slurm, suppose I go to edit the file: the first few parameters request resources, so one option is to change what we request. We could request a different number of PVC GPUs, any number from one to four; I don't have a multi-node example set up, so it will have to be a single node. We could also change the number of tasks; I think definitely up to 64 is fine. I don't remember how many cores are on the node, is it 96 per node, does anybody know? But I've tested up to 64 and it worked fine. And the other option is to pick a different input file: here we picked LC, but we could pick one of the others instead: airebo, dpd, eam, lj, rhodo, snap, sw, tersoff, water. Can you specify which GPU to use, instead
of letting LAMMPS pick automatically? Because the GPUs are allocated by Slurm, it's not a good idea to try to manually pick a GPU: if Slurm gives you a different GPU than the one you hardcoded into your file, your process won't be able to launch at all. Generally we let LAMMPS pick the GPUs. But to illustrate what I mean, why don't we just change this number from one to something else; for example, from one to three. If I click Save here and launch that, it's going to try to use three GPUs, and we can see how the output differs. Which choice has the most atoms? I don't know; if I had to guess, I would say they all have about the same number. The standard practice is to pick a system and then replicate it up to a specific size: for crystals that have only one atom per unit cell you replicate it many times, but for something like rhodopsin you don't replicate it at all; instead you add water molecules until you reach the target atom count. That's standard practice in the LAMMPS community, but I don't know for a fact how many atoms are in each of these files;
I didn't explore all of them. So if we look in the logs, we can see there's now a log file for the job ending in 23. Exploring it, we can see what's different: same number of atoms, but now it says it found three devices, numbered 0, 1 and 2. It did not find device number 3, because Slurm arbitrarily picked devices 0, 1 and 2 for this job. Next we can see that it initialized devices 0 to 2 on core zero; that means all three of the GPUs are visible to core zero, and all three are visible to core one, and so on. If this were all there was to the story, you would have a huge communication bottleneck as every core tries to talk to every GPU, so hopefully something stops that from crashing your system, and I believe that something is calrun: calrun intercepts all the OpenCL calls from the cores and redistributes them among the GPUs intelligently, so we don't have that bottleneck. We can see that the performance increased from 41 to 49 million atom-timesteps per second, so with three times as many GPUs it took a little less time. The communication time increased dramatically: it used to be mostly pair and "other", and now the communication and modify times are more significant. What that's telling me is that the additional work done by the extra GPUs was partly negated by the additional communication needed to get those GPUs to work with each other. These are the kinds of things you need to be aware of when you have many choices of ways to divide work among devices. The wall time was overall less, but not by a lot. With nodes=1, are you limited to ntasks equal to the number of cores? ntasks is just one way to ask Slurm for cores; you can also ask for CPUs per task, or for all the cores on the machine; ntasks is just my favorite way to request cores.
There is more than one way in Slurm to specify the resources you need, and usually they're equivalent. LAMMPS, for example, doesn't care about the distinction between a task, a process or a thread; it will just launch processes and assign them to cores as best it can. There are other applications where the distinction between a task and a thread actually matters, but not LAMMPS specifically. What if you request ntasks equal to the number of cores plus one? I don't know what happens if you request contradictory resource requirements from Slurm; I'm guessing it has some kind of internal tiebreaker and just ignores the parameter it doesn't understand. Also, you have to be careful, because some Slurm parameters describe the node you want to land on and some describe the resources you need: you could say, I want to be on a node that has 96 cores but I only want to use two of them; that's something you can do in Slurm. So you can't just assume that all of the parameters get you what you need; that's why I try to stick with just ntasks, because I understand what it does, so I don't have to worry about Slurm jobs that don't work properly. Could we increase the nodes or tasks and see the performance boost? You and Shen would like to know what happens if we vary the number of tasks. Would you like to suggest a different number than 16, perhaps double to 32? Shall we double to 32 and see what happens? Do you want me to put the number of GPUs back to one, or leave it at three? GPUs back to one, okay, so
I clicked Save, and the button changes to a darker color after it's saved, and then we can submit that demo again. It is running on ac026; all my jobs are landing on ac026 today. Okay, it is done, so now we can investigate the log for the job ending in 25 and see how it's different. You can see that it initialized more processes, and the number of atom-timesteps per second actually decreased: it used to be 41 million and now it's 31. There are a couple of possible explanations. One is that there weren't enough atoms in this benchmark to keep 32 cores busy: if you initialize more cores but don't give them any work to do, you're essentially just wasting time initializing cores, so in order to put 32 cores to work maybe we needed more atoms. Another possible explanation is that maybe the GPU is getting hot because I used it repeatedly; you can see I keep landing on device zero on node ac026, so it's possible the node is just too warm now, but I haven't really gone into detail about measuring the temperatures. Carl asks: could you elaborate more on why the comms overhead
is so much greater with three versus one GPUs? I'm not a LAMMPS expert, I just know how to push the button to make it go, but my experience with LAMMPS is that you want to give every device that's going to do calculations enough atoms to keep it busy calculating for a while, because communication between devices tends to be the slowest part of a simulation. If you give a CPU or a GPU like 10 atoms and then immediately say, hey, give me the results back, you're spending too much time communicating and you're going to bottleneck your simulation on communication; you want to give each device like a million atoms so they can all do their thing for a while, and that dilutes the effect of communication as the bottleneck of your system. Are the GPUs loaded sequentially? I do not know. I do know that the atoms are divided among the GPUs according to an approximately spatial partition: LAMMPS divides the space of your simulation into chunks and assigns a bunch of neighboring atoms to the same device, so if atoms need to talk to atoms in a different partition of the simulation, then the two GPUs in charge of those atoms need to communicate with each other. Communication between GPUs is not as fast as a GPU doing its own work, so the more you partition your system, the greater the surface area between the partitions and the more communication has to occur. So with the same number of atoms, adding more GPUs increases the average amount of communication that needs to occur, and if the GPUs aren't doing enough work to make that added communication worth it, you don't see any performance benefit. That's why it's important to know the size of your system: so you know each device is allocated an adequate number of atoms to keep it busy, and you don't have to think as much about the communication bottleneck. That's not specific to Intel or GPUs; it's really just a LAMMPS thing, and in other simulation frameworks there are other considerations. Thank you for your good questions; you've pretty much reached the end of my knowledge
about LAMMPS. I'm not a chemist, so if you have more questions about LAMMPS, it might be better to go to the LAMMPS community and ask them what they think. With no more questions, that brings us to the end of our course. I would like to share our acknowledgement slide: this work was supported by the National Science Foundation. Here are some awards: obviously the ACES award, which paid for the accelerator testbed we've been using, but also some other NSF awards. We have the SWEETER award, SouthWest Expertise in Expanding Training, Education and Research; SWEETER pays for some of the development work for these educational materials, and things like user support and travel to group meetings to collaborate on these user trainings when they're in person; SWEETER does those kinds of things within Texas and a few neighboring states specifically. And then we have the NSF award for FASTER, which is, let's say, our flagship production cluster that focuses on composable nodes, and a lot of what we know about running code like AI and LAMMPS we learned while doing those tasks on the FASTER cluster. So when I say things like, in my experience you need to assign enough atoms to each device to keep it busy, that experience comes from the time I spent working with the FASTER cluster, which means that award did in some way contribute to this training. We have an Intel representative, Doney Aruki from Intel, who provided us with the software build capabilities for the AI and LAMMPS technologies we explored today. And I would also like to thank the staff and students at Texas A&M HPRC, without whom training sessions like this wouldn't be possible; it takes a lot of people to keep a cluster running and make sure everybody has access to it in time to do your exercises. I think that's the last slide.
No, it's not the last slide. Here's our HPRC help email, help@hprc.tamu.edu; that's how you contact us if you have questions. If your job's not running, please do tell us the basic information, like what cluster you're running on and your username, and we would like to help you solve your problems. Okay, that really is the last slide in the slide deck. I guess I should ask my colleague Zhenhua if he wants to add anything else before we dismiss the class. No, thank you, Richard. I really want to thank everybody who attended our short course, Introducing Intel PVC GPUs. We also welcome collaborations on things like benchmarking our accelerators, such as the Intel PVCs; we have various types of accelerators, and if you want to speed up your science with any type of accelerator available on the ACES cluster, you're welcome to contact us; we are here to help you, and it helps us as well. Thank you for joining today's short course; thank you very much.