1. Overview
I will demonstrate the various aspects involved in using GPUs to speed up the training of deep learning models.
I will briefly cover
- Prerequisites
- Need for speed
- Tensorflow
- NVIDIA GPUs
- CUDA, cuDNN
- Python virtual environment
2. Prerequisites
The terminal commands in this guide should be applicable on the UiS UNIX servers.
Requirements:
- [UNIX user registration](https://foswikia.ux.uis.no/Info/UnixUserReg)
- [UiS AI Lab registration](https://gpu.ux.uis.no/user_registration.php)
You also need to be able to log on to the UNIX server.
- Using the Virtual Desktop [Nomachine](https://foswikia.ux.uis.no/Info/NX)
- Terminal login with [SSH](https://foswiki.ux.uis.no/bin/view/Info/SshCommand)
3. Deep learning
Deep learning (DL) applications can involve
- millions of parameters to be trained
- exhaustive hyperparameter searches
4. Need for speed
An example where training a single model required
- Two hours on an old computer without GPU support
- Two minutes on a UNIX server with GPU support
4.1. Tensorflow
"[Tensorflow](https://en.wikipedia.org/wiki/TensorFlow) is a software library for machine learning and artificial intelligence"
- "Mainly used for training and inference of neural networks".
- "Developed by the Google Brain team for Google's internal use in research and production"
- "TensorFlow can be used in a wide variety of programming languages, including Python, JavaScript, C++, and Java"
We focus on Python.
4.2. GPUs, CUDA and cuDNN
For the UNIX servers, the [Foswiki information resource](https://foswiki.ux.uis.no/bin/view/Info/WebHome) provides an overview of the [computation servers (Tungregning-servere)](https://foswiki.ux.uis.no/bin/view/Info/TungRegning).
4.3. GPUs on the computation servers
The following GPU types reside on the computation servers gorina4 and gorina6:
- [NVIDIA Tesla P100](https://www.nvidia.com/en-in/data-center/tesla-p100/)
- [NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/)
Requires manual reservation.
4.4. GPUs on the SLURM cluster
The following GPU type resides on the servers gorina7, gorina8 and gorina9:
- [NVIDIA A100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/a100/)
Using these requires the SLURM job-queue system, accessed via gorina11.
4.5. CUDA
"The [NVIDIA® CUDA® Toolkit](https://developer.nvidia.com/cuda-toolkit) provides a development environment for creating high-performance, GPU-accelerated applications."
- "Develop, optimize, and deploy your applications on GPU-accelerated embedded systems"
- "The toolkit includes GPU-accelerated libraries, debugging and optimization tools",…
4.5.1. cuDNN
"The [NVIDIA CUDA® Deep Neural Network library (cuDNN)](https://developer.nvidia.com/cudnn) is a GPU-accelerated library of primitives for deep neural networks."
4.5.2. TensorRT
[NVIDIA® TensorRT™](https://developer.nvidia.com/tensorrt) is an ecosystem of APIs for high-performance deep learning inference.
5. Demonstration
We will look at an example, [Training a neural network on MNIST with Keras](https://www.tensorflow.org/datasets/keras_example).
We will train a neural network to recognize handwritten digits.
For installing and using tensorflow, we refer to [Install TensorFlow with pip](https://www.tensorflow.org/install/pip) from the [tensorflow web pages](https://www.tensorflow.org). Note that we use `venv` instead of `conda` in the following example.
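The linked example builds a small dense network with Keras. As a rough illustration of what "training" means here, the following pure-NumPy sketch trains a tiny softmax classifier on synthetic stand-in data; it is not the demo code, which uses TensorFlow and the real MNIST dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 256 flattened 28x28 "images", 10 digit classes.
X = rng.normal(size=(256, 784))
y = rng.integers(0, 10, size=256)

# A single dense layer with softmax, trained by gradient descent
# on the cross-entropy loss.
W = np.zeros((784, 10))
b = np.zeros(10)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W, b):
    p = softmax(X @ W + b)
    return -np.log(p[np.arange(len(y)), y]).mean()

initial = loss(W, b)
for _ in range(50):
    p = softmax(X @ W + b)
    p[np.arange(len(y)), y] -= 1.0    # dL/dlogits = p - one_hot(y)
    W -= 0.1 * (X.T @ p) / len(y)
    b -= 0.1 * p.mean(axis=0)
print(f"loss: {initial:.3f} -> {loss(W, b):.3f}")  # the loss decreases
```

The Keras demo does the same thing at a larger scale, with the gradient computation and the GPU placement handled by tensorflow.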
5.1. The numbers matter
It is important to be aware of which versions of Python, tensorflow, CUDA, cuDNN and TensorRT you will be using.
Check [the compatibility table](https://www.tensorflow.org/install/source#gpu) to ensure you are using compatible versions of tensorflow, CUDA and cuDNN.
For example, if you know you will be using tensorflow 2.12.0, the table tells you that it is compatible with
- Python 3.8-3.11. We will be using Python 3.10.
- CUDA 11.8
- cuDNN 8.6
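To make the version bookkeeping concrete, here is a small hypothetical helper (not part of the demo) that encodes the table row for tensorflow 2.12.0 and checks a planned setup against it:

```python
# Hypothetical helper: one row of the TensorFlow compatibility table.
# The version numbers below are taken from the table for tensorflow 2.12.0.
COMPAT = {
    "2.12.0": {"python": ((3, 8), (3, 11)), "cuda": "11.8", "cudnn": "8.6"},
}

def is_compatible(tf_version, python_version, cuda, cudnn):
    """Return True if the planned versions match the table entry."""
    row = COMPAT[tf_version]
    lo, hi = row["python"]
    return lo <= python_version <= hi and cuda == row["cuda"] and cudnn == row["cudnn"]

print(is_compatible("2.12.0", (3, 10), "11.8", "8.6"))  # → True
```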
5.2. Identifying available libraries
Use `uenv-avail` to see which libraries are available. You can filter the output using grep:
uenv-avail | grep -i miniconda | grep -i 310
uenv-avail | grep -i cuda | grep -i 11.8
uenv-avail | grep -i cudnn | grep 11. | grep 8.6
uenv-avail | grep -i tensorrt | grep 11.x-8.6
The last three commands return cuda-11.8.0, cudnn-11.x-8.6.0 and TensorRT-11.x-8.6-8.5.3.1, respectively.
We will add these libraries to `LD_LIBRARY_PATH` to make them available to applications in the environment.
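Conceptually, this amounts to prepending the library directories to `LD_LIBRARY_PATH`. A minimal sketch of that mechanism, with illustrative directory names rather than the real install paths (on the servers, `uenv` handles this for you):

```python
import os

def prepend_ld_library_path(*dirs):
    """Prepend directories to LD_LIBRARY_PATH, keeping any existing entries."""
    current = os.environ.get("LD_LIBRARY_PATH", "")
    parts = list(dirs) + ([current] if current else [])
    os.environ["LD_LIBRARY_PATH"] = ":".join(parts)

# Illustrative paths only, not the actual install locations on the servers.
prepend_ld_library_path("/opt/cuda-11.8.0/lib64", "/opt/cudnn-11.x-8.6.0/lib")
print(os.environ["LD_LIBRARY_PATH"])
```

The dynamic linker searches these directories in order, so tensorflow can find the CUDA, cuDNN and TensorRT shared libraries at import time.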
5.3. Making the environment
The following documents the terminal commands for setting up the virtual environment and running the MNIST demo, both on the manually reserved GPU #4 on gorina6 and on the SLURM cluster.
5.3.1. Getting started
We have reserved GPU #4 on gorina6, and the MNIST demo has been copied to disk:
ssh go6
hostname
ls ~/bhome/MNIST_demo
From [the compatibility table](https://www.tensorflow.org/install/source#gpu) we found the following combination of libraries:
- Python 3.8-3.11. We will be using Python 3.10.
- CUDA 11.8
- cuDNN 8.6
We use `uenv-avail` to identify these:
uenv-avail | grep -i miniconda | grep -i 310
uenv-avail | grep -i cuda | grep -i 11.8
uenv-avail | grep -i cudnn | grep 11. | grep 8.6
uenv-avail | grep -i tensorrt | grep 11.x-8.6
5.3.2. Making the python virtual environment
First we create the virtual environment with the `venv` module. The leading `uenv` command tells the system which version of Python to use.
uenv verbose miniconda3-py310 python -m venv ~/bhome/.venv/mnist_demo
The next step is to activate the environment and install the required Python packages `tensorflow` and `tensorflow_datasets`. Note the commented line from a failed attempt to install the CUDA, cuDNN and TensorRT libraries directly.
source ~/bhome/.venv/mnist_demo/bin/activate
pip install --upgrade pip
#pip install nvidia-cudnn-cu11==8.6.0.163 tensorrt==8.5.3.1 tensorflow==2.12.0
pip install tensorflow==2.12.0
pip install tensorflow_datasets
Requirement already satisfied: pip in /mnt/beegfs/home/trygve-e/.venv/mnist_demo/lib/python3.10/site-packages (23.0.1)
Collecting pip
Using cached pip-24.3.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.0.1
Uninstalling pip-23.0.1:
Successfully uninstalled pip-23.0.1
Successfully installed pip-24.3.1
Collecting tensorflow==2.12.0
Using cached tensorflow-2.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting absl-py>=1.0.0 (from tensorflow==2.12.0)
Using cached absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow==2.12.0)
Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=2.0 (from tensorflow==2.12.0)
Using cached flatbuffers-25.1.21-py2.py3-none-any.whl.metadata (875 bytes)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow==2.12.0)
Using cached gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting google-pasta>=0.1.1 (from tensorflow==2.12.0)
Using cached google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting grpcio<2.0,>=1.24.3 (from tensorflow==2.12.0)
Using cached grpcio-1.69.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting h5py>=2.9.0 (from tensorflow==2.12.0)
Using cached h5py-3.12.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.5.0-py3-none-any.whl.metadata (22 kB)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12.0)
Using cached keras-2.12.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting libclang>=13.0.0 (from tensorflow==2.12.0)
Using cached libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
Collecting numpy<1.24,>=1.22 (from tensorflow==2.12.0)
Using cached numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting opt-einsum>=2.3.2 (from tensorflow==2.12.0)
Using cached opt_einsum-3.4.0-py3-none-any.whl.metadata (6.3 kB)
Collecting packaging (from tensorflow==2.12.0)
Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow==2.12.0)
Using cached protobuf-4.25.5-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Requirement already satisfied: setuptools in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow==2.12.0) (65.5.0)
Collecting six>=1.12.0 (from tensorflow==2.12.0)
Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting tensorboard<2.13,>=2.12 (from tensorflow==2.12.0)
Using cached tensorboard-2.12.3-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow-estimator<2.13,>=2.12.0 (from tensorflow==2.12.0)
Using cached tensorflow_estimator-2.12.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting termcolor>=1.1.0 (from tensorflow==2.12.0)
Using cached termcolor-2.5.0-py3-none-any.whl.metadata (6.1 kB)
Collecting typing-extensions>=3.6.6 (from tensorflow==2.12.0)
Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting wrapt<1.15,>=1.11.0 (from tensorflow==2.12.0)
Using cached wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting tensorflow-io-gcs-filesystem>=0.23.1 (from tensorflow==2.12.0)
Using cached tensorflow_io_gcs_filesystem-0.37.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting wheel<1.0,>=0.23.0 (from astunparse>=1.6.0->tensorflow==2.12.0)
Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
Collecting jaxlib<=0.5.0,>=0.5.0 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.5.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (978 bytes)
Collecting ml_dtypes>=0.4.0 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached ml_dtypes-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
INFO: pip is looking at multiple versions of jax to determine which version is compatible with other requirements. This could take a while.
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.38-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.38,>=0.4.38 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.38-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.37-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.37,>=0.4.36 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.36-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.36-py3-none-any.whl.metadata (22 kB)
Using cached jax-0.4.35-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.35,>=0.4.34 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.35-cp310-cp310-manylinux2014_x86_64.whl.metadata (983 bytes)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.34-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.34,>=0.4.34 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.34-cp310-cp310-manylinux2014_x86_64.whl.metadata (983 bytes)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.33-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.33,>=0.4.33 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.33-cp310-cp310-manylinux2014_x86_64.whl.metadata (983 bytes)
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.31-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.31,>=0.4.30 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.31-cp310-cp310-manylinux2014_x86_64.whl.metadata (983 bytes)
INFO: pip is still looking at multiple versions of jax to determine which version is compatible with other requirements. This could take a while.
Collecting jax>=0.3.15 (from tensorflow==2.12.0)
Using cached jax-0.4.30-py3-none-any.whl.metadata (22 kB)
Collecting jaxlib<=0.4.30,>=0.4.27 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached jaxlib-0.4.30-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting scipy>=1.9 (from jax>=0.3.15->tensorflow==2.12.0)
Using cached scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting google-auth<3,>=1.6.3 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached google_auth-2.38.0-py2.py3-none-any.whl.metadata (4.8 kB)
Collecting google-auth-oauthlib<1.1,>=0.5 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached google_auth_oauthlib-1.0.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting markdown>=2.6.8 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached Markdown-3.7-py3-none-any.whl.metadata (7.0 kB)
Collecting requests<3,>=2.21.0 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl.metadata (1.1 kB)
Collecting werkzeug>=1.0.1 (from tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached werkzeug-3.1.3-py3-none-any.whl.metadata (3.7 kB)
Collecting cachetools<6.0,>=2.0.0 (from google-auth<3,>=1.6.3->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached cachetools-5.5.1-py3-none-any.whl.metadata (5.4 kB)
Collecting pyasn1-modules>=0.2.1 (from google-auth<3,>=1.6.3->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached pyasn1_modules-0.4.1-py3-none-any.whl.metadata (3.5 kB)
Collecting rsa<5,>=3.1.4 (from google-auth<3,>=1.6.3->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached rsa-4.9-py3-none-any.whl.metadata (4.2 kB)
Collecting requests-oauthlib>=0.7.0 (from google-auth-oauthlib<1.1,>=0.5->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached requests_oauthlib-2.0.0-py2.py3-none-any.whl.metadata (11 kB)
Collecting charset-normalizer<4,>=2 (from requests<3,>=2.21.0->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)
Collecting idna<4,>=2.5 (from requests<3,>=2.21.0->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting urllib3<3,>=1.21.1 (from requests<3,>=2.21.0->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi>=2017.4.17 (from requests<3,>=2.21.0->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB)
Collecting MarkupSafe>=2.1.1 (from werkzeug>=1.0.1->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting pyasn1<0.7.0,>=0.4.6 (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard<2.13,>=2.12->tensorflow==2.12.0)
Using cached oauthlib-3.2.2-py3-none-any.whl.metadata (7.5 kB)
Using cached tensorflow-2.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (585.9 MB)
Using cached absl_py-2.1.0-py3-none-any.whl (133 kB)
Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Using cached flatbuffers-25.1.21-py2.py3-none-any.whl (30 kB)
Using cached gast-0.4.0-py3-none-any.whl (9.8 kB)
Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Using cached grpcio-1.69.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Using cached h5py-3.12.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)
Using cached jax-0.4.30-py3-none-any.whl (2.0 MB)
Using cached keras-2.12.0-py2.py3-none-any.whl (1.7 MB)
Using cached libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl (24.5 MB)
Using cached numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Using cached opt_einsum-3.4.0-py3-none-any.whl (71 kB)
Using cached protobuf-4.25.5-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Using cached six-1.17.0-py2.py3-none-any.whl (11 kB)
Using cached tensorboard-2.12.3-py3-none-any.whl (5.6 MB)
Using cached tensorflow_estimator-2.12.0-py2.py3-none-any.whl (440 kB)
Using cached tensorflow_io_gcs_filesystem-0.37.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
Using cached termcolor-2.5.0-py3-none-any.whl (7.8 kB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Using cached wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77 kB)
Using cached packaging-24.2-py3-none-any.whl (65 kB)
Using cached google_auth-2.38.0-py2.py3-none-any.whl (210 kB)
Using cached google_auth_oauthlib-1.0.0-py2.py3-none-any.whl (18 kB)
Using cached jaxlib-0.4.30-cp310-cp310-manylinux2014_x86_64.whl (79.6 MB)
Using cached Markdown-3.7-py3-none-any.whl (106 kB)
Using cached ml_dtypes-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Using cached scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (40.6 MB)
Using cached tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl (6.6 MB)
Using cached werkzeug-3.1.3-py3-none-any.whl (224 kB)
Using cached wheel-0.45.1-py3-none-any.whl (72 kB)
Using cached cachetools-5.5.1-py3-none-any.whl (9.5 kB)
Using cached certifi-2024.12.14-py3-none-any.whl (164 kB)
Using cached charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (146 kB)
Using cached idna-3.10-py3-none-any.whl (70 kB)
Using cached MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20 kB)
Using cached pyasn1_modules-0.4.1-py3-none-any.whl (181 kB)
Using cached requests_oauthlib-2.0.0-py2.py3-none-any.whl (24 kB)
Using cached rsa-4.9-py3-none-any.whl (34 kB)
Using cached urllib3-2.3.0-py3-none-any.whl (128 kB)
Using cached oauthlib-3.2.2-py3-none-any.whl (151 kB)
Using cached pyasn1-0.6.1-py3-none-any.whl (83 kB)
Installing collected packages: libclang, flatbuffers, wrapt, wheel, urllib3, typing-extensions, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, six, pyasn1, protobuf, packaging, opt-einsum, oauthlib, numpy, MarkupSafe, markdown, keras, idna, grpcio, gast, charset-normalizer, certifi, cachetools, absl-py, werkzeug, scipy, rsa, requests, pyasn1-modules, ml_dtypes, h5py, google-pasta, astunparse, requests-oauthlib, jaxlib, google-auth, jax, google-auth-oauthlib, tensorboard, tensorflow
Successfully installed MarkupSafe-3.0.2 absl-py-2.1.0 astunparse-1.6.3 cachetools-5.5.1 certifi-2024.12.14 charset-normalizer-3.4.1 flatbuffers-25.1.21 gast-0.4.0 google-auth-2.38.0 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.69.0 h5py-3.12.1 idna-3.10 jax-0.4.30 jaxlib-0.4.30 keras-2.12.0 libclang-18.1.1 markdown-3.7 ml_dtypes-0.5.1 numpy-1.23.5 oauthlib-3.2.2 opt-einsum-3.4.0 packaging-24.2 protobuf-4.25.5 pyasn1-0.6.1 pyasn1-modules-0.4.1 requests-2.32.3 requests-oauthlib-2.0.0 rsa-4.9 scipy-1.15.1 six-1.17.0 tensorboard-2.12.3 tensorboard-data-server-0.7.2 tensorflow-2.12.0 tensorflow-estimator-2.12.0 tensorflow-io-gcs-filesystem-0.37.1 termcolor-2.5.0 typing-extensions-4.12.2 urllib3-2.3.0 werkzeug-3.1.3 wheel-0.45.1 wrapt-1.14.1
Collecting tensorflow_datasets
Using cached tensorflow_datasets-4.9.7-py3-none-any.whl.metadata (9.6 kB)
Requirement already satisfied: absl-py in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (2.1.0)
Collecting click (from tensorflow_datasets)
Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting dm-tree (from tensorflow_datasets)
Using cached dm_tree-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting immutabledict (from tensorflow_datasets)
Using cached immutabledict-4.2.1-py3-none-any.whl.metadata (3.5 kB)
Requirement already satisfied: numpy in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (1.23.5)
Collecting promise (from tensorflow_datasets)
Using cached promise-2.3-py3-none-any.whl
Requirement already satisfied: protobuf>=3.20 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (4.25.5)
Collecting psutil (from tensorflow_datasets)
Using cached psutil-6.1.1-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting pyarrow (from tensorflow_datasets)
Using cached pyarrow-19.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Requirement already satisfied: requests>=2.19.0 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (2.32.3)
Collecting simple-parsing (from tensorflow_datasets)
Using cached simple_parsing-0.1.7-py3-none-any.whl.metadata (7.3 kB)
Collecting tensorflow-metadata (from tensorflow_datasets)
Using cached tensorflow_metadata-1.16.1-py3-none-any.whl.metadata (2.4 kB)
Requirement already satisfied: termcolor in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (2.5.0)
Collecting toml (from tensorflow_datasets)
Using cached toml-0.10.2-py2.py3-none-any.whl.metadata (7.1 kB)
Collecting tqdm (from tensorflow_datasets)
Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Requirement already satisfied: wrapt in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from tensorflow_datasets) (1.14.1)
Collecting array-record>=0.5.0 (from tensorflow_datasets)
Using cached array_record-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (692 bytes)
Collecting etils>=1.6.0 (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < "3.11"->tensorflow_datasets)
Using cached etils-1.11.0-py3-none-any.whl.metadata (6.5 kB)
Collecting fsspec (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < "3.11"->tensorflow_datasets)
Using cached fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting importlib_resources (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < "3.11"->tensorflow_datasets)
Using cached importlib_resources-6.5.2-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: typing_extensions in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < "3.11"->tensorflow_datasets) (4.12.2)
Collecting zipp (from etils[edc,enp,epath,epy,etree]>=1.6.0; python_version < "3.11"->tensorflow_datasets)
Using cached zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB)
Requirement already satisfied: charset-normalizer<4,>=2 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from requests>=2.19.0->tensorflow_datasets) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from requests>=2.19.0->tensorflow_datasets) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from requests>=2.19.0->tensorflow_datasets) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from requests>=2.19.0->tensorflow_datasets) (2024.12.14)
Requirement already satisfied: six in /mnt/beegfs/home/trygve-e/.venv/mcpy310_2/lib/python3.10/site-packages (from promise->tensorflow_datasets) (1.17.0)
Collecting docstring-parser<1.0,>=0.15 (from simple-parsing->tensorflow_datasets)
Using cached docstring_parser-0.16-py3-none-any.whl.metadata (3.0 kB)
Collecting protobuf>=3.20 (from tensorflow_datasets)
Using cached protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (679 bytes)
Using cached tensorflow_datasets-4.9.7-py3-none-any.whl (5.3 MB)
Using cached array_record-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Using cached etils-1.11.0-py3-none-any.whl (165 kB)
Using cached click-8.1.8-py3-none-any.whl (98 kB)
Using cached dm_tree-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (152 kB)
Using cached immutabledict-4.2.1-py3-none-any.whl (4.7 kB)
Using cached psutil-6.1.1-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287 kB)
Using cached pyarrow-19.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.1 MB)
Using cached simple_parsing-0.1.7-py3-none-any.whl (112 kB)
Using cached tensorflow_metadata-1.16.1-py3-none-any.whl (28 kB)
Using cached protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Using cached docstring_parser-0.16-py3-none-any.whl (36 kB)
Using cached fsspec-2024.12.0-py3-none-any.whl (183 kB)
Using cached importlib_resources-6.5.2-py3-none-any.whl (37 kB)
Using cached zipp-3.21.0-py3-none-any.whl (9.6 kB)
Installing collected packages: dm-tree, zipp, tqdm, toml, pyarrow, psutil, protobuf, promise, importlib_resources, immutabledict, fsspec, etils, docstring-parser, click, tensorflow-metadata, simple-parsing, array-record, tensorflow_datasets
Attempting uninstall: protobuf
Found existing installation: protobuf 4.25.5
Uninstalling protobuf-4.25.5:
Successfully uninstalled protobuf-4.25.5
Successfully installed array-record-0.6.0 click-8.1.8 dm-tree-0.1.8 docstring-parser-0.16 etils-1.11.0 fsspec-2024.12.0 immutabledict-4.2.1 importlib_resources-6.5.2 promise-2.3 protobuf-3.20.3 psutil-6.1.1 pyarrow-19.0.0 simple-parsing-0.1.7 tensorflow-metadata-1.16.1 tensorflow_datasets-4.9.7 toml-0.10.2 tqdm-4.67.1 zipp-3.21.0
deactivate
Next, we test whether we are able to connect to the GPUs:
source ~/bhome/.venv/mnist_demo/bin/activate
uenv verbose cuda-11.8.0 cudnn-11.x-8.6.0
uenv verbose TensorRT-11.x-8.6-8.5.3.1
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
This lists all the GPUs on gorina6. Since we have only reserved GPU #4, the following setup hides all other GPUs; this is the setup we will use:
source ~/bhome/.venv/mnist_demo/bin/activate
uenv verbose cuda-11.8.0 cudnn-11.x-8.6.0
uenv verbose TensorRT-11.x-8.6-8.5.3.1
export CUDA_VISIBLE_DEVICES="4"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
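The same masking can be done from inside Python, provided the variable is set before tensorflow is imported. A sketch (the import is left commented out so the snippet also runs without a GPU):

```python
import os

# Hide all GPUs except #4; this must happen before `import tensorflow`,
# since tensorflow enumerates devices at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = "4"

# import tensorflow as tf  # would now see only GPU 4, exposed as GPU:0
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 4
```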
5.3.3. Running the MNIST demo on gorina6
source ~/bhome/.venv/mnist_demo/bin/activate
uenv verbose cuda-11.8.0 cudnn-11.x-8.6.0
uenv verbose TensorRT-11.x-8.6-8.5.3.1
export CUDA_VISIBLE_DEVICES="4"
python -m tf_mnist
5.3.4. Running the MNIST demo on the SLURM cluster
ssh go11
cd ~/bhome/MNIST_demo
cat tf_mnist.sh
sbatch tf_mnist.sh
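The contents of `tf_mnist.sh` are not shown here. A job script along the following lines would fit; the `#SBATCH` options are illustrative assumptions, so check the local SLURM documentation for the correct partition and resource names:

```shell
#!/bin/bash
#SBATCH --job-name=tf_mnist
#SBATCH --gres=gpu:1              # request one GPU (illustrative; site-specific)
#SBATCH --output=tf_mnist-%j.out

# Load the CUDA/cuDNN/TensorRT libraries and the virtual environment,
# then run the demo, mirroring the interactive session on gorina6.
uenv verbose cuda-11.8.0 cudnn-11.x-8.6.0
uenv verbose TensorRT-11.x-8.6-8.5.3.1
source ~/bhome/.venv/mnist_demo/bin/activate
python -m tf_mnist
```

Note that on the cluster there is no `export CUDA_VISIBLE_DEVICES`; SLURM assigns the requested GPU to the job.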
5.4. What about PyTorch?
For an example using PyTorch, see the [MNIST demonstration](https://gitlab.ux.uis.no/unix/gpu).