aihwkit using Docker
Berkehan Ercan, July 6, 2023
Installing the CUDA-enabled version of aihwkit is not easy, as it requires the user to build the project from source. The advanced installation guide on the Read the Docs page of aihwkit documents various ways of compiling, building, and installing the package. Having tried all of the methods listed, I found that the CUDA-enabled Docker image is the easiest and most reliable way to install aihwkit. The regular installation path seems to be outdated, as newer GCC and G++ versions (11.x.x) fail to compile the source code. To compile, one would have to downgrade the system's main compilers, which is a bad idea on a UNIX-based OS. I tried the downgrade in both a conda virtual environment and a Python virtual environment, and both failed to compile the source.
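To see whether your toolchain falls into the affected range before committing to the Docker route, you can check the compiler versions on your host; this is a minimal check, assuming gcc and g++ are on your PATH:
# 11.x.x is the version range reported above as failing to compile aihwkit
gcc --version
g++ --version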
The Dockerfile provided in the repository also didn't work, due to a few issues. Specifically, the file was still using the deprecated --install-option flag with pip, which causes errors with newer versions of pip. However, I was able to modify CUDA.Dockerfile: I fixed the read-write permissions and switched to the setup.py installation method. Even though this method is discouraged in general, it worked. It's worth noting that these changes might be temporary and that future versions of the Dockerfile should conform to current best practices.
CUDA.Dockerfile
# Build arguments
ARG CUDA_VER=11.6.0
ARG UBUNTU_VER=22.04
# Download the base image
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu${UBUNTU_VER}
# you can check for all available images at https://hub.docker.com/r/nvidia/cuda/tags
# Install as root
USER root
# Install dependencies
RUN apt-get update && \
DEBIAN_FRONTEND="noninteractive" apt-get install --yes \
--no-install-recommends \
bash \
bash-completion \
cmake \
curl \
git \
libopenblas-dev \
linux-headers-$(uname -r) \
nano \
python3 python3-dev python3-pip python-is-python3 \
sudo \
wget && \
apt-get autoremove -y && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Add a user `${USERNAME}` so that you're not developing as the `root` user
ARG USERNAME=ibm
ARG USERID=1000
ARG GROUPID=1000
RUN groupadd -g ${GROUPID} ${USERNAME} && \
useradd ${USERNAME} \
--create-home \
--uid ${USERID} \
--gid ${GROUPID} \
--shell=/bin/bash && \
echo "${USERNAME} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/nopasswd
# Change to your user
USER ${USERNAME}
WORKDIR /home/${USERNAME}
ARG PYTORCH_PIP_URL=https://download.pytorch.org/whl/cu117
# Install python packages as your user
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir pybind11 scikit-build "protobuf>=4.21.6" && \
pip install --no-cache-dir torch torchvision torchaudio --extra-index-url ${PYTORCH_PIP_URL} && \
# Set path of python packages
echo 'export PATH=$HOME/.local/bin:$PATH' >> /home/${USERNAME}/.bashrc
# Copy the source code inside to image and change to the source directory
COPY . ./aihwkit
WORKDIR /home/${USERNAME}/aihwkit
# Default value is for the NVIDIA RTX A5000; find your own GPU model's compute capability and replace it
# use the link: https://developer.nvidia.com/cuda-gpus
ARG CUDA_ARCH=86
ENV USE_CUDA=ON
ENV RPU_BLAS=OpenBLAS
ENV RPU_CUDA_ARCHITECTURES=${CUDA_ARCH}
RUN sudo chmod -R 777 /home/${USERNAME}/aihwkit
RUN echo "Detected CUDA_ARCHITECTURE is = ${CUDA_ARCH}"
# Build and install IBM aihwkit
RUN python3 setup.py install --user
I have installed the library on Linux Mint 21.1 with two RTX 3090s, compute capability 86. Check your GPU's compute capability number using the nvidia-smi command. This number is crucial to the installation, as we will pass it as a build option in the Dockerfile. Make sure that the NVIDIA drivers on your system are up to date. I have CUDA 11.6 installed; check your CUDA version with the nvcc --version command. If you have a different CUDA installation on your system that you want to use, run which nvcc to find which installation you are currently using, point your PATH at the desired installation in your .bashrc file, then source the .bashrc file and verify again with nvcc --version and which nvcc. Beware that the CUDA Version stated in the nvidia-smi output is the newest version of CUDA that you can install on your system, not the version currently installed!
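On reasonably recent drivers, nvidia-smi can also print the compute capability directly; this is a small sketch, assuming your driver supports the compute_cap query field:
# Query GPU name and compute capability (e.g. 8.6 becomes CUDA_ARCH=86)
nvidia-smi --query-gpu=name,compute_cap --format=csv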
There are user-specific modifications required for the Dockerfile:
- CUDA_VER: change this to your local machine's CUDA version.
- CUDA_ARCH: change to your GPU's compute capability number.
- RPU_CUDA_ARCHITECTURES: change to your GPU's compute capability number; in the Dockerfile above it follows the CUDA_ARCH build argument.
- RPU_BLAS=OpenBLAS: leave as default, since OpenBLAS is installed as a dependency; it can be changed to IntelMKL if the dependencies are configured correctly.
- sudo chmod -R 777 /home/${USERNAME}/aihwkit: with the USERNAME build argument set as in the build command below, this path resolves to the aihwkit source copied into the image, so it should not need changing.
- PYTORCH_PIP_URL=https://download.pytorch.org/whl/cu117: the version can be changed after the build, but that should not be necessary, as cu117 is backwards compatible with CUDA 11.6.
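The build is executed from inside the aihwkit source tree, since the Dockerfile copies the current directory into the image (COPY . ./aihwkit). If you have not cloned the repository yet, a minimal setup looks like this (the URL is the official IBM repository):
# Clone the aihwkit sources and change into the directory
git clone https://github.com/IBM/aihwkit.git
cd aihwkit
We can now start the build by executing the command: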
sudo docker build --tag aihwkit:cuda \
--build-arg USERNAME=${USER} \
--build-arg USERID=$(id -u $USER) \
--build-arg GROUPID=$(id -g $USER) \
--no-cache \
--file CUDA.Dockerfile .
The build takes a good 1-2 hours, depending on your system specifications.
The build should exit successfully with:
Using /home/berkehan/.local/lib/python3.10/site-packages
Searching for cmake==3.26.4
Best match: cmake 3.26.4
Adding cmake 3.26.4 to easy-install.pth file
Installing cmake script to /home/berkehan/.local/bin
Installing cpack script to /home/berkehan/.local/bin
Installing ctest script to /home/berkehan/.local/bin
Using /home/berkehan/.local/lib/python3.10/site-packages
Finished processing dependencies for aihwkit==0.7.1
Removing intermediate container 09d1c338b637
---> 72722cd0a4cc
Successfully built 72722cd0a4cc
Successfully tagged aihwkit:cuda
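To confirm the image is now available locally, you can list it (docker image ls is the standard Docker CLI listing command):
# The freshly built image should appear with the cuda tag
sudo docker image ls aihwkit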
We can now start a container from the image we have built using the command:
sudo docker run -it aihwkit:cuda /bin/bash
We will be greeted with a warning from CUDA:
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
The problem is that the container we have created cannot access the GPUs on our system, because Docker containers are isolated from the host's devices by default. To allow the container to execute instructions on the GPUs, we have to install the NVIDIA Container Toolkit.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
If you are using Linux Mint as well, you have to set the distribution variable yourself, as there is no package list specifically for Mint. Because Mint is essentially Ubuntu, export the distribution variable as ubuntu18.04 and remove the distribution=$(...) line from the command. Therefore:
export distribution=ubuntu18.04 && \
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
After setting the repository up, we can install the toolkit and configure the Docker runtime:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
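Before rerunning our image, it is worth checking that containers can now see the GPUs at all. A common smoke test is to run nvidia-smi inside the container; the binary is injected by the Container Toolkit from the host driver:
# If this prints the same GPU table as on the host, the toolkit works
sudo docker run --rm --gpus all aihwkit:cuda nvidia-smi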
Now we can run the Docker image again, but with the extra option:
sudo docker run --gpus all -it aihwkit:cuda
The --gpus all option tells Docker to expose all of the GPUs on the system to the container. If you want to use only a single GPU, replace all with device=<ID>, where the ID is the one reported by nvidia-smi (such as 0, 1, etc.).
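For example, to attach only the first GPU (the device ID 0 here is illustrative):
# Start the container with a single GPU attached
sudo docker run --gpus device=0 -it aihwkit:cuda /bin/bash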
Now we should see:
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
We can test whether both aihwkit and PyTorch are installed correctly and can utilize CUDA:
sudo docker run -it --gpus all aihwkit:cuda /bin/bash
python3
Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> from aihwkit.simulator.rpu_base import cuda
>>> cuda.is_compiled()
True
>>>
If both calls return True, the installation is successful and aihwkit can utilize the CUDA cores.
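As a final check, you can push a tiny analog layer through the GPU. This is a minimal sketch using aihwkit's AnalogLinear layer; the layer sizes and input values are arbitrary:
# Minimal forward pass through an analog layer on the GPU
import torch
from aihwkit.nn import AnalogLinear

model = AnalogLinear(4, 2)   # analog fully connected layer: 4 inputs, 2 outputs
model = model.cuda()         # move the analog tiles to the GPU
x = torch.rand(3, 4).cuda()  # a batch of 3 random input vectors
print(model(x))              # should print a 3x2 tensor without errors
If the forward pass prints a tensor without raising an error, the analog simulation is running on your GPU.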