TL;DR: Build a Docker image, mount your code as a volume, live a happy life.
As a CS PhD student doing machine learning research, I usually have multiple projects going at once. I do development locally, but run most training and test jobs on the compute cluster.
I need environments that are:
- Isolated: projects require different versions of libraries (e.g. PyTorch) or system setups (e.g. CUDA versions), so their environments must not conflict with each other.
- Consistent: the local and cluster environments need to match, with different dataset locations handled at the environment level, not the client code level.
- Easy to change: these environments are regularly subject to change, so a package needs to be easy to add or update without leaving the environment in a broken state (e.g. due to a bad uninstall routine), and if the environment does break, it needs to be easy to roll back to a working one.
- Easy to synchronize: once changed, an environment needs to be easy to propagate to other systems without a painful rebuild process or the possibility of one system being left in a broken state.
- Scrutable: the descriptions of how environments were constructed need to be human readable and tracked by source control.
For these reasons, I do all of my development work inside Docker containers.
Docker is a system for building and running containers. Each container holds a full system image: you start with a base system image (e.g. an Ubuntu image with CUDA and OpenGL preinstalled), and then you modify it via a series of bash commands laid out in a Dockerfile. Once the container image is built, you can interactively run programs (such as your code) inside it using the installed libraries. Your code and other folders, such as data folders, can be dynamically mounted into the container, allowing for live editing of your code from either inside the container or outside on the base system (e.g. from an open text editor). Built images can be uploaded to DockerHub, from where they can be pulled down to another machine (e.g. the compute cluster), ensuring binary identical environments.
The following steps assume that you have Docker installed on your base system with NVidia GPU support, and that your base system has an NVidia GPU with a driver that supports a CUDA version at least as recent as your targeted CUDA version. Note: driver CUDA support is not the same as having CUDA installed on your base system. You can check the version of CUDA your driver supports with the output of nvidia-smi, e.g.
$ nvidia-smi
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
means the base system supports CUDA versions 12.0 and earlier (including CUDA 11.X and 10.X).
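The check described above can be scripted. The following is a minimal sketch, not part of the original setup: the function names parse_driver_cuda_version and driver_supports_cuda are hypothetical, and it assumes the nvidia-smi header contains a "CUDA Version: X.Y" field as in the example output, with versions given as MAJOR.MINOR.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Extract the "CUDA Version: X.Y" field from nvidia-smi header text.
# Reads the text on stdin, so it can be exercised without a GPU.
parse_driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds (exit status 0) if driver version $1 >= target version $2.
# Both arguments must be MAJOR.MINOR, e.g. 12.0 and 11.3.
driver_supports_cuda() {
    driver_major=${1%%.*}; driver_minor=${1#*.}
    target_major=${2%%.*}; target_minor=${2#*.}
    if [ "$driver_major" -ne "$target_major" ]; then
        [ "$driver_major" -gt "$target_major" ]
    else
        [ "$driver_minor" -ge "$target_minor" ]
    fi
}

# Usage on a machine with the NVidia driver installed:
#   driver=$(nvidia-smi | parse_driver_cuda_version)
#   driver_supports_cuda "$driver" 11.3 && echo "driver can run CUDA 11.3 images"
```

Comparing major and minor numbers separately avoids the trap of comparing versions as strings or floats, where 11.10 would sort below 11.3.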
Below is the Dockerfile I base my environments upon; it uses an NVidia provided base image with CUDA 11.3 and OpenGL on Ubuntu 20.04. The Dockerfile performs some system setup, then installs Miniconda (a minimal installer for the commonly used conda package manager), which it uses to install Python 3.10 and PyTorch 1.12.1, followed by Open3D 0.17 via pip.
FROM nvidia/cudagl:11.3.0-devel-ubuntu20.04
SHELL ["/bin/bash", "-c"]
# Set the timezone info because otherwise tzinfo blocks install
# flow and ignores the non-interactive frontend command 🤬🤬🤬
RUN ln -snf /usr/share/zoneinfo/America/New_York /etc/localtime && echo "/usr/share/zoneinfo/America/New_York" > /etc/timezone
# Core system packages
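# Update and install in a single layer so a stale cached package index is never paired with a changed install list
RUN apt update --fix-missing && apt install -y software-properties-common wget curl gpg gcc git make apt-utils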
# Install miniconda to /miniconda
RUN curl -LO
RUN bash -p /miniconda -b
RUN rm
ENV PATH=/miniconda/bin:${PATH}
# Set standard environment variables so any libraries with CUDA support build with CUDA
# support for all the common NVidia architectures
ENV TORCH_CUDA_ARCH_LIST="Ampere;Turing;Pascal"
RUN conda update -y conda
RUN conda install numpy python=3.10 pytorch==1.12.1 torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
RUN pip install open3d==0.17
# Add the project to the PYTHONPATH so that we can import modules from it
ENV PYTHONPATH=/project
# Add the modified bashrc to the container so we get a nice prompt and persistent history
COPY bashrc /root/.bashrc
# Set the working directory to the project directory, where we will mount the repo
WORKDIR /project
To build this Dockerfile inside the example project folder, run:
docker build docker/ -t research_dev_env
This command uses the docker/ subfolder as a build context, meaning the build process has access to all files inside the folder (e.g. to COPY into the image, in the case of bashrc), and defaults to using the file named Dockerfile inside this context. The built image is tagged as research_dev_env with the -t flag.
To make the container distributable to any system, I created an example repo under my account on DockerHub. Note that the commands below are for my account, and require you to have authenticated your DockerHub account (i.e. via docker login).
To tag the locally built image with the remote repo:
docker image tag research_dev_env kylevedder/research_dev_env
To push:
docker image push kylevedder/research_dev_env
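The tag and push steps can be wrapped in one helper so the remote name is only composed in one place. This is a rough sketch under assumptions: remote_ref, publish_image, and the DRY_RUN variable are hypothetical names introduced here, and the default tag of latest mirrors what Docker applies when no tag is given.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Compose the remote repository reference for a local image.
# $1 = DockerHub account, $2 = local image name, $3 = optional tag (default: latest)
remote_ref() {
    echo "$1/$2:${3:-latest}"
}

# Tag the local image with the remote reference, then push it.
# Set DRY_RUN=1 to print the docker commands instead of running them.
publish_image() {
    ref=$(remote_ref "$1" "$2" "$3")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker image tag $2 $ref"
        echo "docker image push $ref"
    else
        docker image tag "$2" "$ref" && docker image push "$ref"
    fi
}

# Usage (requires a prior docker login):
#   publish_image kylevedder research_dev_env
```

The dry-run mode is useful for checking the composed name before anything is uploaded to a public repo.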
I use the following bash script to run an interactive session in the container (kylevedder/research_dev_env:latest):
xhost +
touch `pwd`/docker_history.txt
docker run --gpus=all --rm -it \
-v `pwd`:/project \
-v `pwd`/docker_history.txt:/root/.bash_history \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--privileged \
kylevedder/research_dev_env:latest
This script enables everything we want inside the container, including interactive sessions (-it), access to all system GPUs (--gpus=all), mounting our codebase as /project inside the container, setting the persistent bash history file, and preparing X11 for headed viewing (a shockingly non-trivial endeavor to set up).
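One way to keep dataset locations at the environment level on the local machine too is to make the dataset mount an optional argument of the launch script. The sketch below is an illustration, not the script I actually use: launch_dev_container and DRY_RUN are hypothetical names, the /dataset mount point is borrowed from the cluster example, and passing -e DISPLAY (which forwards the host's DISPLAY value) is an assumption about what X11 clients inside the container need.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# $1 = image, $2 = project directory, $3 = optional dataset directory (mounted at /dataset)
# Set DRY_RUN=1 to print the docker arguments, one per line, instead of running docker.
launch_dev_container() {
    image=$1; project_dir=$2; dataset_dir=${3:-}
    # Build the argument list in the positional parameters so paths with spaces survive.
    set -- run --gpus=all --rm -it \
        -v "$project_dir:/project" \
        -v "$project_dir/docker_history.txt:/root/.bash_history" \
        -v /tmp/.X11-unix:/tmp/.X11-unix \
        -e DISPLAY \
        --privileged
    if [ -n "$dataset_dir" ]; then
        set -- "$@" -v "$dataset_dir:/dataset"
    fi
    set -- "$@" "$image"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$@"
    else
        # The history file must exist, otherwise docker creates a directory in its place.
        touch "$project_dir/docker_history.txt"
        docker "$@"
    fi
}

# Usage:
#   xhost +
#   launch_dev_container kylevedder/research_dev_env:latest "$(pwd)" /path/to/my_dataset
```

With this, client code can always read from /dataset, and only the launch command knows where the data lives on each machine.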
Upon launch, you’re given a bash session located in /project that lets you run your code. The demo program maps a random unit sphere point cloud through a randomly initialized two-layer MLP, producing a mapped sphere; the results are interactively visualized in 3D with Open3D, with the input shown in red and the result shown in blue.
Docker requires root access to run and makes changes on the mounted file system as root. This is fine for your local development machine, but is dangerous in a shared cluster.
Penn’s SLURM cluster supports running jobs inside containers by converting them from traditional container/OS images into unprivileged sandboxes (using NVidia’s enroot) and then running those sandboxes (using NVidia’s pyxis).
Importing a Docker image with enroot creates a .sqsh file, a self-contained sandbox image of your Docker container. To do this on a SLURM cluster using srun, run:
srun enroot import docker://kylevedder/research_dev_env:latest
which will produce a kylevedder+research_dev_env+latest.sqsh file in the directory the srun command was launched in.
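Since job scripts need the path of the imported file, it can help to derive the file name from the image URI. The sketch below assumes the naming pattern seen in this example (the docker:// prefix dropped, then / and : replaced with +); URIs that include a registry may be named differently, and sqsh_name and import_if_missing are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Derive the .sqsh file name for a docker:// URI of the form docker://USER/IMAGE:TAG.
# Assumption: the prefix is dropped and "/" and ":" become "+".
sqsh_name() {
    echo "$1" | sed -e 's|^docker://||' -e 's|[/:]|+|g' -e 's|$|.sqsh|'
}

# Import the image only if the expected .sqsh file is not already in the current directory.
import_if_missing() {
    sqsh_file=$(sqsh_name "$1")
    if [ -f "$sqsh_file" ]; then
        echo "$sqsh_file already exists, skipping import"
    else
        srun enroot import "$1"
    fi
}

# Usage on the cluster:
#   import_if_missing docker://kylevedder/research_dev_env:latest
```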
The pyxis plugin allows for .sqsh files to be used as containers. An example srun job using this file:
srun --container-image=/home/kvedder/kylevedder+research_dev_env+latest.sqsh --container-mounts=/home/kvedder/my_dataset:/dataset/,/home/kvedder/code/my_project:/project bash -c "python"
This will run python in the container using the codebase /home/kvedder/code/my_project mounted to /project, and have access to /home/kvedder/my_dataset mounted at /dataset. Further mounts can be added with the same comma separated syntax.
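As the mount list grows, the comma separated string gets hard to read on one line. A small sketch, with join_mounts as a hypothetical helper name, builds the string from one host:container pair per argument:

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).

# Join "host_path:container_path" pairs with commas, the form --container-mounts expects.
join_mounts() {
    result=""
    for pair in "$@"; do
        if [ -z "$result" ]; then
            result=$pair
        else
            result="$result,$pair"
        fi
    done
    echo "$result"
}

# Usage:
#   mounts=$(join_mounts \
#       /home/kvedder/my_dataset:/dataset/ \
#       /home/kvedder/code/my_project:/project)
#   srun --container-image=/home/kvedder/kylevedder+research_dev_env+latest.sqsh \
#       --container-mounts="$mounts" bash -c "python"
```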
Why not use conda environment files?
conda environment files are NOT fully reproducible; many packages have implicit dependencies on underlying system packages. Even if they have no underlying system dependencies, environments can still fail to be reproduced if a single package at the pinned version becomes unavailable (e.g. deprecation). Docker images are fully self-contained, meaning at any point in the future the environment can be pulled and it will be binary equivalent to the environment used to develop and test for your paper.
conda environments are also easy to break: uninstalling or upgrading a package can require a full environment rebuild, and when testing package versions for issues, there’s not a good way to go back to a “known good” state. Docker images provide known, deterministic state, along with a build cache that makes it quick to switch between versions of a Dockerfile.
conda environment files are also inscrutable, as they do not distinguish between packages intentionally installed versus implicitly installed. They are also difficult to keep synchronized between systems, as it requires either fully rebuilding the environment each time it’s updated, or risking desynchronization because of faulty package uninstall logic.
This isn’t a question, and I find editing a Dockerfile and rebuilding to be 100x less frustrating than futzing around with a conda environment during package version management. I have had to wipe and rebuild conda environments at least 50 times because something broke during an edit while trying to resolve multi-package compatibility issues. Docker’s ability to revert to a known good state has saved me more times than I can count.
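One possible way to make a known good state explicit, beyond reverting the Dockerfile in source control and rebuilding, is to give the working image an extra dated tag before experimenting. This is only a sketch of that idea: snapshot_tag and the good- prefix are hypothetical conventions introduced here, not part of the original workflow.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).

# Compose a dated "known good" tag for an image.
# $1 = image name, $2 = optional date stamp (defaults to today as YYYYMMDD)
snapshot_tag() {
    echo "$1:good-${2:-$(date +%Y%m%d)}"
}

# Usage:
#   # Before editing the Dockerfile, record the current working image:
#   docker image tag research_dev_env "$(snapshot_tag research_dev_env)"
#   # If the rebuilt image turns out to be broken, point the name back at the snapshot:
#   docker image tag "$(snapshot_tag research_dev_env 20230601)" research_dev_env
```

Tags are just names for existing image layers, so keeping a snapshot costs no rebuild and very little extra disk space while the layers are shared.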