Vista Quickstart: AI and ML with GPUs
AI and ML applications often require powerful GPUs to train models and run inference on scientific data. On Vista, you may request to run jobs on the gh or gh-dev nodes, each of which contains a single NVIDIA H200 GPU. In this quickstart, we will walk through setting up a virtual environment and installing common AI/ML Python libraries, such as PyTorch and TensorFlow.
Set up a Python virtual environment (venv) to manage Python packages and their dependencies. Ideally, use a separate environment for each workflow, though this is not strictly required.
If you just want to use PyTorch without setting up a virtual environment, TACC provides minimal instructions at Running PyTorch (Single Node). However, it's generally useful to set up a virtual environment for additional Python packages.
On Vista, the default Python installation at /usr/bin/python has reached end-of-life, so we recommend loading a newer Python version via a module. Note that the python3/3.11.8 module requires a compatible gcc module, as reported by module spider python3/3.11.8.
We will also load cuda from a module. We recommend cuda/12.8, the latest CUDA version supported by the current GPU driver.
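A typical sequence looks like the following (the specific gcc version to load is whatever module spider python3/3.11.8 reports as compatible; the bare gcc load here is an assumption):

```shell
module load gcc            # a gcc compatible with python3/3.11.8
module load python3/3.11.8
module load cuda/12.8
python3 --version          # confirm the module's python3 is on PATH
```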
Choose a location and create a virtual environment:
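For example, under $SCRATCH (the python-envs/ml directory name is illustrative):

```shell
cd $SCRATCH
python3 -m venv python-envs/ml
```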
TACC recommends using the $SCRATCH file system for all parallel and high-intensity file operations, including those involving Python virtual environments. Be aware that files in $SCRATCH may be purged if they have not been accessed in more than 10 days: see TACC's Scratch File System Purge Policy.
Activate the environment you created:
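Assuming an environment created at the illustrative path $SCRATCH/python-envs/ml:

```shell
source $SCRATCH/python-envs/ml/bin/activate
```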
If the virtual environment is created successfully, you will see:
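Your shell prompt gains a prefix with the environment's name; for an environment named ml, it looks something like (hostname illustrative):

```shell
(ml) login1$
```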
This means the virtual environment is active, and all Python packages will now be installed into it. To deactivate it, run deactivate. Note that you must load modules before activating the environment; loading or unloading modules after activation overrides the default python or python3 path, which can lead to errors.
As a general recommendation, upgrade the Python package installer (pip) immediately after activating the environment:
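```shell
pip install --upgrade pip
```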
For the following package instructions, assume a new virtual environment is used for each installation.
To install PyTorch (https://pytorch.org), enter the following command:
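The command below installs the default CUDA-enabled wheels; consult the install selector on pytorch.org for the variant matching your CUDA version (the torchvision and torchaudio packages are optional extras):

```shell
pip install torch torchvision torchaudio
```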
To verify the installation, open a Python interactive session, then enter:
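```python
import torch

print(torch.__version__)
```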
If there are no errors in the import, this installation is successful. If you're on a GPU node and want to check if a GPU device is available to PyTorch:
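```python
import torch

print(torch.cuda.is_available())          # True on a GPU node
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the attached GPU
```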
TACC provides a similar set of instructions and additional examples for multi-node jobs at Machine Learning on Vista.
TensorFlow is currently tested only for CPU support on Vista. To install TensorFlow (https://www.tensorflow.org/install), enter the following command:
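```shell
pip install tensorflow
```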
To verify the installation, open a Python interactive session, then enter:
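```python
import tensorflow as tf

print(tf.__version__)
```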
If there are no errors in the import, this installation is successful.
To install JAX (https://jax.readthedocs.io/en/latest/installation.html), enter the following command:
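Per the JAX installation guide, the cuda12 extra pulls in CUDA-enabled wheels:

```shell
pip install --upgrade "jax[cuda12]"
```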
Due to a conflict with the existing CUDA libraries, we need to unload cuda:
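```shell
module unload cuda
```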
To verify the package, open a Python interactive session with the command python, then enter:
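```python
import jax

print(jax.__version__)
```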
If there are no errors in the import, this installation is successful. If you're on a GPU node and want to check if a GPU device is available to JAX:
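```python
import jax

print(jax.devices())  # CUDA devices on a GPU node, CPU devices otherwise
```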
To install CuPy (https://docs.cupy.dev/en/stable/install.html), enter the following command:
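CuPy publishes wheels per CUDA major version; for CUDA 12.x:

```shell
pip install cupy-cuda12x
```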
To verify the installation, open a Python interactive session, then enter:
If there are no errors in the import, this installation is successful. Unlike the other packages, CuPy requires a CUDA driver (in other words, a GPU) to be present to use the package. If you try to create an array with CuPy while not on a GPU node, you will see an error:
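A sketch of what happens (the exact error text depends on the driver):

```python
import cupy

try:
    x = cupy.array([1, 2, 3])
    print(x)  # succeeds only on a GPU node
except cupy.cuda.runtime.CUDARuntimeError as e:
    # On a node with no GPU, something like:
    # "cudaErrorNoDevice: no CUDA-capable device is detected"
    print(e)
```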
On a GPU node, check if GPU devices are available to CuPy:
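```python
import cupy

print(cupy.cuda.runtime.getDeviceCount())  # number of visible GPUs
```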
To install Keras (https://keras.io/getting_started/), first install PyTorch, TensorFlow, or JAX from the previous section. Then enter the following command:
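```shell
pip install keras
```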
Open a Python interactive session to verify the installation. You must specify the backend before importing Keras:
If there are no errors in the import, this installation is successful. If you're on a GPU node and want to check if a GPU device is available to Keras:
To install Transformers from Hugging Face (https://huggingface.co/docs/transformers/en/installation), first install PyTorch from the previous section. Then enter the following command:
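```shell
pip install transformers
```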
To verify the installation, open a Python interactive session, then enter:
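```python
import transformers

print(transformers.__version__)
```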
If there are no errors in the import, this installation is successful. If you're on a GPU node and want to check if a GPU device is available to Transformers, check by importing PyTorch:
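```python
import torch

print(torch.cuda.is_available())  # True on a GPU node
```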
TACC provides a similar set of instructions and additional examples for multi-node jobs at Setting Up Transformers with Accelerate.
To install LangChain (https://python.langchain.com/docs/get_started/installation), enter the following command:
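```shell
pip install langchain
```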
To verify the installation, open a Python interactive session, then enter:
If there are no errors in the import, this installation is successful.
If you intend to run local LLMs, PyTorch may be required. Install PyTorch as instructed above and check if a GPU device is available to PyTorch:
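```python
import torch

print(torch.cuda.is_available())  # True means local LLMs can run on the GPU
```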