Troubleshooting
TensorFlow, Keras, PyTorch, and other deep learning packages are complex software systems with many dependencies, and various problems can arise when either installing or using the software. Here we gather some troubleshooting tips, along with some informational material presented elsewhere in this roadmap.
This is a current list of topics addressed here, with links to sections below:
- Inability to build suitable Python virtual environment using pip
- Inability to load cudart dynamic library
- Using only 1 GPU on Frontera
- Problems importing h5py for hdf5 functionality
- ImportError: cannot import name 'OrderedDict' from 'typing' (/opt/apps/intel18/python3/3.7.0/lib/python3.7/typing.py)
- Errors regarding urllib3 version
Inability to build suitable Python virtual environment using pip
After you use python to create and activate a new virtual environment, you will want to use the pip package manager to install packages into that environment. The older pip version installed by default into your new environment won't necessarily be able to access the latest versions of the packages that you would like to install. Therefore, it is usually best to upgrade pip inside your environment before installing anything else, for example by running pip install --upgrade pip.
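The upgrade step can also be driven from Python itself, which makes it easy to confirm that the pip being upgraded belongs to the active environment. This is a minimal sketch rather than the roadmap's own procedure: the helper names in_virtualenv, pip_upgrade_command, and upgrade_pip are hypothetical, and the environment check assumes the environment was created with Python's built-in venv module.

```python
import subprocess
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def pip_upgrade_command() -> list:
    """Build the command that upgrades pip for the running interpreter."""
    # Using "python -m pip" ties the upgrade to this interpreter rather
    # than to whichever pip executable happens to come first on PATH.
    return [sys.executable, "-m", "pip", "install", "--upgrade", "pip"]


def upgrade_pip(dry_run: bool = True) -> list:
    """Upgrade pip, or only report the command when dry_run is True."""
    command = pip_upgrade_command()
    if not in_virtualenv():
        print("Warning: no virtual environment appears to be active.")
    if dry_run:
        print("Would run:", " ".join(command))
    else:
        subprocess.run(command, check=True)
    return command
```

Calling upgrade_pip() prints the command without running it; upgrade_pip(dry_run=False) performs the upgrade and raises an exception if pip exits with an error.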
Inability to load cudart dynamic library
With TensorFlow, you might see a long error message reporting that the cudart dynamic library could not be loaded.
This indicates that TensorFlow is unable to locate a suitable CUDA runtime (cudart) library. TensorFlow aims to link dynamically to the cudart library, and on TACC systems that library is managed by a cuda module provided by the Lmod system. You might need to load a cuda module, or load a different version of it, to resolve this error. Note that TensorFlow will still run if it is unable to detect the cudart library, but only in CPU-only mode. See the pages on CUDA and on TensorFlow on Frontera for more details.
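Before changing modules, it can help to see which cudart libraries the dynamic loader is likely to find. The following is a minimal diagnostic sketch, not a TACC-provided tool: the function names cudart_candidates and report_cudart are hypothetical, and it assumes that a loaded cuda module exposes its libraries through the LD_LIBRARY_PATH environment variable, which is common for Lmod modules but not confirmed here.

```python
import ctypes.util
import glob
import os


def cudart_candidates(ld_library_path: str) -> list:
    """List libcudart shared objects found in the directories of a search path."""
    matches = []
    for directory in ld_library_path.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing ':'
            matches.extend(glob.glob(os.path.join(directory, "libcudart.so*")))
    return sorted(matches)


def report_cudart() -> None:
    """Print the cudart libraries visible through LD_LIBRARY_PATH, if any."""
    found = cudart_candidates(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("cudart candidates on LD_LIBRARY_PATH:")
        for path in found:
            print("  ", path)
    else:
        # Fall back to the standard library's generic shared-library lookup.
        system_hit = ctypes.util.find_library("cudart")
        print("No libcudart on LD_LIBRARY_PATH; system lookup returned:", system_hit)
```

Running report_cudart() before and after loading a cuda module shows whether the module changed what is visible. The version suffix of any file it lists can then be compared with the version named in the TensorFlow message.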
Using only 1 GPU on Frontera
Even though each GPU-enabled node on Frontera contains 4 attached GPUs, your code will typically not make use of all of them without some work on your part (even though your allocation will be charged for access to all of them). Approaches to addressing this are discussed in Running Jobs at TACC, Monitoring Jobs, and the topic on Distributed Training.
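A quick first check is to ask the installed framework how many GPUs it can see on the node. This is a minimal sketch with a hypothetical helper name, visible_gpu_count. It reports visibility only: a framework that sees every GPU may still run your code on just one of them unless the code is written to distribute the work.

```python
def visible_gpu_count():
    """Return the number of GPUs the installed framework can see, or None."""
    try:
        import torch  # only used if PyTorch is installed

        return torch.cuda.device_count()
    except ImportError:
        pass
    try:
        import tensorflow as tf  # only used if TensorFlow is installed

        return len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        pass
    # Neither framework is importable in this environment.
    return None


count = visible_gpu_count()
if count is None:
    print("Neither PyTorch nor TensorFlow is installed in this environment.")
else:
    print("GPUs visible to the framework:", count)
```

If the reported count is lower than the number of GPUs attached to the node, the problem lies in device visibility or the CUDA setup rather than in how the training code is written.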
Problems importing h5py for hdf5 functionality
See the page on h5py / phdf5 for more information.
ImportError: cannot import name 'OrderedDict' from 'typing' (/opt/apps/intel18/python3/3.7.0/lib/python3.7/typing.py)
Recent releases of both PyTorch and TensorFlow make use of a Python library feature (OrderedDict in the typing module of the Python Standard Library) that was introduced in Python version 3.7.2. If you are using an earlier version of Python, such as that provided by the TACC module python3/3.7.0, you will therefore encounter this ImportError. To run with a Python version earlier than 3.7.2, you will need to roll back PyTorch or TensorFlow to a version that does not use this library feature.
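The version requirement can be checked before installing anything. The sketch below is illustrative only; the names MIN_VERSION and supports_typing_ordereddict are hypothetical. It compares a Python version against 3.7.2, the release in which typing.OrderedDict was introduced.

```python
import sys

# typing.OrderedDict was introduced in Python 3.7.2.
MIN_VERSION = (3, 7, 2)


def supports_typing_ordereddict(version_info=None) -> bool:
    """Return True if a (major, minor, micro) version provides typing.OrderedDict."""
    if version_info is None:
        version_info = sys.version_info  # default to the running interpreter
    return tuple(version_info[:3]) >= MIN_VERSION


if not supports_typing_ordereddict():
    print(
        "Python %d.%d.%d lacks typing.OrderedDict; use a newer Python "
        "or an older PyTorch/TensorFlow release." % sys.version_info[:3]
    )
```

For example, supports_typing_ordereddict((3, 7, 0)) returns False, which corresponds to the python3/3.7.0 module, while any version from 3.7.2 onward returns True.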
Errors regarding urllib3 version
It is possible, when running either TensorFlow or PyTorch on Frontera, that you might encounter errors that indicate version incompatibilities involving the urllib3 and OpenSSL packages. The urllib3 package is brought into the virtual environments that we have instructed you to build as one of many additional dependencies. Such an error likely arises from incompatibilities between the urllib3 package and the Python versions available on Frontera through the Lmod system. If you do encounter one, you might try modifying the pip install commands to explicitly install urllib3 with a specified version number, such as urllib3==1.26.14.
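To see whether a pin is likely to help, you can compare the installed urllib3 version with the OpenSSL library that Python's ssl module is linked against. This is a sketch under stated assumptions: the helper names installed_urllib3_version and should_pin_urllib3 are hypothetical, the metadata lookup needs Python 3.8 or later, and the rule it encodes (urllib3 2.x requires OpenSSL 1.1.1 or newer) is assumed to be the cause of the error, which may not hold in every case.

```python
import ssl

try:
    from importlib import metadata  # standard library in Python 3.8+
except ImportError:
    metadata = None


def installed_urllib3_version():
    """Return the installed urllib3 version string, or None if unavailable."""
    if metadata is None:
        return None
    try:
        return metadata.version("urllib3")
    except metadata.PackageNotFoundError:
        return None


def should_pin_urllib3(urllib3_version, openssl_version_info) -> bool:
    """Flag the assumed conflict: urllib3 2.x with OpenSSL older than 1.1.1."""
    major = int(urllib3_version.split(".")[0])
    return major >= 2 and tuple(openssl_version_info[:3]) < (1, 1, 1)


version = installed_urllib3_version()
print("Python ssl module is linked against:", ssl.OPENSSL_VERSION)
print("Installed urllib3:", version)
if version is not None and should_pin_urllib3(version, ssl.OPENSSL_VERSION_INFO):
    print("Consider pinning, for example: urllib3==1.26.14")
```

The metadata lookup reads the installed version without importing urllib3, so the check still works in an environment where importing urllib3 itself fails.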