Parallel R
Steve Lantz, Christopher Cameron, Adam Brazier, Linda Woodard (original author)
Cornell Center for Advanced Computing
Revisions: 5/2025, 10/2021, 5/2018, 7/2014 (original)
This topic is a brief introduction to running R in parallel. It covers two basic strategies, multicore processing and multi-node parallelism, both of which are applicable to TACC's Frontera and similar supercomputers.
Most R packages are designed to use a single core, but there are a number of ways to run R in parallel. The most obvious approach is embarrassingly parallel computing, in which the same R script is invoked multiple times with different inputs. Also, in a shared memory environment (one node), you can take advantage of built-in multithreaded functions in R, in a fashion analogous to using OpenMP. You can also use libraries such as Rmpi, pbdR, or snow that are built on top of MPI. Of these, snow requires the least knowledge of MPI to use.
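As a small taste of the cluster-style workflow covered later in this topic, here is a sketch using R's built-in parallel package, which incorporates snow's cluster interface. It assumes a single machine with at least 4 cores available; on an HPC system the worker count and cluster type would instead be matched to the resources requested from the scheduler.

```r
# Load the parallel package (ships with R; provides snow's cluster interface)
library(parallel)

# Start a cluster of 4 worker processes on the local machine.
# (Assumption: 4 cores are available; adjust to your node's core count.)
cl <- makeCluster(4)

# Apply a function to each element of 1:8, spreading the work
# across the workers, in the style of lapply()
squares <- parLapply(cl, 1:8, function(x) x^2)

# Always shut the workers down when finished
stopCluster(cl)

unlist(squares)
```

The key design point is that parLapply() is a drop-in parallel analogue of lapply(), so existing list-based R code can often be parallelized with minimal changes.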
Objectives
After you complete this topic, you should be able to:
- Run parallel jobs in R
- Explain how to use multithreaded functions in R
- Describe how to use multicore processing in R on Frontera and similar HPC systems
- Explain how to use "snow" in a batch job
Prerequisites
This topic assumes the reader has no prior experience with R. The exercises and examples assume some familiarity with statistical analysis. Working through the exercises on Frontera or a similar HPC system requires a basic knowledge of Linux and the ability to access these systems via SSH.
Carrying out activities on Frontera will require an appropriate TACC allocation. As an alternative, some activities could be carried out on a local installation of R and others on another HPC resource with Slurm and R installed.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)