The availability of many large and detailed datasets across the physical, biological, technological, and social sciences has given rise to great interest in "data science", an amalgamation of tools and techniques from computational science, statistics, machine learning, and other fields, aimed at making sense of complex datasets and at building predictive models from those data. The Python programming language has emerged as one of the key technologies supporting data science, both because its expressive syntax allows for the rapid development of complex algorithms and data analysis pipelines, and because a dedicated community of developers and users has built a rich ecosystem of tools and libraries to facilitate research and production workflows. Many different types of libraries are integrated into the Python ecosystem for data science, supporting data access and manipulation, numerical modeling, statistical operations, machine learning, and data visualization. This tutorial provides a broad overview of some of these tools, with links to more detailed information elsewhere, along with concrete examples of how they can be used in various data science applications.


After you complete this roadmap, you should be able to:

  • Identify different forms of data in order to decide which tools are best suited to their analysis
  • Use some of the key Python packages for data science
  • Import data from files and/or databases and store those data in suitable Python data structures
  • Clean messy or incomplete datasets
  • Augment and aggregate datasets, and perform analyses over subgroups of data
  • Visualize datasets using a variety of visualization techniques
  • Integrate models and data
  • Work with real-world datasets to develop data science skills

This tutorial assumes the reader has some working knowledge of general programming concepts, even if not directly in the Python programming language. The target audience is scientists and engineers who are already programming in Python and are interested in using Python tools and packages to analyze datasets. If additional introductory material about Python is needed, readers can consult An Introduction to Python as well as the documentation on the website.


Python runs on computers ranging from typical laptops to the most powerful High Performance Computing (HPC) systems. To run the code examples in this roadmap, you will need either to install Python and the related packages on your local machine or to have access to a managed system where the relevant packages are already installed.
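One way to confirm that a local or managed system is ready is to ask Python itself which packages it can locate. The following is a minimal sketch using only the standard library. The package names in ASSUMED_PACKAGES are illustrative assumptions, not a list taken from this roadmap, and find_missing is a hypothetical helper name introduced here.

```python
import importlib.util
import sys

# Assumed example package names; this roadmap does not specify a list here,
# so substitute whichever packages your own work requires.
ASSUMED_PACKAGES = ["numpy", "pandas", "matplotlib"]


def find_missing(packages):
    """Return the top-level package names that Python cannot locate."""
    # find_spec returns None when no installed module matches the name;
    # it locates the package without importing it.
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = find_missing(ASSUMED_PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are available.")
```

Anything reported as missing would need to be installed locally, or requested from the administrators of a managed system, before the corresponding examples can be run.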

© Cornell University | Center for Advanced Computing