The Python ecosystem for data science includes many key libraries and packages. Some of the most widely used are listed below, although the list is far from complete.

Basics
"Batteries Included" Python Distributions
Data Structures and Algorithms
  • The Python Standard Library (the documentation is versioned, so select the release you are using)
  • NumPy: multi-dimensional arrays, array-level mathematical operations, linear algebra, random numbers, etc.
  • SciPy: algorithms for optimization, integration, interpolation, signal processing, statistics, and other scientific computing tasks, built on NumPy
  • Pandas: dataframes and series for representing tabular data, and a rich set of operations for analyzing such data
  • h5py: support for HDF5-formatted data files
  • Dask: distributed arrays and dataframes, scheduling of distributed workflows
  • SymPy: symbolic mathematics
  • NetworkX: creation, manipulation, and study of the structure, dynamics, and functions of complex networks
  • igraph: a network analysis library with a core written mainly in C and front-ends in Python, R, and Mathematica
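
As a brief illustration of the NumPy capabilities listed above (array-level arithmetic, linear algebra, and random numbers), here is a minimal sketch. It assumes NumPy is installed; the array values and variable names are made up for this example.

```python
import numpy as np

# A 2x3 array; arithmetic applies element-wise across the whole array.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
doubled = a * 2
col_means = a.mean(axis=0)  # mean of each column

# Linear algebra: solve the system M @ x = b.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(M, b)

# Random numbers from a seeded generator, so repeated runs give the same draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)
```

Pandas, SciPy, and Dask build on this array model, so the same element-wise style carries over to dataframes and distributed arrays.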
Interpreters, Notebooks and Development Environments
  • IPython: powerful interactive Python shell
  • Jupyter: interactive notebooks integrating code, results, graphics and documentation
  • Spyder: integrated development environment (IDE) combining editing, analysis, debugging, and profiling functionality
Data Visualization
  • Matplotlib: 2D plotting library that makes easy things easy and hard things possible
  • Seaborn: data visualization library based on matplotlib, providing a high-level interface for informative statistical graphics
  • Bokeh: interactive visualization library that targets modern web browsers for presentation
  • Plotly: interactive, browser-based graphing library built on plotly.js, with a high-level interface (Plotly Express) for common chart types
Relational Databases
  • SQLAlchemy: Python SQL toolkit and Object Relational Mapper, providing access to SQL databases
  • sqlite3: Python interface to SQLite, a C library that provides a lightweight disk-based SQL database
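
Because sqlite3 ships with the Python Standard Library, it is an easy way to try SQL from Python without installing anything. The sketch below uses an in-memory database; the table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database; pass a file path instead to persist data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")

# Hypothetical sample rows, inserted using parameter placeholders.
rows = [("A", 1.5), ("A", 2.5), ("B", 4.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
conn.commit()

# Aggregate in SQL and fetch the result as a list of tuples.
averages = conn.execute(
    "SELECT site, AVG(value) FROM measurements GROUP BY site ORDER BY site"
).fetchall()
conn.close()
```

The `?` placeholders let the driver handle quoting, which is generally safer than building SQL strings by hand.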
Scientific Computing and Statistics
  • Statsmodels: classes and functions for the estimation of statistical models, conducting statistical tests, and statistical data exploration
  • Scikits: add-on packages for SciPy, hosted and developed separately and independently from the main SciPy distribution, providing more specialized functionality in a large number of topic areas
Machine Learning
  • Scikit-learn: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing tools for machine learning
  • TensorFlow: end-to-end open source platform for machine learning, widely used for deep learning
  • PyTorch: open source machine learning framework that accelerates the path from research prototyping to production deployment
  • Keras: high-level neural networks API written in Python; earlier releases ran on top of TensorFlow, CNTK, or Theano, while Keras 3 supports TensorFlow, JAX, and PyTorch backends
  • Caffe: deep learning framework made with expression, speed, and modularity in mind
Image Processing
  • Scikit-image: collection of algorithms for image processing
  • Pillow: a fork of PIL, the Python Imaging Library
Natural Language Processing
Application-Specific APIs
  • Tweepy: Python library for accessing the Twitter API
Systems for Big Data
  • PySpark: Python interface to the Spark programming model
 
© Cornell University, Center for Advanced Computing