Key Packages
The Python ecosystem offers many libraries and packages for data science. The list below covers several key ones, but it is far from complete.
Basics
- python.org: main Python website
- Python documentation
- Python Package Index (PyPI): repository of packages
"Batteries Included" Python Distributions
Data Structures and Algorithms
- The Python Standard Library (select specific version on that page)
- NumPy: multi-dimensional arrays, array-level mathematical operations, linear algebra, random numbers, etc.
- SciPy: algorithms for scientific computing built on NumPy, including optimization, integration, interpolation, linear algebra, and statistics
- pandas: dataframes and series for representing tabular data, and a rich set of operations for analyzing such data
- h5py: support for HDF5-formatted data files
- Dask: distributed arrays and dataframes, scheduling of distributed workflows
- SymPy: symbolic mathematics
- NetworkX: creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- igraph: a network analysis library, written in C++, with front-ends in Python, R, and Mathematica
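As a rough illustration of the NumPy features named above (arrays, array-level operations, linear algebra, random numbers), here is a minimal sketch. The array values and the seed are arbitrary choices made for this example, not anything prescribed by NumPy.

```python
import numpy as np

# A 2-D array; arithmetic is applied element by element.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
doubled = a * 2

# Linear algebra: solve the system a @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(a, b)

# Random numbers drawn from a seeded generator (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)

print(doubled)
print(x)
print(samples.shape)
```

The other packages in this section follow a similar pattern: import the package, build its core data structure, then call operations on it.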
Interpreters, Notebooks and Development Environments
- IPython: powerful interactive Python shell
- Jupyter: interactive notebooks integrating code, results, graphics and documentation
- Spyder: integrated development environment (IDE) combining editing, analysis, debugging, and profiling functionality
Data Visualization
- Matplotlib: 2D plotting library that makes easy things easy and hard things possible
- Seaborn: data visualization library based on matplotlib, providing a high-level interface for informative statistical graphics
- Bokeh: interactive visualization library that targets modern web browsers for presentation
- Plotly: interactive, browser-based graphing library built on the plotly.js JavaScript library
Relational Databases
- SQLAlchemy: Python SQL toolkit and Object Relational Mapper, providing access to SQL databases
- sqlite3: standard-library Python interface to SQLite, a C library that provides a lightweight disk-based SQL database
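Because sqlite3 ships with Python, it can be tried without installing anything. The following is a minimal sketch; the table name, column names, and rows are hypothetical and exist only for this example.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration.
conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [("numpy", "arrays"), ("pandas", "dataframes"), ("bokeh", "visualization")],
)
conn.commit()

# The "?" placeholders pass values separately from the SQL text.
rows = conn.execute(
    "SELECT name FROM packages WHERE category != ? ORDER BY name",
    ("visualization",),
).fetchall()
conn.close()

print(rows)
```

Passing a file path instead of ":memory:" would store the database on disk.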
Scientific Computing and Statistics
- Statsmodels: classes and functions for the estimation of statistical models, conducting statistical tests, and statistical data exploration
- Scikits: add-on packages for SciPy, hosted and developed separately and independently from the main SciPy distribution, providing more specialized functionality in a large number of topic areas
Machine Learning
- Scikit-learn: machine learning methods in Python, including classification, regression, clustering, dimensionality reduction, and model selection
- TensorFlow: end-to-end open source platform for machine learning, widely used for deep learning
- PyTorch: open source machine learning framework that accelerates the path from research prototyping to production deployment
- Keras: high-level neural networks API, written in Python; earlier versions ran on top of TensorFlow, CNTK, or Theano, while Keras 3 supports TensorFlow, JAX, and PyTorch as backends
- Caffe: deep learning framework made with expression, speed, and modularity in mind
Image Processing
- Scikit-image: collection of algorithms for image processing
- Pillow: a fork of PIL, the Python Imaging Library
Natural Language Processing
- Natural Language Toolkit (NLTK): platform for building Python programs to work with human language data
- spaCy: industrial-strength natural language processing in Python
- TextBlob: Python library for processing textual data
- python-Levenshtein: for computing string similarities and edit distances
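To make "edit distance" concrete, here is a plain-Python sketch of the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. This is not the python-Levenshtein API; the function name `edit_distance` is a hypothetical one chosen for this example, and the library itself provides a much faster compiled implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (illustrative, not optimized)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb, or match
                )
            )
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the dynamic-programming table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.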
Application-Specific APIs
- Tweepy: Python library for accessing the Twitter API
Systems for Big Data
- PySpark: Python interface to the Spark programming model