Key Packages
The Python ecosystem for data science includes many libraries and packages. The list below highlights several key ones, but it is far from complete.
Basics
- python.org: main Python website
- Python documentation
- Python Package Index (PyPI): repository of packages
"Batteries Included" Python Distributions
Data Structures and Algorithms
- The Python Standard Library (select specific version on that page)
- NumPy: multi-dimensional arrays, array-level mathematical operations, linear algebra, random numbers, etc.
- SciPy: algorithms for scientific computing built on NumPy, including optimization, integration, interpolation, linear algebra, and statistics
- Pandas: dataframes and series for representing tabular data, and a rich set of operations for analyzing such data
- h5py: support for HDF5-formatted data files
- Dask: distributed arrays and dataframes, scheduling of distributed workflows
- SymPy: symbolic mathematics
- NetworkX: creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- igraph: a network analysis library with a core written in C/C++ and front-ends for Python, R, and Mathematica
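As a small illustration of the first entry in this list, the sketch below uses three containers from the Python Standard Library: `collections.Counter`, `collections.deque`, and `heapq`. The sample sentence and all variable names are made up for this example.

```python
from collections import Counter, deque
import heapq

# Hypothetical sample data: words from a short piece of text.
words = "the quick brown fox jumps over the lazy dog the end".split()

# Counter: tally how often each word occurs.
word_counts = Counter(words)
most_common_word, most_common_count = word_counts.most_common(1)[0]

# deque with maxlen: a sliding window that keeps only the last three items.
window = deque(maxlen=3)
for word in words:
    window.append(word)
last_three = list(window)

# heapq.nlargest: pick the three longest distinct words without a full sort.
longest_three = heapq.nlargest(3, set(words), key=len)

print(most_common_word, most_common_count)
print(last_three)
print(sorted(longest_three))
```

Third-party packages such as NumPy and Pandas build on the same idea of choosing a container suited to the task, but add array-oriented and tabular operations that the standard library does not provide.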
Interpreters, Notebooks and Development Environments
- IPython: powerful interactive Python shell
- Jupyter: interactive notebooks integrating code, results, graphics and documentation
- Spyder: integrated development environment (IDE) combining editing, analysis, debugging, and profiling functionality
Data Visualization
- Matplotlib: 2D plotting library that makes easy things easy and hard things possible
- Seaborn: data visualization library based on matplotlib, providing a high-level interface for informative statistical graphics
- Bokeh: interactive visualization library that targets modern web browsers for presentation
- Plotly: interactive graphing library built on plotly.js that renders charts in web browsers and notebooks
Relational Databases
- SQLAlchemy: Python SQL toolkit and Object Relational Mapper, providing access to SQL databases
- sqlite3: Python interface to SQLite, a C library that provides a lightweight disk-based SQL database
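Because `sqlite3` ships with Python, it is an easy way to try out SQL without installing a database server. The following is a minimal sketch that uses an in-memory database; the `packages` table, its columns, and the inserted rows are assumptions made for illustration only.

```python
import sqlite3

# ":memory:" creates a temporary database that is never written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table holding package names and the category they belong to.
conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, category TEXT)")

rows = [
    ("NumPy", "Data Structures and Algorithms"),
    ("Pandas", "Data Structures and Algorithms"),
    ("Matplotlib", "Data Visualization"),
]
# Parameter placeholders (?) avoid building SQL strings by hand.
conn.executemany("INSERT INTO packages VALUES (?, ?)", rows)
conn.commit()

# Count the packages in each category.
cursor = conn.execute(
    "SELECT category, COUNT(*) FROM packages GROUP BY category ORDER BY category"
)
counts = cursor.fetchall()
conn.close()

print(counts)
```

SQLAlchemy can sit on top of the same SQLite database when an object-relational layer or portability across database engines is needed.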
Scientific Computing and Statistics
- Statsmodels: classes and functions for the estimation of statistical models, conducting statistical tests, and statistical data exploration
- Scikits: add-on packages for SciPy, hosted and developed separately and independently from the main SciPy distribution, providing more specialized functionality in a large number of topic areas
Machine Learning
- Scikit-learn: many different machine learning methods in Python
- TensorFlow: Deep Learning in Python (an end-to-end open source platform for machine learning)
- PyTorch: open source machine learning framework that accelerates the path from research prototyping to production deployment
- Keras: high-level neural networks API, written in Python; earlier versions ran on top of TensorFlow, CNTK, or Theano, while Keras 3 supports TensorFlow, JAX, and PyTorch backends
- Caffe: deep learning framework made with expression, speed, and modularity in mind
Image Processing
- Scikit-image: collection of algorithms for image processing
- Pillow: a fork of PIL, the Python Imaging Library
Natural Language Processing
- Natural Language Toolkit (NLTK): platform for building Python programs to work with human language data
- spaCy: industrial-strength natural language processing in Python
- TextBlob: Python library for processing textual data
- python-Levenshtein: for computing string similarities and edit distances
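To make the idea of an edit distance concrete, here is a plain-Python sketch of the Levenshtein distance using the standard dynamic-programming recurrence. It does not use or reproduce the python-Levenshtein API; the function name `levenshtein` is chosen for this example, and a compiled library will be much faster on large inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("flaw", "lawn"))
```

A smaller distance means the two strings are more similar, which is why edit distances are commonly used for tasks such as spelling correction and fuzzy matching.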
Application-Specific APIs
- Tweepy: Python library for accessing the Twitter API
Systems for Big Data
- PySpark: Python interface to the Spark programming model
© Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)