The relational model
A historical note
The relational model for databases dates back to the late 1960s, when computer scientist E. F. Codd developed the model while working at IBM. The original nomenclature of the model as written is that data units are stored in tuples, with each tuple consisting of attributes; data units with the same set of attributes are stored in relations. Although you do sometimes see these definitions used, in practice the tuples are normally called rows, the attributes are called columns or fields, and a relation is then familiarly called a table. We will use these more common terms as they will be more familiar to users, and are also entities in the SQL programing language.
Storing the data
The idea of data stored in columnar rows in a table is a familiar one! In a relational database as typically constructed, there will be multiple tables which are related to each other; the specification of how to decide on the collection of these tables, and their interrelationships, is called normalization (discussed later) and is focused on maintaining consistency of the contained data under normal operations of inserting, updating and deleting data. In fact, the ability to maintain consistency of the data—applying constraints so that, for example, if recording measurement from a particular instrument, that instrument has to also be exist in the database—is one of the reasons to use relational databases over, for example using a spreadsheet to store it all.
A specification achieved by most relational databases is ACID, which stands for Atomicity, Consistency, Isolation and Durability and applies to database operations, called transactions. These properties relate to guarantees of system robustness and behaviour in light of failures during operations, including power failures:
- Atomicity: either an operation succeeds entirely or it fails and will be rolled back. For example, if a transaction involves writing to a table and one value cannot be inserted, the whole transaction will fail (no data will be put into the table).
- Consistency: the constraints preserving consistency—for example, a rule that requires that an observation has to have been made by a person also listed in the database—will be obeyed at all times.
- Isolation: transactions do not affect each other; they are isolated
- Durability: once a transaction is committed, it is permanent. This typically means that the data are written to permanent storage
Accessing the data with SQL
A data store would not be very useful without a method for extracting the data, preferably subsets of data based on criteria which limit the data returned. With relational databases, we do this by writing queries in SQL (Structured Query Language), either pronounced as "S. Q. L" or "sequel". The SQL syntax may be unfamiliar to start, but it becomes fairly intuitive with experience. SQL allows for sophisticated query structures and also for some calculations to be performed at query time, although there may be a balance to strike between query processing effort on the database—which is often an expensive shared resource—and a coarser-but-quicker query followed by subsequent offline processing on the retrieved dataset.
It may be possible to avoid writing SQL queries at all, using tools such as object-relational models (ORMs). Although the use of these tools is largely outside the scope of this material, they are reasonably commonly used. Additionally, although there is an ANSI standard for SQL it is not universally implemented in each Relational Database Management System (RDBMS) and most RDBMSs have additional features which are departures from the ANSI standard. We will focus on the widely-implemented elements of SQL, and note common RDBMS variations from them.
Relational Database Management Systems
A Relational Database Management System (RDBMS) is the software operating the relational database, responsible for ensuring that the data are kept consistent and compiling and executing SQL database commands. There are rather many RDBMSs, the most common of which include:
- Oracle
- Microsoft SQL Server
- MySQL/MariaDB
- IBM DB2
- PostgreSQL
- Sybase
- SQLite
Several of these RDBMSs are open-source; the proprietary ones can be very expensive! Of particular interest is SQLite, which is a simple but performant open-source database library with bindings for many programming languages including Python, in addition to availability of simple GUIs; SQLite is a good choice for starting out with relational databases, although any RDBMS should work just fine. The choice of which RDBMS to use in production will often be driven by cost and performance considerations as well as the availability of advanced features including availability of special datatypes, scalability, complexity of interfacing with a given framework, and other issues.