Incorporating databases into your workflow II
More tips, tricks and considerations:
Programmatic access to databases
The flow of data into and out of a database is often handled by code, much as reading from and writing to files is. Exactly how this is done will depend on the RDBMS and the programming language in question, but in many cases (as in Python) there is a database interface/API specification which gives a consistent programmatic experience across different RDBMSs. In most cases you will install a library, and possibly some binary software acting as a driver, and then you will be able to access the database directly; there is too much variation between languages and RDBMSs to detail here, but it is normally simple enough to get set up.
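For instance, here is a minimal sketch using Python's built-in sqlite3 module, which follows the DB-API specification mentioned above; the database file, table and column names are all hypothetical.

    import sqlite3

    # Connect to a (hypothetical) SQLite database file; other RDBMSs use
    # different driver libraries, but the DB-API pattern is the same.
    conn = sqlite3.connect("experiments.db")
    cur = conn.cursor()

    # Parameterized queries keep the values separate from the SQL text.
    cur.execute("SELECT run_id, result FROM runs WHERE operator = ?", ("alice",))
    for run_id, result in cur.fetchall():
        print(run_id, result)

    conn.close()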
One important note: most programmatic access will not persist changes to the database until the transaction is committed; uncommitted changes are rolled back, which is often a cause of considerable frustration and confusion. It is often possible to set the connection to autocommit, but in any case it is important to check the documentation of the interfacing library.
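To make the commit behaviour concrete, here is a sketch against the same (hypothetical) database, again with sqlite3; other libraries differ in the details.

    import sqlite3

    conn = sqlite3.connect("experiments.db")
    cur = conn.cursor()
    cur.execute("INSERT INTO runs (operator, result) VALUES (?, ?)", ("alice", 0.93))

    conn.commit()   # without this, closing the connection rolls the INSERT back

    # sqlite3 can also be opened in autocommit mode, e.g.:
    #   conn = sqlite3.connect("experiments.db", isolation_level=None)
    # but always check the documentation of the library you are using.
    conn.close()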
Finally, as mentioned elsewhere, an object-relational mapper (ORM) can allow all operations on the database, including database design and creation, to be performed from code. Although this removes the ability to tune interactions with the database, and may result in some design choices with which not everyone will agree, it can be a convenient approach. If database design is done independently, the ORM can instead be used to produce an object model of the existing database for programmatic use in an object-oriented language.
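As a sketch of the ORM approach, here is a minimal example using SQLAlchemy, one popular Python ORM; the table and column names are hypothetical.

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Sample(Base):
        __tablename__ = "sample"
        # ORMs generally want a unique integer primary key
        # (see the note on surrogate keys below).
        sample_id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)

    engine = create_engine("sqlite:///experiments.db")
    Base.metadata.create_all(engine)   # design and creation performed from code

    with Session(engine) as session:
        session.add(Sample(name="specimen A"))
        session.commit()               # the same commit rule applies here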
Graphical interfaces
A GUI can be very useful for some tasks, particularly designing and creating the database. It is important to know how to perform these tasks in SQL, but a visualization of the tables, columns, data types, primary keys and the connections between tables often helps reduce mistakes, and it frequently makes query construction easier too. These tools are less useful for INSERT, UPDATE or DELETE operations. All RDBMSs have GUIs available, including SQLite.
Tuning performance
When you're confident your SQL produces the right results but it runs too slowly, the process of improving its speed is typically called "tuning". Tuning might involve improving the query itself, modifying the database (for example, adding indexes or denormalizing), or changing the configuration of the RDBMS. There is generally no point in tuning SQL which produces the wrong results, so check correctness first!
If the performance is good enough and you don't have reason to believe it will fail as the database scales up, leave it alone; in many cases, database performance is fine over the range of likely operating conditions. However, if you have large numbers of rows and/or many tables, operations such as JOIN can be rather slow, although "too slow" should always be considered against what is needed. Some concrete tips:
- Ensure that the most frequently searched columns are indexed. For WHERE clauses, the value of an index on a column increases with the fraction of values in that column which are unique. Primary keys and other guaranteed-UNIQUE columns are always indexed, but the other side of a foreign key relationship, the foreign key column itself, will not be by default. Check that all of the foreign key columns in problematic JOINs are indexed; this can be a huge performance improvement (see the first sketch after this list). Indexing comes with a cost, however: indexes take up space and generally require extra computational effort to maintain after INSERT, UPDATE or DELETE operations.
- Check hardware use during slow queries, particularly CPU, RAM and disk input/output. Any of these can be a choke point, and an upgrade to the host may help. This is easy to test in the cloud or on VMs; sometimes hardware upgrades really are necessary, but it is important to be convinced before spending the money.
- JOINs are expensive and normalization increases the number of tables and therefore makes JOINs more likely. Sometimes a little denormalization—say, having a duplicate column in more than one table, or having columns in one table that could be derived from consulting another—is a price worth paying for better performance. Beware the risk of anomalies, though!
- If using correlated subqueries, make sure you really do need them, as they can be costly.
- Sometimes temporary tables, the syntax for which varies between RDBMSs, can be very useful: you can load a temporary table with the result of a query and then use it multiple times (see the second sketch after this list).
- Many RDBMSs have tools to help you optimize, such as query hints and query analyzers; it is worth familiarizing yourself with these (the first sketch below uses SQLite's EXPLAIN QUERY PLAN).
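As an illustration of both the foreign key indexing advice and a query analyzer, here is a minimal sketch in Python's sqlite3 module; the tables and column names are hypothetical. SQLite's EXPLAIN QUERY PLAN reports how the planner intends to execute a query, so you can compare the plans before and after adding the index.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE plate (plate_id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE well  (well_id  INTEGER PRIMARY KEY,
                            plate_id INTEGER REFERENCES plate(plate_id),
                            value    REAL);
    """)

    query = "SELECT label, value FROM plate JOIN well ON plate.plate_id = well.plate_id"

    # Ask the analyzer for the plan before indexing the foreign key column.
    for row in cur.execute("EXPLAIN QUERY PLAN " + query):
        print(row)

    # Index the foreign key column and ask again; with the index available,
    # the planner can typically search well by plate_id rather than scanning it.
    cur.execute("CREATE INDEX idx_well_plate_id ON well(plate_id)")
    for row in cur.execute("EXPLAIN QUERY PLAN " + query):
        print(row)

    conn.close()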
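And a sketch of the temporary table tip, again in sqlite3 with hypothetical names; CREATE TEMPORARY TABLE ... AS SELECT is SQLite's syntax, and other RDBMSs will differ.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE well (well_id INTEGER PRIMARY KEY, value REAL);
        INSERT INTO well (value) VALUES (0.2), (0.95), (0.97);

        -- Load the filtered rows once...
        CREATE TEMPORARY TABLE high_values AS
            SELECT well_id, value FROM well WHERE value > 0.9;
    """)

    # ...then use the temporary table as many times as needed.
    print(conn.execute("SELECT COUNT(*) FROM high_values").fetchone())
    print(conn.execute("SELECT MAX(value) FROM high_values").fetchone())
    conn.close()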
Testing queries
It's important that your SQL produces the right results, particularly when decisions are informed or motivated by those results. At the same time, particularly for large data sets, it's hard to tell whether the results are correct. While testing queries can be something of an art, here are some tips:
- Looking at the top 50 or 100 rows can be helpful as a first check, but it's risky to assume they're fully representative of the whole
- If you have another copy of the results, such as when you are improving the performance of a correct but too-slow query, you can pull the returned results and compare them with code (a sketch appears after this list), or run a SQL query to compare the two queries:

    (SELECT * FROM <table1> EXCEPT SELECT * FROM <table2>)
    UNION
    (SELECT * FROM <table2> EXCEPT SELECT * FROM <table1>);

  (Note that <table1> and <table2> can be queries which return tables; an empty result means the two queries agree. This works because UNION joins sets and EXCEPT subtracts results, but it will only work if neither query returns duplicate rows.)
- If there are a few important rows or edge cases which you know should, or should not, be in a returned dataset, check for their presence; ORDER BY will often make them easier to find
- Making a copy of the database containing only edge-case rows and a few standard rows gives you a small database on which to test queries. You can also use WHERE, if you are careful, to restrict the results returned to those of most interest
- If you know how many rows should be returned, COUNT() can be useful
- It's good to have multiple independent checks of important queries
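For the "compare them with code" tip above, a minimal sketch in Python follows; unlike the set-based SQL comparison, comparing multisets of rows also copes with duplicates. The connection and the two queries are assumed to exist.

    from collections import Counter
    import sqlite3

    def same_results(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> bool:
        """Return True if two queries produce the same rows (as multisets)."""
        rows_a = Counter(conn.execute(sql_a).fetchall())
        rows_b = Counter(conn.execute(sql_b).fetchall())
        return rows_a == rows_b

    # Usage: compare a slow but trusted query with a tuned candidate.
    # assert same_results(conn, slow_but_correct_sql, tuned_sql)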
Surrogate and natural keys
As discussed in the designing databases topic, surrogate keys (an artificial primary key, such as an integer which autoincrements when a new row is added) are very common. In addition to making it easier to achieve 2NF, surrogate keys are stable against potential changes to natural keys such as serial numbers, perhaps arising from input error or some other collision (serial numbers being the same from two different manufacturers, for example). Some practitioners prefer to use natural keys wherever possible, as they contain actual semantic information, which often makes it easier to know the necessary value for foreign keys and may make it more obvious when mistakes are being made in a query. A further reason surrogate keys may be preferred is that many object-relational mappers (ORMs) require or recommend a unique integer column; if the ORM is used to produce an object model of an existing database, such a column will need to be added if it doesn't already exist.
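A minimal sketch of the two approaches in sqlite3, with a hypothetical table: the surrogate key is an autoincrementing integer, while the natural key (manufacturer plus serial number, to survive cross-manufacturer collisions) is kept UNIQUE alongside it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE instrument (
            -- Surrogate key: an autoincrementing integer with no real-world meaning.
            instrument_id INTEGER PRIMARY KEY AUTOINCREMENT,
            manufacturer  TEXT NOT NULL,
            serial_number TEXT NOT NULL,
            -- The natural key is still enforced, guarding against duplicates.
            UNIQUE (manufacturer, serial_number)
        );
    """)
    conn.execute("INSERT INTO instrument (manufacturer, serial_number) VALUES (?, ?)",
                 ("Acme", "SN-001"))
    print(conn.execute("SELECT instrument_id FROM instrument").fetchone())  # (1,)
    conn.close()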
Output of data
In the simplest case, your query output is just displayed on the screen. This is useful for sanity-checking and for getting quick answers to questions about the data in the database, but often you will need to conduct further analysis. The two basic options are to pull the data directly into software, as described above, or to write it to a text file.
Most RDBMSs have their own functions for writing output to a text file. This can be useful for keeping a permanent record of the output, and text files are a perfectly usable input for programs in most computer languages.
In general, the output will have one line of text per row, and may also include a header row of column names. You will probably have to specify an encoding (e.g., ASCII, Unicode), which can significantly affect the size of the text file; note that writing Unicode strings from the database to an ASCII file may have unpredictable and undesirable results. You may also be able to specify the column delimiter (tabs produce more readable output than commas, but either should be consumable by a spreadsheet program, and either allows missing values to be represented by a blank), and you may have to specify the line terminators, which is particularly important if you work in a mixed *nix/Windows environment.
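As a final sketch, here is one way to write query output to a delimited text file from Python, with the encoding, delimiter and line terminators all stated explicitly; the database file and table are hypothetical.

    import csv
    import sqlite3

    conn = sqlite3.connect("experiments.db")   # hypothetical database file
    cur = conn.execute("SELECT * FROM runs")   # hypothetical table

    # newline="" lets the csv module control line terminators (CRLF by default),
    # which matters in a mixed *nix/Windows environment; the encoding is explicit.
    with open("runs.tsv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")               # tab-delimited columns
        writer.writerow([col[0] for col in cur.description])  # header row of names
        writer.writerows(cur)                                 # one line of text per row

    conn.close()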