The Lost Science Of Relational Algebra And Set Theory

I originally wrote this post in 2011. Much has changed in the database technology landscape since then. Big Data technologies such as Hadoop have gone mainstream, and cloud technology is changing how and where we think about hosting our databases.

But relational databases are still relied upon as the best option for rich transactional data.

So, since this technology is still the foundation of our mission-critical systems, we should understand how to take advantage of one of the foundational elements of relational technology: The Set.

SQL (Structured Query Language) was built upon relational algebra. This rigorous approach to query definition is largely about set theory. This post is not a detailed technical discussion of relational algebra or set theory; instead, it is about the way that relational databases are often misused.

The purpose of this article is to discuss the central theme of relational database technology and one of its greatest strengths, one that is often overlooked by those practicing Oracle or SQL Server database design and development: set theory. Relational databases like Oracle and SQL Server are built and optimized to process sets of rows, as opposed to individual rows. Many application developers, even those who use these relational tools, struggle to think beyond the individual row. That is why the major relational database vendors have created very powerful procedural languages such as PL/SQL and T-SQL.

In many cases, developers use these tools to step row by row through a dataset (by using cursors) because they may not understand how the set operators work. This approach leads to unnecessary development and higher maintenance costs, as well as poor performance.

There are definitely times when a procedural process is necessary. But often there are set-based alternatives that would be more efficient and much easier to develop.
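To illustrate the contrast, here is a minimal sketch in Oracle syntax that loads the Event table from Event_Source_1 (both tables are defined later in this post) in two ways. The loading step itself is an assumption made for illustration; the post does not show how the common table is actually populated.

```sql
-- Row-by-row approach: a PL/SQL cursor loop inserts one row per iteration.
BEGIN
  FOR src IN (SELECT Event_Id_Orig, Event_Name_Orig, Event_Description_Orig,
                     Data_Source_Location_Orig, Event_Date_Orig
              FROM Event_Source_1)
  LOOP
    INSERT INTO Event (Event_Id, Event_Name, Event_Description,
                       Data_Source_Location, Event_Date)
    VALUES (src.Event_Id_Orig, src.Event_Name_Orig, src.Event_Description_Orig,
            src.Data_Source_Location_Orig, src.Event_Date_Orig);
  END LOOP;
END;
/
-- (The "/" above runs the block in SQL*Plus or a similar client.)

-- Set-based approach: a single statement processes the whole set of rows.
INSERT INTO Event (Event_Id, Event_Name, Event_Description,
                   Data_Source_Location, Event_Date)
SELECT Event_Id_Orig, Event_Name_Orig, Event_Description_Orig,
       Data_Source_Location_Orig, Event_Date_Orig
FROM Event_Source_1;
```

Both versions insert the same rows. The second leaves the choice of execution strategy to the database and is shorter to write and maintain.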

In this post, I will focus on three core set operators: Union, Intersect, and Difference.

First some definitions:

Given two sets, Set A and Set B

Union: All records from Set A and all records from Set B. If a record exists in both sets, it will only appear once in the Union (Areas A, B, and C in figure 1).

Intersection: The unique set of records that exist in both Set A and Set B (Area C in figure 1).

Difference: The difference between Set A and Set B is all the records in Set A except those that also exist in Set B (Area A in figure 1).
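To make these definitions concrete, here is a small sketch in Oracle syntax. The tables Set_A and Set_B and their values are hypothetical, invented only for this illustration: Set A holds 1, 2, 3 and Set B holds 2, 3, 4.

```sql
-- Hypothetical tables used only to illustrate the three definitions.
CREATE TABLE Set_A (val NUMBER);
CREATE TABLE Set_B (val NUMBER);

INSERT INTO Set_A VALUES (1);
INSERT INTO Set_A VALUES (2);
INSERT INTO Set_A VALUES (3);

INSERT INTO Set_B VALUES (2);
INSERT INTO Set_B VALUES (3);
INSERT INTO Set_B VALUES (4);

-- Union: the values 1, 2, 3, 4 (2 and 3 appear only once).
SELECT val FROM Set_A
UNION
SELECT val FROM Set_B;

-- Intersection: the values 2 and 3, which exist in both sets.
SELECT val FROM Set_A
INTERSECT
SELECT val FROM Set_B;

-- Difference (A minus B): the value 1, which exists only in Set A.
SELECT val FROM Set_A
MINUS
SELECT val FROM Set_B;
```

None of these queries guarantees the order of the returned rows; add an ORDER BY if the order matters.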

Vendor Differences

Relational databases implement these operators in different ways, but they all provide a relatively simple way to combine and compare similar sets of data. Oracle has the Union, Intersect, and Minus operators. SQL Server has Union, Intersect, and Except operators.

MySQL can perform these operations as well, but historically with more effort. Versions before 8.0.31 have no INTERSECT or EXCEPT operator, so to do a difference operation, for example, you must use NOT EXISTS or NOT IN, resulting in a more complex SQL statement.
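As a rough sketch of what such a rewrite looks like, the query below computes a difference without a dedicated operator, using the Event_Source_1 and Event_Source_2 tables defined later in this post. It assumes the compared column contains no NULLs: the Minus and Except operators treat two NULLs as matching, while the equality test used here does not.

```sql
-- Data source locations present in Event_Source_2 but absent from
-- Event_Source_1, written with NOT EXISTS instead of MINUS / EXCEPT.
-- DISTINCT mirrors the duplicate removal that the set operators perform.
SELECT DISTINCT s2.Data_Source_Location_Orig
FROM Event_Source_2 s2
WHERE NOT EXISTS (
    SELECT 1
    FROM Event_Source_1 s1
    WHERE s1.Data_Source_Location_Orig = s2.Data_Source_Location_Orig
);
```

With more than one compared column, the correlated WHERE clause needs one equality test per column, which is where the extra complexity comes from.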

Example

Let's examine how Oracle implements each of these set operations with a simple example.

This post is intended to discuss the concepts so I did not include the data and the actual query results in the post. But you can download the script to create and populate the tables with test data and run the queries here: set_tables_sql

Suppose you collect bank account events (debits, credits) from multiple sources. You place them into one common table, but also maintain the original source records in separate tables for historical purposes.  The original source records never change, but the events in the common table can be modified as necessary by the users.

Now suppose that occasionally you need to compare the transactional data in the common table to the original source data to see which rows have been changed. This is very easy using set operators.

The tables that we will use for this example follow. I used different column names in each table to illustrate that the column names do not need to be the same in each set that you are comparing. However, the number of columns in each query must be the same, and the data types of corresponding columns must be compatible.

Table Definitions

CREATE TABLE Event
(
Event_Id NUMBER,
Event_Name VARCHAR2(30),
Event_Description VARCHAR2(255),
Data_Source_location VARCHAR2(30),
Event_Date DATE
);

CREATE TABLE Event_Source_1
(
Event_Id_Orig NUMBER,
Event_Name_Orig VARCHAR2(30),
Event_Description_Orig VARCHAR2(255),
Data_Source_location_Orig VARCHAR2(30),
Event_Date_Orig DATE
);

CREATE TABLE Event_Source_2
(
Event_Id_Orig NUMBER,
Event_Name_Orig VARCHAR2(30),
Event_Description_Orig VARCHAR2(255),
Data_Source_location_Orig VARCHAR2(30),
Event_Date_Orig DATE
);

Example 1 — Union: Now suppose you needed to display all event names that appear in Event Source 1 and Event Source 2. The Union operator will display records from both tables, but records appearing in both tables will only appear once (unless the union all operator is specified, in which case duplicates will be displayed).

SELECT Event_Name_Orig FROM Event_Source_1
UNION
SELECT Event_Name_Orig FROM Event_Source_2;
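For comparison, here is a sketch of the union all variant mentioned above. The literal Source_Name column is a hypothetical label, added only so that each row shows which table it came from.

```sql
-- UNION ALL keeps every row, including names that occur in both sources.
-- Source_Name is a hypothetical label column added for illustration.
SELECT 'Source 1' AS Source_Name, Event_Name_Orig FROM Event_Source_1
UNION ALL
SELECT 'Source 2', Event_Name_Orig FROM Event_Source_2;
```

The column names of the result come from the first query, so the second query does not need the alias.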

Example 2 — Intersection: Now suppose you needed to display only events from Source 1 that have remained unchanged in the Event table. This can be done with an intersection between Event and Event_Source_1.

SELECT Event_Name,Event_Description,Data_Source_Location FROM Event
INTERSECT
SELECT Event_Name_Orig,Event_Description_Orig,Data_Source_Location_Orig FROM Event_Source_1;

Example 3  —  Difference: Now suppose you want to know all Data Source Locations that appear in the original Data Source 2 data but not in the original Data Source 1 data. This can be done by using the difference operation, implemented with the Minus operator by Oracle. This will take all the records from one set and subtract those that also exist in another set.

SELECT Data_Source_Location_Orig FROM Event_Source_2
MINUS
SELECT Data_Source_Location_Orig FROM Event_Source_1;
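The scenario described earlier, finding source rows that have since been changed in the common table, can be sketched with the same operator. This assumes that Event_Id is carried over unchanged from Event_Id_Orig when rows are loaded, which this post does not state explicitly.

```sql
-- Rows from Source 1 that no longer have an identical row in Event,
-- i.e. events that were modified (or removed) after they were loaded.
SELECT Event_Id_Orig, Event_Name_Orig, Event_Description_Orig,
       Data_Source_Location_Orig, Event_Date_Orig
FROM Event_Source_1
MINUS
SELECT Event_Id, Event_Name, Event_Description,
       Data_Source_Location, Event_Date
FROM Event;
```

Swapping the two queries would instead list the rows in Event that have no identical row in Source 1, which includes both the modified versions and any rows that came from other sources.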

Database Design Considerations

These powerful operators can be used to reduce or eliminate the need for cursors in many cases. The usefulness of these operators is dependent on sound database design and a well-normalized table structure. For example, a table that has repeating columns designating the same data element (as opposed to using multiple rows) will render these operators much less useful.
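As a sketch of why repeating columns get in the way, consider two hypothetical designs for storing customer phone numbers. All table and column names below are invented for illustration, and the question asked is which phone numbers customers 1 and 2 have in common.

```sql
-- Hypothetical denormalized design: repeating columns for the same data element.
CREATE TABLE Customer_Contact_Wide
(
Customer_Id NUMBER,
Phone_1 VARCHAR2(20),
Phone_2 VARCHAR2(20),
Phone_3 VARCHAR2(20)
);

-- Hypothetical normalized design: one row per customer phone number.
CREATE TABLE Customer_Phone
(
Customer_Id NUMBER,
Phone VARCHAR2(20)
);

-- Normalized: the set operator applies directly.
SELECT Phone FROM Customer_Phone WHERE Customer_Id = 1
INTERSECT
SELECT Phone FROM Customer_Phone WHERE Customer_Id = 2;

-- Denormalized: the repeating columns must first be turned into rows,
-- and unused (NULL) slots filtered out, before the sets can be compared.
WITH phones AS (
  SELECT Customer_Id, Phone_1 AS Phone FROM Customer_Contact_Wide
  UNION ALL
  SELECT Customer_Id, Phone_2 FROM Customer_Contact_Wide
  UNION ALL
  SELECT Customer_Id, Phone_3 FROM Customer_Contact_Wide
)
SELECT Phone FROM phones WHERE Customer_Id = 1 AND Phone IS NOT NULL
INTERSECT
SELECT Phone FROM phones WHERE Customer_Id = 2 AND Phone IS NOT NULL;
```

The second query also has to change every time another Phone_n column is added, while the first does not.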

Conclusion

With careful database design and a good understanding of the Set management tools provided by the relational vendors, we can simplify and speed development and reduce maintenance costs. Let's think in terms of sets and get the most out of our relational database investment!

If you would like to discuss set theory or relational database design, please give me a call at (888) 809-4803 x 700. If you have further thoughts on the topic, please add comments!
