MySQL Parser Deficiencies

MySQL Parser Deficiencies

MySQL is a very popular database for many applications. But the parser has some important deficiencies that developers need to be aware of. The parser is inconsistent and not robust in the way that it enforces statement syntax. This may lead to ambiguity and application defects that are very difficult to find. I include one example below and link to another blog article that discusses another.

Inadequate Group By Parsing

The MySQL group by function does not require that all non-aggregated elements in the select list be present in the group by clause. This causes an ambiguity because if a given element in the select list is neither present in the group by clause, nor aggregated in the select list, then an arbitrary value is chosen (it actually seems to be the first value but I don’t know if that is always the case).  Consider this example table that contains the number of games a set of players has played in each of four seasons.

We can see that each player has played 100 games in each of the four seasons.

Now consider this simple query to get a sum of all the games that the player has played across all seasons:

As expected, we can see that each player has played 400 games.

But watch what happens if a novice developer places the season number field in the select list without placing it in the group by clause as well:

If the developer does not realize his error, he will think that each player played 400 games in season 1!

The parser should have prevented the query from being executed with the season_number in the select list when it was neither aggregated in some form or present in the group by clause.

This type of error is very difficult to find because the resulting data looks plausible unless you really know the data.

Function Syntax Inconsistency

The next example is related to the way that the parser enforces one statement syntax vs another. This involves the tolerance of whitespace (line breaks) when specifying the sum function vs the format function. This inconsistency does not concern me as much as the group by problem because it does result in an error that is visible to the developer, but it does illustrate a lack of consistency and robustness in the parser.

As MacLoghLainns Weblog describes, the SUM function requires that the opening parentheses be on the same line as the SUM keyword, while the FORMAT function does not have that restriction. View the blog article for more details.

Please reply back with comments or other issues that you have found that might help others avoid unexpected problems in their MySQL applications.

Visit www.budaconsulting.com for more information about Buda Consulting.

Query Rewrite: What is MySQL Doing To My Queries?

The Impact of Query Rewrite on MySQL Query Optimization

When processing a query, every relational database vendor takes a number of similar steps to parse, validate, and optimize a query. MySQL does the following:

  1. Validates the query syntax — does it use valid operators, are the clauses in the proper order, etc.
  2. Validates the query semantics — are the the objects (tables, views, columns) valid database objects in the database, etc.
  3. Transforms the query — rewrites the query so the optimizer can more easily choose the best path.
  4. Optimizes the query — determines the best execution path based on a large number of factors including statistics about the table and index data.

When analyzing a database performance problem, it helps to have an understanding of what is happening during step 3 and 4 above.

Occasionally I find a blog article that does a great job of explaining one these processes for a particular database vendor and I would like to share one that I found today. This excellent post was written by Guilhem Bichot and does a great job of illustrating the transformation (query rewrite) step and the impact of this step on the execution path selection.

This post describes the process in MySQL but the principles are similar for any database vendor.

Note that the example that Guilhem uses shows how the database itself may introduce a query hint while rewriting the query. I have mentioned in the past that I avoid hints because over time the usefulness and necessity of the hint may change and it may indeed hurt you. Of course, since the query rewrite process is executed in real time it is not subject to this problem.

Storing documents in a MongoDB database instead of the file system

Storing documents in a MongoDB database instead of the file system

We are currently working on an application that uses a mysql database. The application produces a large number of csv files that must be kept around for a few weeks, and then removed after being used in various ways by the users of the system.

We need to keep track of all of these files and present a list of the files to the user so they can download them to their local machine, or send them to other systems as needed.  We need to track when the file was created, who created it, and other information.

We considered a few alternatives for managing these files;  We could keep then on the file system and have scripts that keep the database synchoronized with the file system. Or we could load each file into a blob type and keep it in the mysql database. Either of these options will work fine if we are careful. But we are seeking something more.

Our research led us to MongoDB. An open source document oriented database.

MongoDB claims to bridge the gap between key-value stores which are typically used by high volume web applications that have simple data storage requirements, and traditional RDBMS database systems.

In addition to simple queries based on key-value pairs, you can use boolean expressions, regular expressions, set operations, and sorting and limiting. MongoDB also has the ability to grab documents directly from the file system using its gridFS functionality.

We have not implemented MongoDB for this project yet but it looks very promising.

I will send updates as we get closer.