What Recommendations Should NOT Be Done!?

This article goes along similar lines to the piece I did on the security of MySQL. I came across another blog that asked about programming tips that really should not be followed in the real world. I thought it might be interesting to get some ideas of what areas of MySQL or database theory that is taught as "religion" at University or even in starting positions that just don't work in the real world in some situations.

I will get the ball rolling by adding a few points that are often drummed into people, but don't work in all situations.

1. Normalisation: This is one that is always pushed right from the start and does offer benefits in many situations. However, when you start to look at much larger datastores such as data warehousing, reporting and analytics and streaming data, normalisation will often become a hinderence to the working of the database. Joing multiple tables that can be hundreds of millions of records each is not a good option for performance, so denormalisation is considered as a viable alternative in these cases.

2. Use the Query Cache: Caching is often seen as a great performance improvement and is considered as one feature that should always be one. However, with the MySQL query cache, the performance benefits can vary depending on the data set that is being used and the type of queries that are commonly run. For example, having a query cache active for data warehousing is not good as it attempts to fill the cache, gets too full and aborts to fetch from the table anyway. It is much easier to turn off the query cache and access the table directly the first time. Similarly, if you have a high ratio of updates, the cache will continually be cleared and not usable for real caching benefits.

Are there any other points that people have come across that are often seen as necessary or offer great benefits that in real situations are actually a problem?

Okay, obviously we can't

Okay, obviously we can't just group it all together and say all tables in a data warehouse should be denormalised. A "typical" data warehouse is a bit anomalous as there are a number of different models for the data warehouse such as star schemas, but also snowflake schemas. The denormalisation comes into play when looking at how you are defining the fact and the dimension tables.

A simple example is the time dimension. Most data warehouses will have one and the base components are year, month and day. However, we want word names as well as integers, so in normalised form, it would be:

+----------+ +-----------+ +----------+
| Day | | Month | | Year |
+----------+ +-----------+ +----------+
|id | |id | |id |
|day | |month | |year |
|dayname | |monthname | +----------+
+----------+ +-----------+

However, when looking at a time dimension you would have it as:

+----------+
| Time |
+----------+
|id |
|day |
|dayname |
|month |
|monthname |
|year |
+----------+

Why is that? Because there will be many other fields in the time dimension often including quarters, fiscal years, and other specific time periods. Also, have it denormalised mean we add some small overhead in the dimension, but it means joining only one dimension to the fact (assuming simple query here) rather than with 3 separate tables.

As was mentioned, if the dimension is small enough, it may not make that much difference, but it can make a difference.

This is just an example with data warehouses, it becomes even more important when using data streams where any joins can be detrimental to the performance of the system.

The second part of the question was if the denormalisation was to define the use of star schema or snowflake models, that is, do we denormalise the fact table? One of the methods most commonly used in denormalisation is that of aggregation. By adding columns which contains sums, averages, minimums and maximums we are denormalizing the model. This can sometimes be done in the actual dimensions, other times, it may be appropriate to have it in the fact tables. So I guess the overall answer is "maybe"! :)

The main point is that much of what is seen is "rules" is not always 100% applicable to every situation. It is possible to have a data warehouse where you have fully normalised tables, but could it be better with a different model? In many cases you will actually have a combination or even both models for added redundancy. This allows for quick updating into a normalised model and more efficient reporting from the star-based, denormalised model.

So to sum up, look for a design that will best suit your requirements, sometimes it may be fully normalised, sometimes in may be denormalised, or sometimes even both at the same time :)