Saturday, March 23, 2013

I like to move it! Move it! Part 1.

Talking this Friday with my co-workers I've got an impression that even the best developers may not have a clear idea what does it mean to run a major migration project to or from their systems. To be fair, before I had a very serious involvement in a couple of big data transformations, I was as clueless as anyone else. And the worst part - it is not a matter of skills, it's about a mindset! So, I decided to start a series of posts on migration. Maybe my thoughts will help somebody to shorten the learning curve!

Part 1. Magic eye

If you've run one big migration project, about 9 out of 10 times in the next one you can predict potential issues better than any astrologist. Especially if the source system was in use for more than a couple of years.

For example:
- if you see DATE information, stored as text - there will be some percentage of data corruption. Not only format could be wrong! Don't bee surprised to see "I don't know" there.

- if you see a NUMBER field with the name AGE - unless you cleanse the data, you may have somebody born in the times of Ancient Rome. I am not joking - in the migration project I am doing right now I've got a couple of guys with birthday somewhere around 1 Century AD. Overall, if you are migrating from the system that has been operational in 90-ies, you MUST check for the quality of DATE fields. Sorry, Y2K problem is still around :-)

- if you see generic tables serving as parents in associations - there will be some number of strange relationships. For example, in the table PARTY we store both schools and offices. In the separate table we store school officials. I can bet, that at least one non-school office will have a dean :-)

- if there is a column storing "type" attribute - there will be some number of rows with types that cannot be explained by anybody.

- if there is a structure with start/end dates that normally should describe a logical chain - there will be always be a couple of overlapping ranges that just don't make sense. Or there will be unexplainable gaps!

- if you see ID in the column without associated FK - be prepared for child rows pointing to non-existing objects. 

I could count even more issues, but the pattern is very clear - all of those cases describe business rules not enforced in the database. Developers usually "hope" that application-enforced rules will be sufficient enough to keep proper data quality. Unfortunately, everybody forgets about feeds, services, batch jobs etc that work directly with the database. In that case (unless you have very clear server-side documentation) people do not often even know about existing rules!

Summary of part 1: data quality should be enforced as close to data layer as possible. All special cases should be documented and available to anybody working with the database directly.




No comments: