Martin Fowler recently wrote an article about incremental data migration. In it he covers some of the pitfalls of putting off data migration and the benefits of tackling migration iteratively. Since a lot of us are doing rewrites or replacement systems in this day and age, it’s worth a read.
I think there’s an important piece either implicit in the article or missing from it. Namely: when we put off migration, we might be exposing ourselves to the possibility of a large net decrease in the quality and/or functionality of the new system. I’ll explain.
When we’re developing new systems, we’re often correcting errors of our own making or of an inherited past. We are, hopefully, ensuring the integrity of our new data structures, whether they’re produced as a side effect of an object system or domain modeling effort, or whether we’re making some kind of database as a primary project artifact. So can we assume that existing (and scarily dirty) data can be brought over into this new, pristine environment? Clearly the answer is “no.”
Oftentimes we’re working with legacy data from a rat’s-nest system that’s evolved over the years. I remember a particularly nasty data migration that, postponed to the end of the project, took a good month to do. Not just a month of effort, but a month of toil and drudgery!
Indulge me a brief war story. The data in question was from a system that had been through several data migrations and patched, in-place replacements. First Sybase, then Access, then SQL Server 6. There were several tables of questionable value, “day of week” and “gender” immediately springing to mind. One could look at rows as kinds of geological strata. Certain fields became out of date, and screens ended up being coded with conditional logic along the lines of “if the record date is less than a certain day, get this semantic value from this field, otherwise…” As if this wasn’t enough, there were out-and-out data integrity errors of a particularly egregious nature. There was no way the reports siphoning this data could be correct or counted on. At best they were a relative and probabilistic measure of what was really happening in the business.
Naturally I made the mistake of not taking this albatross into account. Nightmare.
What can you do to avoid this situation? As Martin shares, making an initial assessment of the current data structure would be a big first step. If the data is messy, you’d be well served to tackle migration incrementally and early. But what about my (maybe not so) extreme case where data “assets” are in awful shape?
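That initial assessment doesn’t have to be elaborate. As a minimal sketch (the record fields here are hypothetical, not from any particular system), a quick profile of a legacy table can surface the two classic warning signs early: missing values and duplicated keys.

```python
from collections import Counter

def profile_rows(rows, key_field):
    """Summarize basic quality signals for a list of record dicts:
    how many rows are missing each field, and which keys are duplicated."""
    missing = Counter()
    keys = Counter()
    for row in rows:
        for field, value in row.items():
            if value in (None, ""):
                missing[field] += 1
        keys[row.get(key_field)] += 1
    duplicates = {k: n for k, n in keys.items() if k is not None and n > 1}
    return {"rows": len(rows), "missing": dict(missing), "duplicate_keys": duplicates}

# Illustrative legacy data -- names and fields are made up.
legacy = [
    {"tax_id": "11-111", "name": "Acme"},
    {"tax_id": "11-111", "name": "ACME Inc."},
    {"tax_id": None, "name": "Globex"},
]
report = profile_rows(legacy, "tax_id")
```

A report like this, run against a handful of key tables in the first week, tells you whether incremental migration is a nicety or a necessity.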
If you have the luxury of leveraging users to fix data issues, use it. Sometimes these issues can be fixed through the legacy application itself. For example, we introduced a feature in our vendor management module that ensures vendors aren’t duplicated (by their tax ID). In a client’s system there were all kinds of redundant data. We approached them with this issue and worked out a plan of collapsing duplication before counting on migrated data.
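Collapsing duplication along those lines might look like the sketch below. This is illustrative only: the fields are invented, and the merge rule (keep the most recently updated record) is a naive stand-in for what, in practice, users should decide case by case.

```python
from collections import defaultdict

def group_by_tax_id(vendors):
    """Group vendor records by tax ID so duplicates can be reviewed together."""
    groups = defaultdict(list)
    for v in vendors:
        groups[v["tax_id"]].append(v)
    return groups

def merge_group(group):
    """Naive merge rule: keep the most recently updated record.
    A real cleanup would let users pick (or combine) the survivor."""
    return max(group, key=lambda v: v["updated"])

# Hypothetical duplicated vendor rows.
vendors = [
    {"id": 1, "tax_id": "11-111", "name": "Acme", "updated": "2004-01-02"},
    {"id": 2, "tax_id": "11-111", "name": "Acme Inc.", "updated": "2005-06-30"},
    {"id": 3, "tax_id": "22-222", "name": "Globex", "updated": "2003-11-15"},
]
survivors = [merge_group(g) for g in group_by_tax_id(vendors).values()]
```

The point isn’t the merge logic; it’s that the grouping gives users a concrete worklist to clean up in the legacy system before migration depends on that data.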
Sometimes it’s best to raise your guard against imported data in your application’s design. By taking an early assessment (which, in the case of product development, might be an educated guess) we can decide whether old data is trustworthy. If not, we might build our applications in such a way that handling missing or invalid data is part of the app itself. Taking the vendor de-duplication example, we might have built a feature that let users correct duplication in the new system and just brought the data over as is. Expanding on that feature, we might also have prevented 1099s (tax forms) from being generated for suspected duplicates, providing an exception report for these cases.
The problem with designing for bad data is the increased effort — and therefore cost — involved in design, implementation, and test. This strategy, I’d say, should be used as a last resort and sparingly. All disclaimers aside, sometimes it can’t be avoided; we can all probably tell a data horror story or two.
The Real Risk
We’ve been cruising on this new project. We’re happy with the design, and it’s an order of magnitude better than the previous solution. Our client’s going to be thrilled! That’s a lovely feeling to be sure, but in reality the data we’re bringing forward might be a limiting factor in total success. You might have to make disappointing compromises, like scrubbing new features or extending a project’s scope, time, or budget, if you’ve developed a feature that simply isn’t compatible with old data.
A thorough initial assessment paired with incremental migration can help you make Agile decisions about architecture and client involvement. Without techniques like these you’ll essentially be rolling the dice on how long migration takes or, in the worst case, whether new features are practical without a whole slew of compensating or enabling features.