Data quality assurance requires real users
A hard lesson that I've learned over the past few months is that data quality assurance requires real data users. I wrote a handful of data processing scripts two years ago to batch parse literature citations. The workflow was functioning well, and I even ran multiple test runs on different kinds of references to find bugs. I've returned to these scripts and started processing new sets of data. I've found that the resulting datasets are riddled with problems . Some references are skipped entirely by the workflow, with no flagging system, and other features I excitedly added to the scripts are now broken. This lesson has come to bear on my work a number of times, with software and workflows written by other, highly competent folks. You just can't predict what the problems with your workflow will be, until you have real users. What is the solution here? I hope I will find some clarity soon.