Data Street Cred
I am not known for being a statistics whiz. I have published quantitative work, but I am seen, rightly so, as more comfortable with qualitative work, comparing apples and oranges. Still, I had the gumption to offer advice on Twitter about data today. What and why?
GDELT was a new dataset that seemed to promise heaps of utility to those who wanted to study event data, which are counts of particular events of interest and are handy for analyzing events over time. It came under fire recently for a variety of reasons. I have not used the dataset, nor do I work with event data, so I am not in any position to judge the dataset itself.
However, I do have experience working with data that has been criticized. The Minorities at Risk Project was originally an effort to assess which ethnic groups might be at risk of violence. The collection selected groups that were either mobilized or already facing discrimination. As a result, it was not so good for questions about why groups become mobilized or face discrimination, since groups that were neither mobilized nor facing discrimination were left out. For the questions I tended to ask, it was less problematic: which groups at risk tend to get more or less international support (least problematic), which groups at risk were more likely to be secessionist or irredentist (a bit problematic), which institutions are associated with more or less ethnic conflict (more problematic).
Once the dataset was criticized by some big names, pop. It got harder to publish stuff as reviewers scoffed at any findings emanating from the dataset. The good news for MAR fans is that this led to an NSF project that funded efforts to address the selection bias problem. The first piece addressing the new dataset has recently been accepted. The second piece is in the works, and now I am back in the business of pondering the relationships between institutions and ethnic conflict (the delays are my fault now for being distracted by other projects).
Anyhow, the relevance of my experience is this: GDELT is now tarnished, which means it will be harder to get stuff published, as reviewers will be harder to convince. The peer review process depends on convincing reviewers of the importance of the question, the soundness of the research design, the quality of the data, the interpretation of the findings, and so on. Given my experience, I expect that using GDELT will be risky if you want publications in the near term. Over time, the problems might be fixed or might turn out not to be that bad. But for now, its reputation is lousy. No, this post is not going to do the dirty work of making its reputation bad. That much has already been achieved. I am just making it clear that what matters is not whether one believes the data to be spiffy, but what lies in the minds of reviewers.
So, user beware. Here be dragons.