The Perils Of Bad Data And Bad Data Interpretation


A friend on Facebook posted a write-up in the American Thinker about a report issued by the Texas Secretary of State earlier this year claiming that 58,000 illegals voted in Texas elections between 1996 and 2015. The report was thoroughly discredited in court due to its bad methodology. Once the study was scrutinized, the number of non-citizens supposedly found voting in Texas elections fell from 58,000 to about 80. My friend later posted a much better study on the topic. The better study offers talking points for both sides of the political divide: some non-citizens do vote in US elections, but the number who do is quite small. Here’s the conclusion of the study:

“””Our exploration of non-citizen voting in the 2008 presidential election found that most non-citizens did not register or vote in 2008, but some did. The proportion of non-citizens who voted was less than fifteen percent, but significantly greater than zero. Similarly in 2010 we found that more than three percent of non-citizens reported voting.

These results speak to both sides of the debate concerning non-citizen enfranchisement. They support the claims made by some anti-immigration organizations that non-citizens participate in U.S. elections. In addition, the analysis suggests that non-citizens’ votes have changed significant election outcomes including the assignment of North Carolina’s 2008 electoral votes, and the pivotal Minnesota Senate victory of Democrat Al Franken in 2008.

However, our results also support the arguments made by voting and immigrant rights organizations that the portion of non-citizen immigrants who participate in U.S. elections is quite small. Indeed, given the extraordinary efforts made by the Obama and McCain campaigns to mobilize voters in 2008, the relatively small portion of non-citizens who voted in 2008 likely exceeded the portion of non-citizens voting in other recent U.S. elections.”””

The study above relies heavily on data from two studies by Stephen Ansolabehere (2010, 2011). Ansolabehere later coauthored a paper pointing to severe flaws in the way Richman, Chattha, and Earnest used the data. The original studies and the data they provided were not designed to answer this question (this is one example of “p-hacking”). As Ansolabehere states in a rebuttal:

“””Suppose a survey question is asked of 20,000 respondents, and that, of these persons, 19,500 have a given characteristic (e.g., are citizens) and 500 do not. Suppose that 99.9 percent of the time the survey question identifies correctly whether people have a given characteristic, and 0.1 percent of the time respondents who have a given characteristic incorrectly state that they do not have that characteristic. (That is, they check the wrong box by mistake.) That means, 99.9 percent of the time the question correctly classifies an individual as having a characteristic—such as being a citizen of the United States—and 0.1 percent of the time it classifies someone as not having a characteristic, when in fact they do. This rate of misclassification or measurement error is extremely low and would be tolerated by any survey researcher. It implies, however, that one expects 19 people out of 20,000 to be incorrectly classified as not having a given characteristic, when in fact they do.

Normally, this is not a problem. In the typical survey of 1,000 to 2,000 persons, such a low level of measurement error would have no detectable effect on the sample. Even in very large sample surveys, survey practitioners expect a very low level of measurement error would have effects that wash out between two categories. The non-citizen voting example highlights a potential pitfall with very large databases in the study of low frequency categories. Continuing with the example of citizenship and voting, the problem is that the citizen group is very large compared to the non-citizen group in the survey. So even if the classification is extremely reliable, a small classification error rate will cause the bigger category to influence analysis of the low frequency category in substantial ways. Misclassification of 0.1 percent of 19,500 respondents leads us to expect that 19 respondents who are citizens will be classified as non-citizens and 1 non-citizen will be classified as a citizen. (This is a statistical expectation—the actual numbers will vary slightly.) The one non-citizen classified as a citizen will have trivial effects on any analyses of the overall pool of people categorized as citizens, as that individual will be 1 of 19,481 respondents. However, the 19 citizens incorrectly classified as non-citizens can have significant effects on analyses, as they are 3.7 percent (19 of 519) of respondents who said they are non-citizens.

Such misclassifications can explain completely the observed low rate of a behavior, such as voting, among a relatively rare or low-frequency group, such as non-citizens. Suppose that 70 percent of those with a given characteristic (e.g., citizens) engage in a behavior (e.g., voting). Suppose, further, that none of the people without the characteristic (e.g., non-citizens) are allowed to engage in the behavior in question (e.g., vote in federal elections). Based on these suppositions, of the 19 misclassified people, we expect 13 (70%) to be incorrectly determined to be non-citizen voters while 0 correctly classified non-citizens would be voters. Hence, a 0.1 percent rate of misclassification—a very low level of measurement error—would lead researchers to expect to observe that 13 of 519 (2.8 percent) people classified as non-citizens voted in the election, when those results are due entirely to measurement error, and no non-citizens actually voted.

This example parallels the reliability and vote rates in the CCES 2010-2012 panel survey. From this we conclude that measurement error almost certainly explains the observed voting rate among self-identified non-citizens in the CCES—as reported by Richman and his colleagues. “””
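Ansolabehere’s arithmetic is easy to reproduce. A minimal sketch in Python, using only the numbers from the quoted example (nothing here is new data), shows how a tiny misclassification rate in the large group manufactures an apparent voting rate in the small one:

```python
# Reproducing the measurement-error arithmetic from the quoted example.
# All inputs are taken from Ansolabehere's hypothetical, not real survey data.

n_respondents = 20_000
n_citizens = 19_500
n_noncitizens = n_respondents - n_citizens  # 500
error_rate = 0.001           # 0.1% of citizens check the wrong box
citizen_vote_rate = 0.70     # 70% of citizens vote
# Assumption of the example: no actual non-citizen votes at all.

# Expected citizens misclassified as non-citizens (~19 people)
misclassified = n_citizens * error_rate

# The observed "non-citizen" pool now mixes real non-citizens with errors (~519)
observed_noncitizen_pool = n_noncitizens + misclassified

# Of the misclassified citizens, 70% voted (~13 phantom "non-citizen voters")
phantom_voters = misclassified * citizen_vote_rate

apparent_rate = phantom_voters / observed_noncitizen_pool
print(f"Apparent non-citizen voting rate: {apparent_rate:.1%}")
```

Even with zero actual non-citizen voting, the computed apparent rate comes out in the 2–3 percent range, which is the point of the rebuttal: the entire observed effect can be an artifact of a 0.1 percent measurement error.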

When I was Conservative, I used to support the idea of voter ID to ensure illegals were not voting and stealing elections. I changed my mind because no one could ever produce evidence that this kind of voter fraud was happening at a rate that justified the possible disenfranchisement of legal voters. A recent study suggests that voter ID laws don’t cause much disenfranchisement, but they also don’t do much to stop voter fraud. Of course, the Conservative press only reported the result it liked: that voter ID laws don’t seem to lead to detectable disenfranchisement. It didn’t mention that there doesn’t seem to be any detectable fraud either. Unfortunately this paper is behind a paywall, but I’ll provide a link in case anyone wants to fork out the dough to buy it. This is what the abstract reports:

U.S. states increasingly require identification to vote – an ostensive attempt to deter fraud that prompts complaints of selective disenfranchisement. Using a difference-in-differences design on a 1.3-billion-observations panel, we find the laws have no negative effect on registration or turnout, overall or for any group defined by race, gender, age, or party affiliation. These results hold through a large number of specifications and cannot be attributed to mobilization against the laws, measured by campaign contributions and self-reported political engagement. ID requirements have no effect on fraud either – actual or perceived. Overall, our results suggest that efforts to reform voter ID laws may not have much impact on elections.

So there seem to be two lessons here. First: when you post things to support your political position, make sure your supporting data is accurate and says what you think it says. Second: if you want to argue for legislation to correct a problem, make sure there is a real problem to solve. Voter ID still looks like a solution in search of a problem.
