Data Fallacies
Overview
In this lab, we will examine a series of common errors in reasoning that emerge when presenting or countering data-based evidence. Then you will identify examples of how these errors in reasoning may emerge in your final projects.
Variable Selection
Cherry-picking
Cherry-picking occurs when an analyst considers only the variables that support their claims, ignoring other relevant variables that might change the outcome.
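Here's a minimal sketch of how this can happen in practice. All of the variables below are pure noise, but with enough candidates, the strongest of twenty correlations will often look convincing if it's the only one reported. The variable names and data are hypothetical.

```python
import random

random.seed(0)

# Hypothetical outcome and 20 candidate explanatory variables, all pure noise.
n = 50
outcome = [random.gauss(0, 1) for _ in range(n)]
candidates = {f"var_{i}": [random.gauss(0, 1) for _ in range(n)] for i in range(20)}

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

correlations = {name: corr(vals, outcome) for name, vals in candidates.items()}

# Cherry-picking: report only the strongest correlation and ignore the rest.
best = max(correlations, key=lambda k: abs(correlations[k]))
print(f"Reported: {best} correlates with the outcome (r = {correlations[best]:.2f})")
print(f"Ignored: {len(correlations) - 1} weaker correlations")
```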
Proxy Discrimination
In some cases, there’s no easy way to measure a certain phenomenon, so an analyst relies on a series of proxy variables - variables that are meant to stand in for that phenomenon. For instance, maybe it’s hard to measure happiness (because it’s not necessarily clear what that even means), so an analyst relies on variables like access to public welfare resources and number of working hours in a week as proxies for happiness. Proxy discrimination occurs when an analyst selects variables that are assumed to be representative of what they’re trying to measure, but those variables are actually correlated with other social markers. Relying on those variables in data analysis and decision-making leads to situations where individuals are discriminated against by proxy.
For example, there’s been excellent critical research into an algorithm used in criminal justice settings called COMPAS. Among other things, this algorithm attempts to predict the likelihood that someone recently released from a carceral institution will commit a crime. While the algorithm does not use race in its calculations, it does use variables correlated with race - such as high school drop-out rates and family histories of incarceration. In relying on these proxies, the algorithm produces results that factor in race even without explicitly including the variable. That’s proxy discrimination.
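A toy simulation can make the mechanism visible. The groups, proxy, and scoring rule below are all hypothetical - the point is only that a rule which never sees a protected attribute can still produce systematically different scores when it relies on a variable correlated with that attribute.

```python
import random

random.seed(1)

# Hypothetical population: group membership (A/B) is never given to the
# scoring rule, but a proxy variable is strongly correlated with it.
people = []
for _ in range(1000):
    group = random.choice(["A", "B"])
    # The proxy (e.g., a neighborhood-level measure) tracks group membership.
    proxy = random.gauss(1.0 if group == "B" else 0.0, 0.5)
    people.append({"group": group, "proxy": proxy})

def risk_score(person):
    """A 'group-blind' scoring rule that relies only on the proxy."""
    return person["proxy"]  # group is never consulted

# Yet the scores differ systematically by group.
for g in ["A", "B"]:
    scores = [risk_score(p) for p in people if p["group"] == g]
    print(f"Group {g}: mean score = {sum(scores) / len(scores):.2f}")
```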
Filtering
Sampling Bias
Sampling bias occurs when an analyst subsets their data to a sample that is unrepresentative of the broader population and then uses that sample to make claims about the broader population.
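To see how much a biased subset can distort an estimate, here's a small sketch with made-up commute times. The filtering rule is contrived, but it stands in for any sampling procedure that systematically excludes part of the population.

```python
import random

random.seed(2)

# Hypothetical population: 10,000 commute times (minutes), long tail of long commutes.
population = [random.expovariate(1 / 30) for _ in range(10_000)]

# Biased subset: surveying only people reachable at their desks at 9 a.m.,
# which (in this toy setup) excludes everyone with a long commute.
sample = [t for t in population if t < 25][:500]

print(f"True population mean: {sum(population) / len(population):.1f} min")
print(f"Biased sample mean:   {sum(sample) / len(sample):.1f} min")
```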
Ecological Fallacy
The ecological fallacy occurs when an analyst assumes that certain characteristics of a broader population can automatically be attributed to individuals in that population. In other words, this fallacy occurs when an analyst calculates population-wide statistics and assumes they apply to everyone, failing to consider outliers. I like to think of this fallacy as a “failure to filter.”
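For a quick illustration with hypothetical numbers: a neighborhood’s mean income can describe no actual resident at all when a few outliers pull the average upward.

```python
# Hypothetical neighborhood incomes (thousands of dollars): a few very high
# earners alongside many lower earners.
incomes = [22, 25, 28, 30, 31, 33, 35, 250, 310, 400]

mean_income = sum(incomes) / len(incomes)
print(f"Neighborhood mean income: ${mean_income:.0f}k")

# The ecological fallacy: assuming each resident earns roughly the mean.
near_mean = [x for x in incomes if abs(x - mean_income) < 20]
print(f"Residents actually within $20k of the mean: {len(near_mean)} of {len(incomes)}")
```

Here the mean is about $116k, yet not a single resident earns anywhere near that amount.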
Aggregation
Base Rate Fallacy
The base rate fallacy occurs when an analyst prioritizes case-specific information over statistics when making certain judgments. It’s when we let individual instances shape our assumptions about what is generally true. I like to think of this fallacy as a “failure to aggregate.”
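The classic illustration involves a screening test. The numbers below are hypothetical, but the arithmetic (a direct application of Bayes’ rule) shows why ignoring the base rate is a mistake: even a very accurate test for a rare condition produces mostly false alarms.

```python
# Hypothetical screening test: 1% of people have a condition (the base rate),
# the test catches 99% of true cases and false-alarms on 5% of healthy people.
base_rate = 0.01
sensitivity = 0.99      # P(positive | condition)
false_positive = 0.05   # P(positive | no condition)

# Bayes' rule: P(condition | positive)
p_positive = sensitivity * base_rate + false_positive * (1 - base_rate)
p_condition_given_positive = sensitivity * base_rate / p_positive

print(f"P(condition | positive test) = {p_condition_given_positive:.1%}")
# ~16.7% - the vivid "positive test" dominates our intuition, but the low
# base rate means most positives come from the much larger healthy group.
```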
Flaw of Averages
The flaw of averages occurs when we rely on averages (medians, means, and modes) to shape our understanding of what is generally true. The problem is that averages fail to account for differences and fluctuations that we can only see by disaggregating the data.
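To make this concrete, consider a toy example (the demand numbers are made up): planning around the average works poorly when the underlying values fluctuate widely.

```python
# Hypothetical daily demand at a food bank: the mean looks manageable,
# but the day-to-day fluctuation is large.
daily_demand = [40, 45, 50, 55, 60, 140, 150]  # spikes on weekends

mean_demand = sum(daily_demand) / len(daily_demand)
print(f"Mean daily demand: {mean_demand:.0f} meals")

# Planning capacity at the mean: how often do we fall short?
capacity = mean_demand
shortfall_days = sum(1 for d in daily_demand if d > capacity)
print(f"Days under-supplied when stocking the mean: {shortfall_days} of {len(daily_demand)}")
```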
Here’s a really nice video on the “flaws of averages”:
Simpson’s Paradox
Simpson’s paradox refers to a phenomenon in which a trend that appears in data reverses when we disaggregate that data.
I recently came across an example of Simpson’s Paradox in my own research. We were investigating the percentage of carceral facilities that were “proximate”* to sites contaminated by a class of chemicals known as PFAS. We found 48% of carceral facilities in the U.S. to be proximate. We then compared this to the percentage of hospitals that were proximate to sites contaminated by PFAS and found 56% of hospitals in the U.S. to be proximate. This suggested that a higher percentage of hospitals than of carceral institutions were proximate to these sites. However, when we disaggregated the data by whether the institution was in an urban or rural setting, the percentages were higher for both rural and urban carceral institutions than for hospitals. How did this happen? I’ll draw this out on the board to explain!
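The counts below are invented, not the actual study data, but they reproduce the structure of this result: carceral facilities are mostly rural, hospitals are mostly urban, and proximity rates are lower in rural areas across the board - so the aggregate comparison flips.

```python
# Hypothetical (proximate, total) counts by institution type and setting.
data = {
    "carceral": {"urban": (420, 600), "rural": (345, 1000)},   # mostly rural
    "hospital": {"urban": (1090, 1900), "rural": (30, 100)},   # mostly urban
}

for institution, settings in data.items():
    total_prox = sum(p for p, _ in settings.values())
    total_all = sum(t for _, t in settings.values())
    print(f"{institution}: overall {total_prox / total_all:.0%} proximate")
    for setting, (p, t) in settings.items():
        print(f"  {setting}: {p / t:.0%} proximate")
```

Carceral institutions come out ahead within both the urban and rural subsets, yet behind overall, because most of them sit in the rural group where proximity rates are low for everyone.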
Correlation
False Causality
False causality occurs when an analyst assumes a causal relationship between variables that have only been shown to be correlated. This ignores the possibility that additional confounding variables may be impacting how the variables appear to correlate. For instance, let’s say I work for a municipal housing department, and I notice a spike in complaints about lack of heat and hot water in buildings following a November hurricane. At first, I assume that this is due to the hurricane. But then I zoom out and notice that this same spike occurs every year at that time - because temperatures are dropping. We can’t assume that the hurricane caused the spike in complaints just based on their correlation. Seasonality is the confounding variable here.
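A simulation along these lines (with invented numbers) shows how a confounder manufactures an apparent effect: complaints follow the seasons, a storm happens to land in November, and the storm months look dramatically worse even though the storm contributes nothing.

```python
import math
import random

random.seed(3)

# Hypothetical monthly heat/hot-water complaints over 5 years: complaints rise
# every winter because temperatures drop (seasonality is the confounder).
months = list(range(60))
complaints = [
    100 + 80 * math.cos(2 * math.pi * m / 12) + random.gauss(0, 10) for m in months
]

# A storm that happens to land every November (month index 10 of each year).
storm = [1 if m % 12 == 10 else 0 for m in months]

# Complaints in storm months vs. other months look very different...
storm_mean = sum(c for c, s in zip(complaints, storm) if s) / sum(storm)
other_mean = sum(c for c, s in zip(complaints, storm) if not s) / (len(months) - sum(storm))
print(f"Mean complaints in storm months: {storm_mean:.0f}")
print(f"Mean complaints in other months: {other_mean:.0f}")
# ...but the storm adds nothing in this simulation: cold weather,
# not the storm, drives the spike.
```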
Sometimes correlations are coincidental. Check out this website, which includes a number of spurious correlations.
Interpretation
McNamara Fallacy
The McNamara fallacy refers to instances when an analyst prioritizes numbers and statistics over meaningful and relevant qualitative information. It’s an over-reliance on metrics. The problem with this is that there are many things that can’t easily be measured. We end up only producing knowledge on those things that we can easily measure.
Goodhart’s Law
Goodhart’s Law indicates that as soon as a measure becomes a target, it ceases to be a good measure. This is because treating numbers as targets incentivizes individuals to “game the system.”
Instructions
In small groups, select five of the above fallacies and discuss examples of how each might lead to specific errors in reasoning in relation to your data. Narrow your conversation down to the two most compelling examples, and see if you can produce data visualizations in Tableau that demonstrate these errors in reasoning.