Statistics is a branch of mathematics concerned with the collection, classification, analysis, and interpretation of numerical facts, and it deals largely with drawing inferences about a population characteristic based on a sample. The field matters because the world we live in today relies on data to formulate policies, make decisions, carry out planning, and much more. For example, if the government of Uganda wants to give out free mosquito nets to people living in a certain area, it would need to know how many households exist there, how many people live in each household, and their monthly income, so as to gauge who can and cannot afford a mosquito net. If the government implemented a free education policy, it would need to know how many students benefited from that policy and how free education impacted their lives in order to measure and monitor progress. All of this shows that data and statistics are crucial: we need them to make sense of society as a whole and to measure progress objectively. Without this data, how can we measure our day-to-day problems, let alone fix them?

But when it comes to numbers, you should be skeptical. As Mark Twain once put it, “There are three kinds of lies: lies, damned lies, and statistics.” You need to be able to tell which numbers are reliable and which ones are not, and that starts with learning how to spot bad statistics. This blog is here to help you do exactly that, and it explores how misinformation can occur with statistics.

Misleading statistics

This is by far the most common form of statistical misinformation. It often happens when someone makes up or misstates a statistic that is then spread from one person to another. For example, suppose a minister, speaking at a public event, states that 68 percent of Ugandans are engaged in subsistence farming. Uganda has a population of over 40 million people, so this would translate to over 27 million people engaged in subsistence farming. In actual fact, it is not 68 percent of all Ugandans who are engaged in subsistence farming, but rather 68 percent of those engaged in agriculture in Uganda, a figure closer to 6 million people.
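To make the difference concrete, here is a minimal sketch of the arithmetic. The 40 million population figure comes from the example above; the size of the farming population is a hypothetical assumption, chosen so that 68 percent of it lands near 6 million.

```python
# Illustrative figures only: the national population is the rough figure quoted
# above, the farming population is a hypothetical assumption for demonstration.
national_population = 40_000_000   # total population of Uganda (approximate)
farming_population = 9_000_000     # hypothetical number of people engaged in agriculture
share = 0.68                       # the "68 percent" quoted in the claim

# Reading the percentage against the wrong base inflates the figure dramatically.
misread = share * national_population   # "68% of Ugandans" -> ~27.2 million
correct = share * farming_population    # "68% of those in agriculture" -> ~6.1 million

print(f"Misread claim: {misread:,.0f} people")
print(f"Correct claim: {correct:,.0f} people")
```

The percentage is identical in both readings; only the base it is applied to changes, and that alone shifts the headline figure by more than 20 million people.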

Claims like these, made publicly by public officials, should always be checked before being repeated in news reports, to avoid spreading misinformation. In Uganda, PesaCheck is an initiative that aims to address such scenarios by verifying these claims and publishing correct information.

Neglecting the baseline

This is another way statistics can be used to mislead. It happens when someone compares statistics across different areas without considering the underlying factors. For example, when comparing crime rates between two or more districts in Uganda, one can claim that district A has a higher crime rate than district B simply because district A recorded more criminal cases, ignoring underlying factors such as district A having a larger population and therefore, all else being equal, more cases than district B. A simple fix for such a scenario is to compare crime per capita instead, as the sketch below shows.
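A minimal sketch of that fix, using made-up case counts and populations for the two districts:

```python
# Hypothetical figures: raw case counts versus crime per capita.
districts = {
    # district: (recorded criminal cases, population) -- illustrative numbers only
    "District A": (5_000, 2_000_000),
    "District B": (3_000, 600_000),
}

for name, (cases, population) in districts.items():
    per_capita = cases / population      # crimes per person
    per_100k = per_capita * 100_000      # conventional "per 100,000 people" rate
    print(f"{name}: {cases:,} cases, {per_100k:,.0f} per 100,000 residents")

# District A reports more cases in absolute terms,
# yet District B has the higher crime rate per resident.
```

With these numbers, district A records more cases but only 250 per 100,000 residents, while district B records fewer cases but 500 per 100,000, the opposite of what the raw counts suggest.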

All of these underlying details should be included when reporting statistics, and readers should always look out for them to avoid being misinformed.

Selection/Sampling Bias

Just because 80 percent of the people who responded to your poll chose candidate B does not mean that the same percentage of people will choose the same candidate elsewhere. As a reader, always look for more information about how a private study was conducted before using its data or reporting its statistics. We have all seen election results where one candidate dominates one region and another dominates a different region, and the same can apply to any study.

However, it is also wrong to dismiss statistics from polls or private studies as unreliable simply because you were not contacted for your answer, or because not everybody is included in the poll or study. It is entirely possible to get a view of millions of people by interviewing only a few thousand of them: a poll drawn from an unbiased sample, with truthful answers, can provide meaningful and reliable results.
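Here is a minimal simulation of that idea. The population size, the true level of support, and the sample size are all made-up figures; the point is only that an unbiased sample of a couple of thousand respondents lands very close to the true value.

```python
# Simulating how a few thousand unbiased interviews can estimate the view of
# millions. All figures are hypothetical and chosen purely for illustration.
import random

random.seed(42)

true_support = 0.55    # hypothetical true share of the population supporting candidate B
sample_size = 2_000    # number of people actually interviewed

# Draw a simple random (unbiased) sample and estimate the share.
sample = [random.random() < true_support for _ in range(sample_size)]
estimate = sum(sample) / sample_size

# Rough 95% margin of error for a proportion: 1.96 * sqrt(p(1-p)/n)
margin = 1.96 * (estimate * (1 - estimate) / sample_size) ** 0.5

print(f"True support:    {true_support:.1%}")
print(f"Sample estimate: {estimate:.1%} ± {margin:.1%}")
```

With 2,000 respondents the margin of error is only about two percentage points, which is why national polls do not need to interview everyone. The trouble begins when the sample is not random, for example when one region or group is over-represented.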

Data Communication and Data Visualization

Data visualization exists to communicate data findings in an easily digestible format that audiences from different backgrounds can understand. But visualizations can also be very misleading and can be used to spread misinformation, directly or indirectly. As a reader of a data visualization, always look out for the following features, which should be part of any visualization you are viewing:

  • The scales used
  • The starting value (zero or otherwise)
  • The method of calculation (e.g., dataset and time period)

If any of these are missing, be skeptical about the visualization; as illustrated in the sketch below, something as simple as the starting value of an axis can change the story a chart tells. We published a blog on identifying misinformation in a data visualization which can be accessed here.
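A minimal, self-contained illustration of the baseline problem, using matplotlib and two hypothetical values: plotted on a truncated axis, a two-point difference looks dramatic; plotted from zero, it all but disappears.

```python
# Hypothetical example: the same two values plotted with a truncated y-axis
# versus a zero baseline.
import matplotlib.pyplot as plt

labels = ["District A", "District B"]
values = [98, 100]   # made-up, nearly identical figures

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: the 2-point gap looks enormous.
ax_truncated.bar(labels, values)
ax_truncated.set_ylim(97, 101)
ax_truncated.set_title("Truncated axis (misleading)")

# Zero baseline: the same data look, correctly, almost identical.
ax_full.bar(labels, values)
ax_full.set_ylim(0, 110)
ax_full.set_title("Zero baseline (honest)")

plt.tight_layout()
plt.show()
```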

Publication bias

This is sometimes referred to as “fudging the data” and occurs when a researcher chooses to publish only the results that follow a pattern consistent with the preferred hypothesis, while ignoring other results or “data runs” that contradict it.

Statistical fallacies are the greatest contributor to data-driven “fake news”, and the challenge is that few citizens have a basic understanding of statistics and data. Even the basic areas of descriptive statistics can be hard to grasp. For example, if you told a secondary school student in Uganda that one out of every four teenage girls in Uganda is either pregnant or already a mother, they might test the claim in their own class and find it to be false. Now imagine explaining to them why it didn’t apply to their class: a single classroom is not a representative sample of all teenage girls, not least because girls who become pregnant are less likely to remain in school.

Statistical fallacies are not the only contributors to data-driven “fake news”: outliers, missing data, and non-normality can all adversely affect the validity of a statistical analysis. It is therefore good practice to study the data and repair such problems before the analysis begins, as a safeguard against spreading misinformation with statistics.
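As a minimal sketch of what such pre-analysis checks might look like, assuming a hypothetical survey file and column name (neither comes from this post):

```python
# Basic data-quality checks before analysis, on a hypothetical survey dataset.
import pandas as pd

df = pd.read_csv("survey_responses.csv")   # hypothetical file name

# Missing data: how many values are absent in each column?
print(df.isna().sum())

# Outliers: flag monthly incomes more than 3 standard deviations from the mean.
income = df["monthly_income"]              # hypothetical column name
outliers = df[(income - income.mean()).abs() > 3 * income.std()]
print(f"{len(outliers)} potential outlier rows")

# Non-normality: skewness far from 0 suggests the data are not normally distributed.
print(f"Income skewness: {income.skew():.2f}")
```

None of these checks fixes a problem on its own, but each flags something that should be investigated and explained before the statistics are reported.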

Written by Arthur Kakande, Communications Lead at Pollicy.