I have been helping people to make use of secondary data for nearly two decades. When I work with groups who have little or no statistical experience, a concept we deal with early on is how to correctly read row and column percentages in a table (which might sound simple, but misreading them is probably the MOST common mistake made in analysing or reading data).
The easiest way to teach this skill is to show people percentages that they should know. For example, almost no one would intuitively know what percentage of Jeep Wrangler drivers are single women, but they are very likely to know what percentage of newborn babies are female. Because everyone knows nature delivers babies of each sex in equal proportion, right? Right?
The problem is, whenever I would interrogate a dataset to show these percentages and demonstrate the point, the proportion of males to females was never quite 50/50. Naturally, this led me to investigate this counter-intuitive data issue more closely.
For some age groups, an imbalance is relatively easy to explain. Both very old and very young men are particularly adept at engaging in behaviours that limit their lifespan. At the one end, heart attacks, strokes and various ‘lifestyle’ deaths plague men in their later years, and at the other, boys in their teens and twenties often pair alcohol and drugs with vehicles and water, which serves to thin the (male) herd. Just Google ‘Craziest stunts of all time’ or ‘Extreme close calls’, and you will notice a marked absence of female actors. So, it makes sense that in these age groups one would expect more females than males.
But, confusingly, I kept finding the numbers weren’t 50/50 at birth either, with more male babies being born than female babies.
One could brush it off as a quirk of unreliable secondary data, since even Census data (particularly in countries where the collection of data is fraught with many challenges) is less accurate at that level of granularity, but the pattern is evident in every single dataset.
As a Data Analyst, one of the first things you do is check your data to see if that might be the problem.
There are a few ways to do this:
Investigate who collected the data, methodologies used, and purpose of the dataset.
Check to see if this pattern occurs in the same dataset in previous years.
Test other similar secondary sources to see if the pattern is repeated in other datasets.
Check similar datasets from other countries to see if they show the same pattern.
Checking the data that I was using confirmed that more male babies than female babies are born. In terms of a reliable, large dataset of South African demographics, it doesn’t get better than the Census (although there are obvious flaws one must be aware of), and this pattern shows up in previous Censuses as well. Another large dataset, the General Household Survey, also confirms the pattern, as does the StatsSA Recorded Live Births.
Total recorded live births: 21 276 873
Male: 10 711 573
Female: 10 565 300
Data from the StatsSA Recorded Live Births 2020. The monograph provides information on live birth occurrences and registrations from 2000 to 2020 to show trends and patterns in recorded live births.
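For anyone who wants to check the arithmetic, here is a minimal Python sketch that turns the three StatsSA figures into a male share and a males-per-100-females ratio. It assumes the larger of the two component figures is the male count, which is what the surrounding text implies; the variable names are mine, not StatsSA's.

```python
# Recorded live births, South Africa, 2000 to 2020 (figures quoted above).
# Assumption: the larger component figure is the male count.
total = 21_276_873
male = 10_711_573
female = 10_565_300

# Sanity check: the two components should add up to the total.
assert male + female == total

# Percentage of all recorded births that are male.
male_share = 100 * male / total

# The usual way of quoting a sex ratio at birth: males per 100 females.
males_per_100_females = 100 * male / female

print(f"Male share of births: {male_share:.2f}%")
print(f"Males per 100 females: {males_per_100_females:.1f}")
```

The share and the ratio describe the same imbalance in two different ways, so it is worth being clear about which one a source is quoting.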
And some Googling finds this pattern in other countries as well. For England and Wales in 2017, there were 348,071 live male births and 331,035 live female births, a pattern that has been evident for nearly 180 years. In fact, globally there are on average 105 male babies born per 100 female babies.
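Since this all started with row and column percentages, the two sets of counts quoted so far can be used to show the difference. The sketch below is only an illustration: it assumes the larger South African figure is the male count, the row labels and function names are my own, and putting a 21-year South African total next to a single year for England and Wales is done purely to show the mechanics, not to compare the two countries.

```python
# Counts quoted in the text. Rows are datasets, columns are sex.
# Assumption: the larger South African figure is the male count.
births = {
    "South Africa 2000-2020": {"male": 10_711_573, "female": 10_565_300},
    "England and Wales 2017": {"male": 348_071, "female": 331_035},
}


def row_percentages(table):
    """Express each cell as a share of its row total (each row sums to 100)."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Express each cell as a share of its column total (each column sums to 100)."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


row_pct = row_percentages(births)
col_pct = column_percentages(births)

for name in births:
    print(
        f"{name}: {row_pct[name]['male']:.1f}% of its births are male; "
        f"it supplies {col_pct[name]['male']:.1f}% of the male births in this table"
    )
```

The same cell produces two very different numbers. The row percentage answers "what share of this dataset's births are male?", while the column percentage answers "what share of the male births in the table come from this dataset?". Reading one as the other is exactly the mistake described at the start.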
This common occurrence is not entirely understood, but there are some theories used to explain the data:
Evolutionary factors: Males have evolved into greater risk takers than females. As mentioned above, there is a noticeable absence of female performers in YouTube videos with titles that include the words ‘most dangerous’, or ‘craziest stunts’. At all ages, men are more likely to die from health problems and suicide, as well as from poor choices around risky activities. But this doesn’t explain the difference at birth.
The War Theory: There are thinly supported theories suggesting that more male births occur during times of conflict and war. Data from numerous countries has shown markedly more male than female babies being born after a war. One theory suggests that some men are genetically inclined to produce more sons. Men with more sons were more likely to see at least one son return from war, and those surviving sons, having inherited the tendency, were in turn more likely to have male children.
Reproductive factors: It has been proposed that sperm carrying a Y chromosome (which produce males) swim faster than sperm carrying an X chromosome (which produce females), while the latter live longer, so the timing and location of the mother’s egg can determine the sex of the child. In addition, research suggests that conception results in approximately 150 male zygotes for every 100 female zygotes. Further, females are more likely to be lost in the womb early on in the pregnancy, while more males are stillbirths. A report in Proceedings of the National Academy of Sciences supports this view.
Human Intervention: In certain countries where male babies are more ‘valued’ than female babies, human intervention (once the sex is known) causes the skew. The countries that espouse this value system make up 36% of the world’s population. As one example, at the peak of the imbalance in 2005, there were nearly 119 male babies born per 100 females in China. In India, the ratio is 110 male babies per 100 female babies, accounting for close to 400,000 lost female births per year. It is important to note that this pattern is not limited to these two countries, and for this reason many countries have banned the use of pre-natal diagnostic techniques for sex detection.
What is interesting and exciting, albeit rather frustrating, about this is that we cannot do a controlled experiment to test any of our hypotheses when it comes to sex ratios at birth. The best we can do is try to overlay hypotheses on existing datasets.
So, while many of us crave certainty in life, this example tells us that certainty often isn’t possible. Even if you do have good data to work with, sometimes answers remain elusive. It also means I have to find a new example to demonstrate the difference between row and column percentages.