Mean vs Median – Which is Better?

INTRODUCTION

This blog post describes the importance of where we should be focussing our attention when it comes to making business decisions based on average metrics such as sales revenue, employee salary, or even house prices.

Not a typical Power BI blog post but extremely important to know the difference between mean and median if you work in analytics, or if you wish to present data using Power BI as your analytical tool of choice.

WHAT'S THE DIFFERENCE?

To calculate the MEAN of a metric, we take the sum of all values and then divide that total by the number of observations.

To calculate the MEDIAN, we must first sort the values in ascending order and then seek the middle value of those observations (so 50% of the values will be below, and 50% will be above).

WHY DOES IT MATTER?

Outliers can skew things massively! Fact.

Let’s imagine that we want to know the average revenue generated by our sales team, for all of our products that were sold last year.

Now, if all of the products were of equal or similar value, then we probably wouldn’t have a problem, and could therefore reliably use the mean as our average revenue.

However, if we had a scenario where one of the products was of particular high value, then the mean figure would be skewed and would possibly yield a very misleading average benchmark. This is because the calculated average will be higher than the central tendency due to this outlier. For this reason, you should use the median instead, as this would never be affected by outliers, whether high or low value. 

Both standard deviation and interquartile range also play a huge part here.

This is the actual spread, or distribution of the data points, in this instance, product revenue.

So, if the standard deviation is a particularly large figure then this will be because of outliers and values which are much further away from the average figure. Ideally we are looking for low figures. Unfortunately there is no maximal standard deviation benchmark but lower numbers are usually better.

The interquartile range (IQR) is calculated by finding the median of the group of values below the median of the full data sample, then find the median of the upper range, then calculate the difference between these 2 medians.

You’ll discover again that the IQR is often a more suitable measure when looking at your revenue distribution here, because of the way that large outliers do not affect the IQR like standard deviation can. 

To illustrate this, look at the example graphic below of three different bell curves and the differing distribution of values.
Let’s imagine that this graphic represents 3 different scenarios of our product revenue for last year. 
 
Chart (a) is telling us that the mean (average) is low, and the median is high. This is because the distribution of product sales has a high central tendency and a few products have really low revenue, therefore skewing the average figure in a negative direction.
 
Chart (c) is the opposite. The average value in this instance is high because of the few products that lie in the high revenue area, and the main bulk of products fall lower down the revenue scale.
 
Chart (b) is known as a perfect bell curve or normal distribution, and this is where both the average and median vales are identical because of the symmetry of distributed product revenue values, thus no skew whatsoever.
 
Believe it or not, this is a very common real-world scenario where sometimes we are being misled by what we’re reading or calculating so care should be taken with the data preparation and/or presentation.