A key concept in Data Science: the difference between correlation and causality

February 17, 2021

Artificial Intelligence is a big subject. Everyone talks about it and everyone wants to use it: it’s trendy. It can crunch enormous quantities of data and discover patterns that no human mind can spot: it’s powerful.
It promises to make your day with unprecedented performance. It’s an oracle.
The combination of such attributes is so disarming that people tend to fall into a Dionysian trance and say to the AI: here is the data, do whatever is needed and give me the results.
But what’s trendy gets abused if it’s applied at random. What’s powerful tends to be uncontrollable without proper supervision. And oracles can lead to disasters if misinterpreted.

In our previous article we already stated the importance of our mixture of Data Science and Market Knowledge. Today we focus on one key concept in Data Science, the difference between correlation and causality, working through toy examples that use temperatures, power consumption and electricity prices.

As you can imagine, the more power consumption grows, the more electricity prices rise. This is correlation.
This behavior is easily explained by the laws of the market: the more a product is in demand, the more its price rises, at least in the short term. This is causality.
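
As a rough illustration, the correlation described above can be measured on synthetic data. Everything numeric in this sketch (the load range, the price slope, the noise level) is an assumption invented for the example, not market data, and `pearson` is a small helper written here for the purpose.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Assumed toy data: consumption in GW, price in EUR/MWh rising with consumption.
consumption = [rng.uniform(20.0, 60.0) for _ in range(200)]
price = [30.0 + 0.8 * c + rng.gauss(0.0, 3.0) for c in consumption]

r_forward = pearson(consumption, price)
r_backward = pearson(price, consumption)
print(f"correlation(consumption, price) = {r_forward:.2f}")
print(f"correlation(price, consumption) = {r_backward:.2f}")
```

The two printed values are the same, because correlation is symmetric. The number alone cannot say whether consumption drives prices or the other way round; that direction comes from knowing how the market works.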

Now let’s consider a deeper example.

You build a model to predict household electricity prices, and you train it on Summer data. Because the data shows that prices rise with temperatures, your model will take this correlation as an established fact.
But, as you can imagine, this model will simply fail when applied to Winter data. In fact, the temperature-price correlation reflects a direct causal effect only in limited regimes. In Summer, prices grow with temperatures because air conditioners need more energy to mitigate the heat; in Winter you see the opposite behavior, because heating systems draw more power as temperatures get colder.
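
The failure described above can be reproduced with a minimal sketch. It assumes an invented price rule in which prices rise with the distance from a comfort temperature of 18 °C; `COMFORT_TEMP`, `toy_price`, the temperature ranges and all coefficients are hypothetical choices made for the illustration.

```python
import random

COMFORT_TEMP = 18.0  # assumed temperature (deg C) needing neither heating nor cooling

def toy_price(temp, rng):
    """Invented price rule: rises with the distance from the comfort temperature."""
    return 50.0 + 2.0 * abs(temp - COMFORT_TEMP) + rng.gauss(0.0, 1.0)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute gap between the fitted line and the observed values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)
summer_temps = [rng.uniform(20.0, 35.0) for _ in range(300)]
winter_temps = [rng.uniform(-5.0, 15.0) for _ in range(300)]
summer_prices = [toy_price(t, rng) for t in summer_temps]
winter_prices = [toy_price(t, rng) for t in winter_temps]

# The model only ever sees Summer data.
summer_model = fit_line(summer_temps, summer_prices)
# For comparison only: the slope actually present in the Winter data.
winter_slope, _ = fit_line(winter_temps, winter_prices)

error_on_summer = mean_abs_error(summer_model, summer_temps, summer_prices)
error_on_winter = mean_abs_error(summer_model, winter_temps, winter_prices)

print(f"slope learned on Summer data: {summer_model[0]:+.2f}")
print(f"slope present in Winter data: {winter_slope:+.2f}")
print(f"Summer model, error on Summer: {error_on_summer:.1f}")
print(f"Summer model, error on Winter: {error_on_winter:.1f}")
```

The slope learned in Summer has the opposite sign to the one present in Winter, so the Summer model extrapolates in the wrong direction as soon as the season changes.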

To accommodate this, you can choose to feed the model with data from all the seasons, letting it learn both correlations and apply the appropriate one. Note that such a model has learned only the correlations; it still ignores the causality. This approach does not require any effort, and follows the philosophy of letting AI deal with issues without asking anything. In essence, it will consider two different relations for the data, depending on the current season. In most cases it will work. But how will the model behave when faced with warmer weeks in Winter, or colder weeks in Summer? Will you trust it?
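
To make the question concrete, here is a hedged sketch of such a season-switching model, reduced to one fitted line per season. The price rule, the 18 °C comfort temperature and the test temperatures are assumptions invented for the example; a real model trained on all seasons would be more complex, but the same kind of blind spot can appear.

```python
import random

COMFORT_TEMP = 18.0  # assumed temperature (deg C) needing neither heating nor cooling

def expected_price(temp):
    """Invented noise-free price rule used to generate and to judge the toy data."""
    return 50.0 + 2.0 * abs(temp - COMFORT_TEMP)

def make_data(low, high, n, rng):
    """Draw n temperatures in [low, high] and the matching noisy toy prices."""
    temps = [rng.uniform(low, high) for _ in range(n)]
    prices = [expected_price(t) + rng.gauss(0.0, 1.0) for t in temps]
    return temps, prices

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

rng = random.Random(7)

# One relation per season: the model picks a line according to the calendar.
models = {
    "summer": fit_line(*make_data(20.0, 35.0, 300, rng)),
    "winter": fit_line(*make_data(-5.0, 15.0, 300, rng)),
}

def predict(season, temp):
    slope, intercept = models[season]
    return slope * temp + intercept

def error(season, temp):
    """Absolute gap between the seasonal prediction and the noise-free rule."""
    return abs(predict(season, temp) - expected_price(temp))

typical_winter_error = error("winter", 5.0)   # ordinary Winter day
warm_winter_error = error("winter", 24.0)     # unusually warm Winter week
cold_summer_error = error("summer", 12.0)     # unusually cold Summer week

print(f"typical Winter day (5 C): error {typical_winter_error:.1f}")
print(f"warm Winter week (24 C):  error {warm_winter_error:.1f}")
print(f"cold Summer week (12 C):  error {cold_summer_error:.1f}")
```

In this toy setting the seasonal model is accurate on ordinary days and badly off on the out-of-season weeks, because it applies the relation tied to the calendar rather than the one tied to the actual heating or cooling need.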

That’s why in most cases it is better to build new variables based on the available ones, a process called “feature engineering”. In this case, for example, one could build a “household thermal consumption” variable that responds appropriately to temperature variations, i.e. a variable with a direct causal effect on prices.
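
A minimal sketch of this idea follows. The function `thermal_demand` is a hypothetical stand-in for the “household thermal consumption” variable, defined here as the distance from an assumed comfort temperature of 18 °C; in practice such a feature would come from physical knowledge or metered data. The synthetic prices are generated from the same rule, so the good fit is by construction and only the mechanics are being shown.

```python
import random

COMFORT_TEMP = 18.0  # assumed temperature (deg C) needing neither heating nor cooling

def thermal_demand(temp):
    """Hypothetical engineered feature: proxy for household thermal consumption."""
    return abs(temp - COMFORT_TEMP)

def toy_price(temp, rng):
    """Invented price rule: driven by thermal demand, plus noise."""
    return 50.0 + 2.0 * thermal_demand(temp) + rng.gauss(0.0, 1.0)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute gap between the fitted line and the observed values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(3)
summer_temps = [rng.uniform(20.0, 35.0) for _ in range(300)]
winter_temps = [rng.uniform(-5.0, 15.0) for _ in range(300)]
summer_prices = [toy_price(t, rng) for t in summer_temps]
winter_prices = [toy_price(t, rng) for t in winter_temps]

# Both models are trained on Summer data only; they differ in the input variable.
raw_model = fit_line(summer_temps, summer_prices)
engineered_model = fit_line([thermal_demand(t) for t in summer_temps], summer_prices)

# Both are then evaluated on Winter data they have never seen.
raw_mae = mean_abs_error(raw_model, winter_temps, winter_prices)
engineered_mae = mean_abs_error(
    engineered_model, [thermal_demand(t) for t in winter_temps], winter_prices
)

print(f"raw temperature model, Winter error: {raw_mae:.1f}")
print(f"thermal demand model, Winter error:  {engineered_mae:.1f}")
```

The design choice is to move the seasonal knowledge out of the model and into the variable: once the input is the quantity that actually drives prices, a relation learned in one season carries over to the other.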

By using this new variable the model will be robust and trustworthy.

Vincenzo Lavorini
Lead Data Scientist at COR-e

1 rue Hoche

83000 Toulon, France

client@cor-e.fr