This notebook supplements one of my blog posts, "Benford's Law as a Fraud Detection technique". You can follow along without reading the post; you only need to know what Benford's law says. In this notebook we'll walk through a Benford analysis and, in doing so, check whether our dataset (historical temperatures of India) follows Benford's law.
I can't find the link for the dataset used. It was an open dataset that I downloaded some 7-8 months back, and I downloaded only the India category. It contains temperatures from the 1700s to the present (with a lot of NaNs for the earlier periods).
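As a quick reminder of what the law says, here is a minimal sketch that computes the expected first-digit proportions from the usual formula P(d) = log10(1 + 1/d). The name `benford` is only illustrative; the rounded values are the ones drawn as the red line in the plots of this notebook.

```python
import math

# Expected first-digit proportions under Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, proportion in benford.items():
    print(digit, round(proportion, 3))
```

Digit 1 is expected about 30% of the time and digit 9 under 5%, which is the pattern each column is compared against.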
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data_dir = "Data/Climate Change/"
files = [
"WorldTemperaturesClean.csv",
"GlobalLandTemperaturesByCity.csv",
"GlobalLandTemperaturesByMajorCity.csv",
"GlobalLandTemperaturesByState.csv",
"GlobalLandTemperatures.csv"
]
wtc = pd.read_csv(data_dir + files[0], dtype=str)
wtc.head()
print(wtc.shape)
print("Column description-\n")
print(wtc.LandMaxTemperature.astype(float).describe())
fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
# Frequency distribution plot
ax = axs[0]
ax.hist(wtc.LandMaxTemperature.astype(float), density=True)  # `density` replaces the removed `normed` argument
ax.set_title("Frequency Distribution")
# First digit frequencies
freq = wtc.LandMaxTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq = freq/freq.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print("First digit frequencies-\n")
print(freq.sort_index())
# First frequency digit plot
ax = axs[1]
ax.bar(freq.index.astype(int), freq)
ax.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red')
ax.set_title("First digit frequencies")
plt.xticks(freq.index.astype(int))
# Main title
plt.suptitle("Variable: Land Max Temperature")
plt.show()
The plots above show that this column (LandMaxTemperature) is not a good candidate for Benford analysis. The variable only ranges from about 6 to 22, which is too narrow for Benford's law: there is not enough variance in the data. The distribution is also not uniform, and there is simply not enough data.
The second plot shows the first-digit frequencies. They do not follow Benford's law (red line). Digits 3 and 4 are not even present in the data.
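The first-digit computation is repeated for every column in this notebook, so it can be wrapped in a small helper. This is a sketch in plain Python, not the code used for the plots; `first_digit` and `first_digit_frequencies` are hypothetical names. It only iterates over its input, so it should accept a list or a pandas Series of numeric strings.

```python
from collections import Counter


def first_digit(text):
    """Return the first non-zero digit in a numeric string, or None if there is none."""
    for ch in str(text):
        if ch in "123456789":
            return int(ch)
    return None  # e.g. "0.0" or "nan"


def first_digit_frequencies(values):
    """Proportion of each first digit 1..9 among the values that have one."""
    counts = Counter(first_digit(v) for v in values)
    counts.pop(None, None)  # drop zeros and missing values
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no value has a non-zero digit")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Made-up sample values, only to show the call
sample = ["21.3", "-0.52", "6.1", "0.0", "nan", "18.9"]
print(first_digit_frequencies(sample))
```

Dividing by the number of values that actually have a first digit keeps the proportions summing to 1 even when the column contains NaNs.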
print("Column description-")
wtc.LandMinTemperature.astype(float).abs().describe()
fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
# Frequency distribution plot
ax = axs[0]
ax.hist(wtc.LandMinTemperature.astype(float).abs(), density=True)
ax.set_title("Frequency Distribution")
# First digit frequencies
freq = wtc.LandMinTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq = freq/freq.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print("First digit frequencies-\n")
print(freq.sort_index())
# First frequency digit plot
ax = axs[1]
ax.bar(freq.index.astype(int), freq)
ax.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red')
ax.set_title("First digit frequencies")
plt.xticks(freq.index.astype(int))
# Main title
plt.suptitle("Variable: Land Min Temperature")
plt.show()
Here all the digits appear in the first-digit frequency plot, but they still do not follow Benford's law closely. Looking carefully, the frequency distribution and the first-digit frequency plot are almost the same. That is because the absolute values range from 0 to 9, the same as the digits in the first-digit plot except for 0. Again: not enough variance, no uniform distribution, and a small dataset.
Clearly, this column (LandMinTemperature) is not a good candidate for Benford's law either. The red line makes that plain.
print("Column description-")
wtc.LandAverageTemperature.astype(float).abs().describe()
fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
# Frequency distribution plot
ax = axs[0]
ax.hist(wtc.LandAverageTemperature.astype(float).abs(), density=True)
ax.set_title("Frequency Distribution")
# First digit frequencies
freq = wtc.LandAverageTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq = freq/freq.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print("First digit frequencies-\n")
print(freq.sort_index())
# First frequency digit plot
ax = axs[1]
ax.bar(freq.index.astype(int), freq)
ax.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red')
ax.set_title("First digit frequencies")
plt.xticks(freq.index.astype(int))
# Main title
plt.suptitle("Variable: Land Average Temperature")
plt.show()
All the digits are present for comparison. No uniform distribution can be seen. The data ranges from 0 to 15, so again there is not enough variance, and the dataset is small as well.
The first-digit frequency plot loosely follows the trend but still fails (red line). The frequencies of digits 1 and 2 are far from the expected values, and digits 6 and 7 should be higher than 8 and 9.
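Judging the fit by eye against the red line can be backed up with a number. The sketch below shows two common summaries, the mean absolute deviation from the expected proportions and Pearson's chi-square statistic. The function names are hypothetical and the observed proportions are made up for illustration; they are not results from this dataset.

```python
import math

# Expected proportions for digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def mean_absolute_deviation(observed):
    """Average |observed - expected| over the nine digits (0 means a perfect fit)."""
    return sum(abs(o - e) for o, e in zip(observed, BENFORD)) / 9


def chi_square(observed, n):
    """Pearson chi-square statistic for n observations; observed holds proportions."""
    return sum(n * (o - e) ** 2 / e for o, e in zip(observed, BENFORD))


# Illustrative proportions only: every digit equally common
uniform_digits = [1 / 9] * 9
print(mean_absolute_deviation(uniform_digits))
print(chi_square(uniform_digits, n=900))
```

With nine digits the chi-square statistic has 8 degrees of freedom, so a value above roughly 15.5 rejects conformity at the 5% level. With large samples the chi-square test flags even tiny deviations, which is why the mean absolute deviation is often reported alongside it.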
glt = pd.read_csv(data_dir + files[-1], dtype=str)
glt.head()
print(glt.shape)
print("Column description-")
glt.AverageTemperature.astype(float).describe()
# wtc.LandAverageTemperature.astype(str).str[0].unique()
freq_avg = glt.AverageTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq_avg = freq_avg/freq_avg.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print(freq_avg.sort_index())
Here we don't even have all the digits to compute frequencies for. Moving on.
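A check like this can be made explicit before plotting. The helper below is a small sketch (the name `missing_first_digits` is hypothetical) that lists which of the digits 1 to 9 never occur as a first digit. It accepts digits as strings or integers, so something like the index of `freq_avg` could be passed in.

```python
def missing_first_digits(observed_digits):
    """Return the digits 1..9 that do not appear in observed_digits."""
    seen = {int(d) for d in observed_digits}
    return sorted(set(range(1, 10)) - seen)


# Made-up example: only three first digits occur
print(missing_first_digits(["1", "2", "3"]))
```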
glts = pd.read_csv(data_dir + files[-2], dtype=str)
glts.head()
glts.shape
print("Column description-")
glts.AverageTemperature.astype(float).abs().describe()
fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
# Frequency distribution plot
ax = axs[0]
ax.hist(glts.AverageTemperature.astype(float).abs().dropna(), density=True)
ax.set_title("Frequency Distribution")
# First digit frequencies
freq = glts.AverageTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq = freq/freq.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print("First digit frequencies-\n")
print(freq.sort_index())
# First frequency digit plot
ax = axs[1]
ax.bar(freq.index.astype(int), freq)
ax.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red')
ax.set_title("First digit frequencies")
plt.xticks(freq.index.astype(int))
# Main title
plt.suptitle("Variable: Global Land Temperature by State - India")
plt.show()
Here the situation is a bit different. The dataset is fairly large, but the frequency distribution is left-skewed; a right-skewed or uniform distribution would have been more helpful. The data also has a small variance, which limits the range of magnitudes.
Benford's law isn't followed closely here either, as the red line shows.
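The point about magnitude can be checked directly. Benford's law is generally expected only when values span several orders of magnitude, and that span can be measured as log10(max / min) of the absolute values. This is a rough sketch with a hypothetical function name and made-up example values.

```python
import math


def orders_of_magnitude(values):
    """Return log10(max / min) over the non-zero, non-NaN absolute values."""
    magnitudes = [abs(v) for v in values if v == v and v != 0]  # v == v is False for NaN
    return math.log10(max(magnitudes) / min(magnitudes))


# Made-up values in a temperature-like range versus a wide-ranging series
print(orders_of_magnitude([6.0, 10.5, 22.0]))
print(orders_of_magnitude([1.0, 250.0, 1000.0]))
```

A result below 1 means the values do not even cover a full power of ten, so some first digits are bound to be over-represented or missing.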
gltmc = pd.read_csv(data_dir + files[-3], dtype=str)
gltmc.head()
gltmc.shape
print("Column description-")
gltmc.AverageTemperature.astype(float).abs().describe()
# wtc.LandAverageTemperature.astype(str).str[0].unique()
freq_avg = gltmc.AverageTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq_avg = freq_avg/freq_avg.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print(freq_avg.sort_index())
Here again we have only 3 distinct first digits, so there is no sense in running the analysis.
gltc = pd.read_csv(data_dir + files[1], dtype=str)
gltc.head()
gltc.shape
print("Column description-")
gltc.AverageTemperature.astype(float).abs().describe()
fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
# Frequency distribution plot
ax = axs[0]
ax.hist(gltc.AverageTemperature.astype(float).abs().dropna(), density=True)
ax.set_title("Frequency Distribution")
# First digit frequencies
freq = gltc.AverageTemperature.str.replace("-", "").str.lstrip("0").str.lstrip(".").str.lstrip("0").str[0].value_counts()
freq = freq/freq.sum()  # proportions sum to 1 (NaNs and zeros have no first digit)
print("First digit frequencies-\n")
print(freq.sort_index())
# First frequency digit plot
ax = axs[1]
ax.bar(freq.index.astype(int), freq)
ax.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red')
ax.set_title("First digit frequencies")
plt.xticks(freq.index.astype(int))
# Main title
plt.suptitle("Variable: Global Land Temperature by City - India")
plt.show()
This is the same as GlobalLandTemperaturesByState for India: a left-skewed distribution and small variance. Benford's law doesn't hold here either.
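The link between a narrow range and a poor Benford fit can be illustrated with synthetic data. The sketch below is not based on the temperature files. It draws one sample uniformly between 6 and 22 (a temperature-like range chosen for illustration) and one from a log-normal distribution that spans many orders of magnitude, then compares each sample's first-digit proportions with the expected ones. All names and parameters are assumptions.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """First significant digit of a non-zero number, read from scientific notation."""
    return int(f"{abs(x):e}"[0])


def mad_from_benford(values):
    """Mean absolute deviation between observed and expected first-digit proportions."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum(abs(counts.get(d, 0) / n - p) for d, p in BENFORD.items()) / 9


narrow = [random.uniform(6, 22) for _ in range(10_000)]        # less than one order of magnitude
wide = [random.lognormvariate(0, 3) for _ in range(10_000)]    # many orders of magnitude

print("narrow range:", mad_from_benford(narrow))
print("wide range:  ", mad_from_benford(wide))
```

The narrow sample deviates far more from the expected proportions than the wide one, which matches what the temperature columns show.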
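So it turns out that this dataset doesn't follow Benford's law. Some of the obvious reasons, as seen in the sections above, are the narrow range of the values (too little variance to cover multiple magnitudes), distributions that are neither uniform nor right-skewed, and the small size of some of the files.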
The source of this and other notebooks can be found in this GitHub repo: Notebooks.