This notebook supplements one of my blog posts - Benford's Law as a Fraud Detection technique. You should be able to follow along without reading the blog post, though; you just need to know what Benford's law says. In this notebook, we'll see how to do a Benford's analysis and, by doing so, check whether our dataset (food prices collected by the UN World Food Programme) follows Benford's law.
The data came from a mailing list I follow - Data is Plural by Jeremy Singer-Vine.
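For a quick refresher, Benford's law says that a first digit d shows up with probability log10(1 + 1/d). Here is a small sketch of that formula; the function name `benford_expected` is mine and is not used anywhere else in the notebook:

```python
import math

def benford_expected(digit):
    """Expected share of numbers whose first significant digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

# Expected share for each possible first digit
expected = {d: benford_expected(d) for d in range(1, 10)}
for d, p in expected.items():
    print(d, round(p, 3))
```

Rounded to three decimals, these are the theoretical frequencies plotted as the red line near the end of the notebook.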
The UN World Food Programme’s vulnerability analysis group collects and publishes food price data for more than 1,000 towns and cities in more than 70 countries. The dataset, which goes back more than a decade, covers basic staples, such as wheat, rice, milk, oil, and more. It’s updated monthly and feeds into (among other things) the UNWFP’s price-spike indicators.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.3f' % x)
data_file = "Data/Food Prices/WFPVAM_FoodPrices_24-7-2017.csv"
data = pd.read_csv(data_file)
data.head()
data.shape
So, our data has a fairly good number of records to perform a Benford's analysis.
data.dtypes
print("All the countries present-")
data.adm0_name.unique()
That's a total of 74 countries!
plt.figure(figsize=(10,5))
data.cm_name.value_counts().head(20).plot(kind='bar')
plt.xticks(rotation=70)
plt.title("Top 20 items by number of records")
plt.show()
In total there were 300+ food items in this list.
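Counts like the ones quoted above (74 countries, 300+ items) can be checked directly with `nunique()`. A minimal sketch, using a tiny made-up frame in place of the real `data`; only the column names `adm0_name` and `cm_name` come from the dataset, the rows are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook you would use `data` instead.
sample = pd.DataFrame({
    "adm0_name": ["India", "India", "Tajikistan", "Guinea"],
    "cm_name": ["Rice", "Wheat", "Rice", "Fonio"],
})

# Number of distinct countries and distinct items
n_countries = sample["adm0_name"].nunique()
n_items = sample["cm_name"].nunique()
print(n_countries, n_items)
```

On the real data this would be `data.adm0_name.nunique()` and `data.cm_name.nunique()`.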
plt.figure(figsize=(10,5))
year_stats = data.mp_year.value_counts()
plt.bar(year_stats.index, year_stats)
plt.title("Year-wise total record collection")
plt.show()
I thought the record count would keep increasing (or perhaps flatten out) as time progressed, but after a peak in 2015, it dropped in 2016 to the 2013 level. The year 2017 was not yet over at the time of writing.
data.mp_price.describe()
The prices are spread over a very large range. As can be seen above, the minimum is 0 units and the maximum is 5.8 million units.
For the curious, the country and item pairs where the price is 0 units are-
# Country, item and year of every record priced at the minimum (0 units)
minimums = data.loc[data.mp_price == data.mp_price.min(), ['adm0_name', 'cm_name', 'mp_year']].drop_duplicates()
minimums.columns = ["Country", "Items", "Year"]
minimums.reset_index(drop=True)
I have zero idea what Fonio is. Entry 2 is wild - Fish (snake head). The 6th and 7th entries seem out of place. If they are not, then Tajikistan is the place to get really cheap labour. I'll have to look at the data to drill down further, but I am leaving that for some other time.
Let's see what the max price is for.
# Country, item and year of every record priced at the maximum
maximums = data.loc[data.mp_price == data.mp_price.max(), ['adm0_name', 'cm_name', 'mp_year']].drop_duplicates()
maximums.columns = ["Country", "Items", "Year"]
maximums.reset_index(drop=True)
Meh, not that interesting.
Anyway, I should mention that this highest price may not be the highest monetary value in the dataset. All of these figures are given in the currencies of their respective countries, and I haven't converted them to a common currency. And I won't. For Benford's analysis, I want the data as natural as I can get.
plt.figure(figsize=(15,7))
# Plain counts on a linear axis; the old `normed` argument was removed in Matplotlib 3.1
plt.hist(data.mp_price[data.mp_price < 1000], bins=200)
plt.title("Histogram - prices < 1000")
plt.show()
The above plot shows right-skewed data with a very looooong right tail. In the plot I have capped the prices at 1000 units, but they go all the way up to 5.8 million units. Data spread over several orders of magnitude like this is where Benford's law usually holds, so I am guessing Benford's analysis will work well here.
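One rough way to put a number on that long tail is to count how many powers of ten separate the smallest and largest positive price. This is only a sketch: the helper name `orders_of_magnitude` is mine, and the prices in the example are made up rather than taken from the dataset.

```python
import math

def orders_of_magnitude(values):
    """Number of powers of ten between the smallest and largest positive value."""
    positive = [v for v in values if v > 0]  # zeros and NaNs have no logarithm
    return math.log10(max(positive) / min(positive))

# Hypothetical prices, not taken from the dataset
print(orders_of_magnitude([0, 0.5, 12, 300, 5000]))
```

In the notebook this could be called as `orders_of_magnitude(data.mp_price)`; the wider the span, the more room the first digits have to spread out.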
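# First digit frequencies: drop the sign, leading zeros and the decimal point, then take the first character
first_digits = data.mp_price.astype(str).str.replace("-", "").str.lstrip("0.").str[0]
# normalize=True divides by the number of prices that actually have a first digit (a price of 0 has none)
freq = first_digits.value_counts(normalize=True)
print("First digit frequencies-\n")
print(freq.sort_index()*100)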
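The chained string operations above are compact but a bit hard to read. Here is the same idea written as a plain Python function, so the edge cases are easier to see. The name `first_digit` is just for illustration and is not used elsewhere in the notebook:

```python
def first_digit(value):
    """Return the first significant digit of a number, or None if it has none."""
    text = repr(abs(float(value)))   # e.g. '0.05', '5800000.0', '1e-05'
    text = text.lstrip("0.")         # drop leading zeros and the decimal point
    if text and text[0].isdigit():
        return int(text[0])
    return None                      # zero, NaN and infinity have no first digit

for price in [0.05, 7.5, 5800000, 0]:
    print(price, first_digit(price))
```

Working on the printed form of the number avoids the rounding surprises you can run into when dividing by powers of ten.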
# First digit frequency plot
plt.figure(figsize=(10,5))
plt.bar(freq.index.astype(int), freq, label='Calculated Frequency')
plt.plot(range(1, 10), [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046], c='red', label='Theoretical Frequency')
plt.legend(loc='upper right')
plt.title("First digit frequencies")
plt.xticks(freq.index.astype(int))
plt.show()
It does follow Benford's law. Nice.
The source of this and other notebooks can be found in this GitHub repo - Notebooks.
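Eyeballing the plot is fine for a blog post, but the fit can also be summarised in one number. One simple option (my choice here, not something computed in the original analysis) is the mean absolute deviation between the observed shares and the Benford shares. A sketch, with `benford_mad` being a hypothetical helper name:

```python
import math

def benford_mad(observed):
    """Mean absolute deviation between observed first-digit shares and Benford's law.

    `observed` maps each digit (1-9, as int or str) to its share of the data.
    """
    shares = {int(d): share for d, share in observed.items()}
    total = 0.0
    for digit in range(1, 10):
        expected = math.log10(1 + 1 / digit)
        # A digit missing from `observed` counts as a share of 0
        total += abs(shares.get(digit, 0.0) - expected)
    return total / 9
```

In the notebook this could be called as `benford_mad(freq.to_dict())`, assuming the index of `freq` holds only the digits 1 to 9. The closer the result is to 0, the closer the data sits to the theoretical line.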