Outline
The normal distribution is also called the bell curve or Gaussian distribution. The position of the bell's peak marks the mean, and the width of the bell represents the spread of values (the standard deviation). Thus, the shape changes as we change mu (\(\mu\)) and sigma (\(\sigma\)). Here, \(\mu\) is the mean or average of the sample, and \(\sigma\) is the standard deviation. We denote a normal distribution as:
\[{\mathcal {N}}(\mu ,\sigma ^{2})\]
Find more details about the normal distribution on Wikipedia. Here are two ways of defining a normal distribution in Python.
from statistics import NormalDist
mu, sigma = 5, .5
norm_dist = NormalDist(mu, sigma)
import scipy.stats as stats
mu, sigma = 5, .5
norm_dist = stats.norm(mu, sigma)
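Both definitions describe the same distribution; a quick sanity check that the stdlib and scipy objects agree:

```python
from statistics import NormalDist

import scipy.stats as stats

mu, sigma = 5, 0.5
d_stdlib = NormalDist(mu, sigma)
d_scipy = stats.norm(mu, sigma)

# The two implementations agree on the density at any point.
print(abs(d_stdlib.pdf(5.5) - d_scipy.pdf(5.5)) < 1e-9)  # True
```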
We get a lognormal distribution when we exponentiate a normally distributed variable. The result is a lopsided, right-skewed curve: there is a longer tail on the right side, where larger values occur. We denote the lognormal distribution as follows:
\[{\displaystyle X\sim \operatorname {Lognormal} \left(\mu _{x},\sigma _{x}^{2}\right)}\]
Since the log of the lognormal distribution is a normal distribution, we can denote the relationship as follows:
\[{\displaystyle \ln(X)\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\]
Find more details about the lognormal distribution on Wikipedia. We define a lognormal distribution in Python as follows (the Python stdlib does not have a lognormal implementation).
import numpy as np
import scipy.stats as stats
mu, sigma = 5, .5
lognorm_dist = stats.lognorm(s=sigma, scale=np.exp(mu))
Note: scipy.stats.lognorm takes the mu and sigma of the underlying normal distribution from which we derive the lognormal distribution. For the scale parameter, we pass the exponential of the mean of the normal distribution. I found the documentation inadequate in explaining the parameters. This SO question has answers that discuss the meaning of the parameters.
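A quick way to convince yourself the parameterization is right: for Lognormal(\(\mu,\sigma^{2}\)), the median is \(e^{\mu}\) and the mean is \(e^{\mu+\sigma^{2}/2}\), and the scipy object built as above reproduces both.

```python
import numpy as np
import scipy.stats as stats

mu, sigma = 5, 0.5
d = stats.lognorm(s=sigma, scale=np.exp(mu))

# The median of Lognormal(mu, sigma^2) is exp(mu) -- exactly the scale.
print(np.isclose(d.median(), np.exp(mu)))               # True
# The mean is exp(mu + sigma^2 / 2).
print(np.isclose(d.mean(), np.exp(mu + sigma**2 / 2)))  # True
```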
Here is how both the distributions look for the same mu (\(\mu\)) and sigma (\(\sigma\)).
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from statistics import NormalDist
# all distributions
mu, sigma = 5, .5
norm_d1 = NormalDist(mu, sigma)
lognorm_d1 = stats.lognorm(s=sigma, scale=np.exp(mu))
# stash the underlying params on the frozen dist for labelling later
lognorm_d1.mu, lognorm_d1.sigma = mu, sigma
mu, sigma = 5, 1
norm_d2 = NormalDist(mu, sigma)
lognorm_d2 = stats.lognorm(s=sigma, scale=np.exp(mu))
lognorm_d2.mu, lognorm_d2.sigma = mu, sigma
mu, sigma = 4, 0.3
norm_d3 = NormalDist(mu, sigma)
lognorm_d3 = stats.lognorm(s=sigma, scale=np.exp(mu))
lognorm_d3.mu, lognorm_d3.sigma = mu, sigma
# norm y (each family of curves gets its own x range)
norm_x = np.linspace(0, 10, 500)
norm_y1 = np.array([norm_d1.pdf(i) for i in norm_x])
norm_y2 = np.array([norm_d2.pdf(i) for i in norm_x])
norm_y3 = np.array([norm_d3.pdf(i) for i in norm_x])
# lognorm y
lognorm_x = np.linspace(0, 800, 500)
lognorm_y1 = np.array([lognorm_d1.pdf(i) for i in lognorm_x])
lognorm_y2 = np.array([lognorm_d2.pdf(i) for i in lognorm_x])
lognorm_y3 = np.array([lognorm_d3.pdf(i) for i in lognorm_x])
# Set the figsize
fig1, ax1 = plt.subplots(figsize=(6, 4))
ax1.plot(norm_x, norm_y1, label=f"mu = {norm_d1.mean}; sigma = {norm_d1.stdev}")
ax1.plot(norm_x, norm_y2, label=f"mu = {norm_d2.mean}; sigma = {norm_d2.stdev}")
ax1.plot(norm_x, norm_y3, label=f"mu = {norm_d3.mean}; sigma = {norm_d3.stdev}")
ax1.legend()
fig2, ax2 = plt.subplots(figsize=(6, 4))
ax2.plot(lognorm_x, lognorm_y1, label=f"mu = {lognorm_d1.mu}; sigma = {lognorm_d1.sigma}")
ax2.plot(lognorm_x, lognorm_y2, label=f"mu = {lognorm_d2.mu}; sigma = {lognorm_d2.sigma}")
ax2.plot(lognorm_x, lognorm_y3, label=f"mu = {lognorm_d3.mu}; sigma = {lognorm_d3.sigma}")
ax2.legend()
plt.show()
fig1.savefig('norm_dist.svg', format='svg', dpi=1200, bbox_inches='tight')
fig2.savefig('lognorm_dist.svg', format='svg', dpi=1200, bbox_inches='tight')
Instead of NormalDist.pdf(), we can also use numpy.random.Generator.normal to get a normal distribution sample and plot a histogram. Similarly, for the lognormal distribution, instead of stats.lognorm.pdf(), we can use numpy.random.Generator.lognormal.
As mentioned in the previous section, the log of the lognormal distribution is a normal distribution. So, if \({\displaystyle X\sim \operatorname {Lognormal} \left(\mu _{x},\sigma _{x}^{2}\right)}\), then \({\displaystyle \ln(X)\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\).
Let us understand this by code.
import numpy as np
rng = np.random.default_rng()
mu, sigma = 5, .5
lognorm_samples = rng.lognormal(mu, sigma, 10000)
# take the log of lognorm samples to derive the normal dist.
norm_samples = np.log(lognorm_samples)
print(norm_samples.mean(), norm_samples.std())
5.005339216906491 0.4934326302969564
The parameters (mean and std) of the derived normal distribution, printed at the end, are the same as the original parameters we provided to rng.lognormal.
import matplotlib.pyplot as plt
import scipy.stats as stats
# log normal dist
fig1, ax1 = plt.subplots(figsize=(5, 3))
ax1.hist(lognorm_samples, bins=50, alpha=0.7, density=True, color="orange")
x1 = np.linspace(0, 800, 500)
lognorm_d = stats.lognorm(s=sigma, scale=np.exp(mu))
lognorm_y = np.array([lognorm_d.pdf(i) for i in x1])
ax1.plot(x1, lognorm_y, label=f"mu = {mu}; sigma = {sigma}")
ax1.legend()
# normal dist
fig2, ax2 = plt.subplots(figsize=(5, 3))
ax2.hist(norm_samples, bins=50, alpha=0.7, density=True, color="orange")
x2 = np.linspace(0, 7, 500)
norm_d = stats.norm(mu, sigma)
norm_y = np.array([norm_d.pdf(i) for i in x2])
ax2.plot(x2, norm_y, label=f"mu = {mu}; sigma = {sigma}")
ax2.legend()
plt.show()
fig1.savefig('lognorm_dist2.svg', format='svg', dpi=1200, bbox_inches='tight')
fig2.savefig('norm_dist2.svg', format='svg', dpi=1200, bbox_inches='tight')
Conclusion: to convert from a lognormal to a normal distribution, take the logarithm of the lognormal sample.
If the logarithm of a lognormal distribution is normally distributed, then the reverse is also true: the exponential of a normal distribution gives us a lognormal distribution. In notation, if \({\displaystyle Y\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\), then \({\displaystyle \exp(Y)\sim \operatorname {Lognormal} \left(\mu _{x},\sigma _{x}^{2}\right)}\).
Let’s again understand this through code.
import numpy as np
import scipy.stats as stats
rng = np.random.default_rng()
mu, sigma = 5, .5
norm_samples = rng.normal(mu, sigma, 10000)
# take the exp of norm samples to derive the lognormal dist.
lognorm_samples = np.exp(norm_samples)
# fit a lognorm distribution to get the mean and std dev
shape, loc, scale = stats.lognorm.fit(lognorm_samples)
mean, stddev = np.log(scale), shape
print(mean, stddev)
4.984256782660331 0.5067622675605842
The parameters (mean and std) of the fitted lognormal distribution are the same as the original parameters we provided to the normal dist. Note that we used the scipy.stats.lognorm.fit method to fit the lognorm distribution on the data. It gives us three parameters: shape, loc, and scale. The shape is the same as the standard deviation. To get the mean, we take the logarithm of the scale. We did not have to do this when we converted the lognormal to a normal distribution (previous section) because there we could read the params (mean and std) off the samples directly. Read this SO answer for more details.
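A small refinement: scipy's fit accepts floc=0 to pin loc at zero, which restricts the fit to the standard two-parameter lognormal and usually gives a cleaner estimate. A sketch:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
mu, sigma = 5, 0.5
samples = np.exp(rng.normal(mu, sigma, 10_000))

# floc=0 fixes loc instead of estimating it, leaving only shape and
# scale to fit.
shape, loc, scale = stats.lognorm.fit(samples, floc=0)
print(loc)                                # exactly 0.0, by construction
print(round(np.log(scale), 2), round(shape, 2))  # close to mu and sigma
```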
# normal dist
fig1, ax1 = plt.subplots(figsize=(5, 3))
ax1.hist(norm_samples, bins=50, alpha=0.7, density=True, color="orange")
x1 = np.linspace(0, 7, 500)
norm_d = stats.norm(mu, sigma)
norm_y = np.array([norm_d.pdf(i) for i in x1])
ax1.plot(x1, norm_y, label=f"mu = {mu}; sigma = {sigma}")
ax1.legend()
# lognormal dist
fig2, ax2 = plt.subplots(figsize=(5, 3))
ax2.hist(lognorm_samples, bins=50, alpha=0.7, density=True, color="orange")
x2 = np.linspace(0, 800, 500)
lognorm_d = stats.lognorm(s=sigma, scale=np.exp(mu))
lognorm_y = np.array([lognorm_d.pdf(i) for i in x2])
ax2.plot(x2, lognorm_y, label=f"mu = {mu}; sigma = {sigma}")
ax2.legend()
plt.show()
fig1.savefig('norm_dist3.svg', format='svg', dpi=1200, bbox_inches='tight')
fig2.savefig('lognorm_dist3.svg', format='svg', dpi=1200, bbox_inches='tight')
Conclusion: to convert from a normal to a lognormal distribution, take the exponential of the normal sample.
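Putting the two conclusions together: since np.exp and np.log are exact inverses, converting a sample one way and then back recovers the original values.

```python
import numpy as np

rng = np.random.default_rng(0)
norm_samples = rng.normal(5, 0.5, 1000)

# normal -> lognormal -> normal round-trips exactly (up to floating point)
round_trip = np.log(np.exp(norm_samples))
print(np.allclose(round_trip, norm_samples))  # True
```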
We started with the Normal and Lognormal distributions and with their definition in Python. We converted each of the distributions into the other. It took me some effort to figure out how to do the conversion. With this post, I tried to demystify the confusion.
If you are interested in how other distributions look, your search is over. This SO answer has visualisations of all the distributions available in scipy.stats.
Update: 18th Jan: Someone asked me the following question on reddit.
For what purpose are you converting between normal and lognormal? The two functions share the same parameters but thats about it. ln(data) is a non-destructive transformation but the process can obscure patterns just as often as it reveals them. Certain advanced statistical tests that require a normal distribution cannot necessarily have the results applied to the lognormal data.
This stranger is correct that patterns are obscured, or rather, some other patterns come up after log transformation. Although, in my case, it did not matter.
I wanted to match the customers with the items that are within the customer spending range. The formulation was that if I have customer and outlet distributions, then I can match these distributions or get the overlap to get the match percentage. This match percentage will then be used on top of relevance scores.
Looking at the customer’s spend history, I saw that the distribution was lognormally distributed. A similar trend was observed in the restaurant’s order history. Since, computing the overlap in the production env was easier with the normal distributions, I was okay with the conversion. I will cover this in more detail in a future post.
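As an aside, once both distributions are normal, the overlap computation mentioned above is nearly a one-liner: statistics.NormalDist ships an overlap() method that returns the overlapping coefficient of the two densities. The numbers below are purely illustrative, not my actual production parameters.

```python
from statistics import NormalDist

# Hypothetical customer-spend and outlet-price distributions, both on
# the log scale (i.e. after the lognormal -> normal conversion).
customer = NormalDist(mu=5.0, sigma=0.5)
outlet = NormalDist(mu=5.3, sigma=0.4)

# overlap() returns the shared area under the two PDFs, between 0 and 1.
match_percentage = customer.overlap(outlet)
print(round(match_percentage, 2))
```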
I stayed at Zostel📍. Advice: get the ten-bed dorm. It has access to the balcony. You can see all of Ooty from the balcony. At night, the lit-up Ooty City takes your mind away from everything.
On my workations, my stay in any city has taken one of the following tracks: make friends and go with the flow or explore solo. Ooty was the city of friends.
I reached the hostel early in the morning. While I waited for my dorm bed to get cleaned up, I met Arvind and Sriram. All of us were in the same dorm. Enjoying the Ooty view from the balcony, we had an engaging conversation. It was about introduction, backgrounds, work, and protocols. Introducing protocols is out of scope for this post, but do follow the shared link. The conversation ended when my time for a work meeting got closer.
There was almost a routine to our days. I get up in the morning and start my work after breakfast in the Hostel Cafe. These guys would chill or go out to some tourist points. I would have lunch and continue working till the evening. After dinner, all three of us and others (new people came and went every other day) would sit around the bonfire and talk. Sometimes we sang and played music. I would also try to produce some sounds with my Ukulele.
One evening, wrapping up early from work, three of us went for a movie. On all my travels, I have watched movies on hostel TVs, but never once have I gone to a movie in a theatre. It was a novelty experience. Let me describe the whole scene.
The movie in focus was Blue Beetle (judge all you want 😅). The only way to book the ticket was at the theatre called Assembly Rooms📍. We reached there ten minutes early. Including us, there were only five people present. The booking window was not yet open. At the show time, the ticket guy informed us that he needed at least six people to start the show. Our strength was still five. We agreed to buy an extra ticket (each ticket was 180 bucks.) After waiting for five more minutes, he let us watch the movie. The same guy was also the movie operator and started the show. The whole theatre was ours. We watched the senseless movie from the 3rd row and made fun of the various scenes.
I had conversations with both of them on a range of topics. The topic of protocols continued. We also talked about green and sustainable energy (Arvind’s forte), their college life (they know each other from college), Peru (Arvind’s base), Ooty’s past (Arvind used to come here from Coimbatore in his childhood), quantified self, music, learning music, tradition of learning music or dance in Tamil families, and a lot more.
One by one, both of them left. It is always sad when someone leaves after you have spent time with them. It is a reality of life, but it is still sad to be left behind. Despite that, you keep going, and then it becomes normal again. You meet new people, and the cycle continues.
Ooty (and the whole of Tamil Nadu) is hard on solo travellers without transport. In every city, I rent a scooty to travel around. Ooty (read: Tamil Nadu) doesn’t allow renting vehicles. The weekend was here. I wanted to see Ooty. So, I asked for the touring taxis at the reception. The taxi guy quoted 2500 bucks for the whole day. It was time to make new friends.
I met Rajesh, who was also looking to share the taxi. Meghna and Gargee were the final two. And our impromptu travel group was ready. Although the taxi driver increased the rate to 3100 INR, it was still better for all four of us.
Rajesh is a PhD student in Geology. He was studying rocks somewhere near Ooty and decided to spend the weekend here. He is more interested in Climate Science and will join another PhD program in a few weeks.
Meghna and Gargee are childhood friends from Gwalior, Madhya Pradesh. Meghna works at an IT company in Pune. Gargee is an architect in Bangalore. She left her job and was moving to Ahmedabad to become a Landscape Architect. The reason for them coming to Ooty was to enjoy Gargee’s last trip from Bangalore.
We went to the following points on our tour.
It took us till evening to cover all these points. We asked our cab driver to drop us in the city market. It was a full-moon night. We checked out some shops in the market, ate our dinner at Adyar Ananda Bhavan - A2B 🍽️ and called it a night after warming up around the bonfire.
In the morning, I accompanied Meghna and Gargee to the Bus Depot. They were leaving for Wayanad. After seeing them off, I went to explore the city. I saw a watch tower. On my way, I saw scores of people entering and leaving an alleyway. My curiosity made me go into the alley, and I saw a complete change in the landscape. I was standing in a fruit and veggie market. There were multiple alleys leading to different sections of the market: fruits, veggies, meat, dry fruits, flowers, and other grocery stuff. It felt like the Diagon Alley in the Harry Potter universe.
After coming back, Mugdha and Madhvi were painting the wall art in the common area.
A little backstory. My dorm was right next to the common area. On my first day in the hostel, the staff cleared and re-painted a wall in the common room for a new wall art. Mugdha came a few days later to pencil the outline. The outline consisted of the elements of Ooty: coffee plantation, toy train, toda tribe, rose garden, and pine forest. During the weekend, Madhvi arrived. Both of them started colouring the sketch. We became friends, and I started my apprenticeship under them. Soon it became a group of seven: Mugdha, Madhavi, Anshul, Sumit, Gautam, Vani, and me.
So, I started helping them with colouring after returning from the market. The last time I held a paintbrush was in the 9th grade, more than 14 years ago. I coloured different shades of roses, coffee beans, leaves, and grasslands. I enjoyed filling up those shapes using a paintbrush. I first added a colour coat without worrying about the brush strokes. Later, I painted over it to make it consistent. I aligned the strokes with the outline to make it coherent. I experimented with multiple ways of moving my brush. Slowly, I could achieve the same effect in fewer steps and less paint. It was a very calming activity. I used to leave more delicate stuff for my teachers. By the end, both of my teachers were proud of my work. 😁
Along the way, all of us talked extensively. Both Mugdha and Madhvi are from Mumbai and friends from college. Mugdha is into art, and Madhvi likes textile design. I learnt how students learn in art schools in India. Madhvi had been working in Design Thinking for school kids. She is going to get into textile design next. Mugdha is going to experiment more with painting and colours. One morning, Mugdha introduced me to Aahatein by Agnee. (This song was at the top of my Spotify Wrapped this year.) We discussed similar songs. Madhvi showed her skills on my Ukulele. I also got to know about the Kochi-Muziris Biennale. I witnessed the traces of previous iterations of Biennale when I went to Kochi a few weeks later.
Anshul, who works on APIs for clients, was trying to explain to Madhvi and Mugdha what he does. While trying to help him explain, we learnt that many women, like Madhvi and Mugdha, enjoy having nerdy talks and would date such men. Then the topic moved to interesting or weird dates many of us have experienced.
Gautam played a ten-minute movie called Zima Blue from the Love, Death & Robots animation series. Gautam is a volunteer teacher in a nearby village. He is visiting India for a few months, following which he’ll return to the US. Sumit was on a road trip on his KTM and headed back home to Pune from Ooty. We talked about many of his road trips. Vani was on a weekend trip from Bangalore.
I enjoyed painting so much that I extended my stay by a few more days. I, unfortunately, had to say goodbye to everyone before it was complete.
From the hostel, I took the bus to Coonoor📍. And from Coonoor, I took the toy train to Mettupalayam📍.
Fortunately, I got the window seat assigned to me. The train - running on steam - passed through multiple bridges, coffee plantations, and dark tunnels opening up to beautiful valley views. The train stops at two railway stations along the way. Both the stations had an old vibe to them. On the second stop, they also fueled up the engine with water.
At the end of this journey, I took another bus to Coimbatore📍 and reached my hotel (no hostels in Coimbatore ☹️). I met with two people here: Arvind and Guhan. Arvind, whom I had met at the beginning of this post, returned to Coimbatore after leaving Ooty. Guhan is a friend I made in Hampi earlier this year. The story of Coimbatore will continue in another post.
Will you display it as a table? Not everyone can grok it. It will also take time to walk people through the table.
You could format your table with colours (conditional formatting) to show peaks and bottoms. That will work. However, it becomes dense as the number of rows increases. Furthermore, this workflow involves exporting data from your respective data system (database or data lake) and importing it into Excel/Google Sheets. Thus, it is not feasible in all situations. One of those situations is what I faced.
I was doing an analysis of customer data at work. I wanted to see the distribution of cuisines in two subsequent orders. For example, the customer ordered Chinese food followed by South Indian in the next order. Because sequence matters for my analysis, Chinese to South Indian and South Indian to Chinese would be two separate rows. As you can imagine, a significant part of the GroupBy output contained these redundant pairs. It was difficult to derive any insights from it.
Fortunately for me, I was able to recall bipartite graphs. Bipartite graphs model the relationship between two classes of objects. For example, think about the relationship between owners and their cars. An owner can own one or more cars. An owner cannot own other owners. Similarly, a car cannot own other cars. A bipartite graph will only show a relationship between a vehicle and its owner (two different classes of objects).
It was perfect for my visualisation problem at hand!
However, generating a presentable graph turned out to be slightly roundabout. This article documents the process for my future self.
As expected, the NetworkX Python library had all the utilities available. The steps are as follows:
- Define the data as a networkx Graph.
- Use .bipartite_layout() to define the layout for a bipartite graph.
- Render the result with .draw().
There are more minor steps involved that we will cover during the deep dive. Since NetworkX plays well with the Matplotlib library, we have all the Matplotlib utilities available to us.
I will visualise the age-wise top causes of death according to WHO.
We start with the necessary imports.
import random
import pandas as pd
import networkx as nx
from matplotlib import pyplot as plt
We have to pre-process the data for the viz.
data = pd.read_csv("male.csv").set_index("cod").T
data.columns = ["cod_"+i for i in data.columns]
data = data.rename_axis('age_group').reset_index(drop=False)
data = pd.wide_to_long(
data, stubnames="cod", i=['age_group'], j="cause", sep='_', suffix=r'[\w ,]+'
)
data.columns = ["percent"]
data = data.reset_index(drop=False)
data["percent"] = data["percent"].str[:-1].astype(float)/100
data = data[data.cause != "All Causes"]
data.head(2)
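If pd.wide_to_long is unfamiliar, here is what it does on a toy frame shaped like ours (the column values are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({
    "age_group": ["0-4", "5-14"],
    "cod_Cancer": [0.1, 0.2],
    "cod_Injuries": [0.3, 0.4],
})
# Stubname "cod" gathers every cod_<cause> column into rows keyed by
# (age_group, cause), turning one row per age group into one row per pair.
long_df = pd.wide_to_long(
    wide, stubnames="cod", i=["age_group"], j="cause", sep="_", suffix=r"[\w ,]+"
).reset_index()
print(long_df)
```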
The data is ready. I wanted all the edges with the same start in the same colour. So I added an integer corresponding to each class using the below code. We will use this column to get a random colour for each label with a colour map.
# colors
node_dict = dict([(j, i) for i, j in enumerate(data['age_group'].unique())])
data["node_color"] = data["age_group"].apply(lambda x: node_dict[x])
data.head(2)
With the data loaded and converted from the wide to the long format for NetworkX, we define our graph using this data.
edges = [tuple(x) for x in data[['age_group', 'cause']].values.tolist()]
B = nx.Graph()
B.add_nodes_from(data['age_group'].unique(), bipartite=0)
B.add_nodes_from(data['cause'].unique(), bipartite=1)
B.add_edges_from(edges)
Below is how we visualise the graph.
# matplotlib variables
fig, ax = plt.subplots()
fig.set_size_inches(9, 6)
# First specify the nodes we want on left or top
# create a bipartite layout
left_or_top = data['age_group'].unique()[::-1]
pos = nx.bipartite_layout(B, left_or_top, scale=10)
# Pass that layout to nx.draw
nx.draw(B, pos, node_color='#A0CBE2', edge_color="white", width=1)
We define the Matplotlib variables, use bipartite_layout to get the required layout, and draw the graph. Note that without edge_color="white", we could stop at this step: we would get equal-width, constant-colour edges and nodes. The next few steps fix the presentation aspects of the plot.
We colour the edges first.
# define random color map - https://stackoverflow.com/a/68459848/2650427
colors_ = lambda n: list(
map(lambda i: "#" + "%06x" % random.randint(0, 0xFFFFFF), range(n)))
colors = colors_(len(data.age_group.unique()))
# draw each edge
edge_width_dict = (
data[['age_group', "cause", "percent"]]
.set_index(['age_group', "cause"])
)
for node in data[['age_group', "node_color"]].drop_duplicates().values:
edges = B.edges([node[0]])
color = colors[node[1]]
edge_widths = [edge_width_dict.loc[i]["percent"] for i in edges]
nx.draw_networkx_edges(
B,
pos,
edgelist=edges,
width=edge_widths,
edge_color=color,
)
We iterate through all the starting nodes and their corresponding colours. For each node, we get its edges and colour them the same while varying their width according to the percent column.
The last configuration is the node labels and their alignment. Without this segment, all the node labels would be centre-aligned, and long strings would be truncated in the viz. I want to point out that neither the documentation nor Stack Overflow could help me here. My saviour was ChatGPT. It gave me a working example using draw_networkx_labels() that I modified as below.
# left node labels alignment
for node_name in data['age_group'].drop_duplicates().values:
node = {node_name: node_name}
node_pos = {node_name: pos[node_name]}
label_pos = nx.draw_networkx_labels(
B, node_pos, labels=node, font_size=10,
horizontalalignment='left',
verticalalignment="bottom"
)
# right node labels alignment
for node_name in data['cause'].drop_duplicates().values:
node = {node_name: node_name}
node_pos = {node_name: pos[node_name]}
label_pos = nx.draw_networkx_labels(
B, node_pos, labels=node, font_size=10,
horizontalalignment='right',
verticalalignment="bottom"
)
plt.show()
Time to see the results.
Male children mostly die due to Infectious and parasitic diseases, Respiratory infections, Maternal conditions, Neonatal conditions, and Nutritional deficiencies. Most teen and youth deaths (15-29 years in age) happen due to injuries. As men get old, serious ailments (Birth ailments, Cancer, Cardiovascular, Respiratory, and others) become more pronounced causes of death.
Females follow a similar distribution. One notable difference is that relatively few women die due to injuries. Is that the reason women live longer than men?
The plots effectively showed the common diseases for each age group. Of course, this plot only gives a summary. And the summary is what we wanted from this viz.
The plots were 90% there. Unfortunately, there are a few flaws.
While it provides me with a summary, it does not tell me the strength of the relationship. In that aspect, it is similar to pie charts. And the internet is filled with articles about why pie charts are unhelpful plots.
Another issue is the random colour and edge width assigned to each edge. An edge may end up a pale yellowish-green; even if its width is relatively large, it will still not be prominent. I re-ran my code until I got a version with the right colours. We could solve this by hand-selecting the colours and tuning the edge widths with a constant factor.
We wanted a summary visualisation of our GroupBy (or pivot table) output. To achieve that, we converted it into a bipartite graph and rendered it using Matplotlib.
There are flaws in this visualisation. The strength of the relationship is not apparent. Additionally, edge colour and widths need tuning to make the strong relationships prominent. Fixing these issues is a future work.
I have a JSON column in my DataFrame. I need to format it as a JSON object (struct) to extract anything out of it. How do I convert it into a struct?
Here is the solution if you are short on time. In the next section, I discuss it in more detail.
# Spark 3.2.1 | Scala 2.12
import pyspark.sql.functions as F
# Sample json we will work with.
sample_json = """
{
"lvl1": {
"lvl2a": {
"lvl3a": {
"lvl4a": "random_data",
"lvl4b": "random_data"
}
},
"lvl2b": {
"lvl3a": {
"lvl4a": "random_data"
},
"lvl3b": [
{"lvl4a": "random_data"},
{"lvl4b": "random_data"}
]
}
}
}
"""
# Spark dataframe with json column (assumes an active SparkSession named spark)
df = spark.createDataFrame([(sample_json,)]*4, ["json_data"])
# determine the schema
json_schema = F.schema_of_json(df.select(F.col("json_data")).first()[0])
# converting json to struct
df = df.withColumn("json_data_struct", F.from_json("json_data", json_schema))
We will use pyspark.sql.functions.schema_of_json to do the dirty work of determining the schema. Just like any other column-based function, I expected it to work on a column, so I passed the json_data column directly. It threw an error complaining that the argument must be a foldable string. I did not know what a foldable string is; the data type of the json_data column was a string. ChatGPT also suggested the same (non-working) way of using this function. :)
The documentation and multiple Stack Overflow answers [1, 2, 3] helped me reach an explanation.
schema_of_json needs a single literal string instead of a column. So I extracted one JSON string from the column and passed it to the function. This is how I did it:
json_string = df.select(F.col("json_data")).first()[0]
json_schema = F.schema_of_json(json_string)
The end.
I like self-tracking. I want to track my productivity and find potential improvements. What gets tracked also gets measured. My interest in the Quantified self has evolved.
I started with Gamification of Life, where I assigned points to everything I did. It got too overwhelming after a year.
I am bad at maintaining relationships with friends and family. For insights, I analysed my chats' metadata in Chatting Up and Chatting Up - Part II. It was interesting to see how my interactions changed over time with friends. I also wanted to play with the chat content, but NLP capabilities at that time weren't enough to deal with Hinglish text.
I track my call logs and someday would like to analyse them.
Two years back, I even started making an app to track anything. You can read more about the app here:
Eventually, other commitments and curiosities caught up, and I couldn’t finish it. 😅
I also track my spending and habits. I am sure I have skipped a few more.
Fitness was another frontier where I tried many things. Tracking was always enabled using Google Fit and Maps, but I did not do anything with the data. Google Fit analytics on the app was helpful, but I wanted more. Time was difficult to find between work, travel, and other interests. I wanted something quick and easy to develop/maintain.
Introducing: Do More With Less (DMWL). It is a trend at work where you identify and prioritize the tasks that are quick to execute with good ROI.
I googled previous work in this direction. I found this Medium article that pointed to a more helpful article doing exactly what I wanted: Export Google Fit Daily Steps, Weight and Distance to a Google Sheet [code from the blog is available here: google-fit-to-sheets/Code.gs]. It made almost everything straightforward: setting up the app, auth, and pulling and formatting the data. ChatGPT complemented my lack of JavaScript knowledge to write the code in Google Sheets.
Update: 19th July
I have had to set up the credentials multiple times now, so I am adding the steps here for quick reference. In-depth instructions are in apps-script-oauth2/README.md.
- Use 1B7FSrk5Zi6L1rSxxTDgDEUsPzlukDsi4KGuTMorsTQHhGBzBkMun4iDF as the Script ID. This will find the Google OAuth2 Lib. Select the latest version and save.
- Use https://script.google.com/macros/d/{SCRIPTID}/usercallback as the redirect URI and replace {SCRIPTID} with the Script ID copied in step 3. Note down the client id and client secret created at the end. We will use these in our script. [You can read more on Setting up OAuth 2.0.]
Update: 28th July
The script stops working after a few days. This has happened twice with me. I get the following error:
Error: Access not granted or expired.
Service_.getAccessToken @ Service.gs:518
I still haven’t found the solution, but I have a few leads:
End of the Update
In this section, I will discuss the coding involved. You can skip to the next section for the dashboard.
Here is how the code flow goes:
- Request the required data types from the Fit API: com.google.step_count.delta, com.google.heart_minutes, com.google.weight.summary, and com.google.activity.segment.
- Get the sheet with getActiveSpreadsheet() and append the data on the last empty row using getLastRow().
- Carry the formatting over to the appended rows with the copyTo(destination, options) function.
It is dense! Extending it to my signals required scouring over the docs and multiple SO answers.
The first helpful link was: Users.dataSources: list. It gave me all the data points I can ask for from the Fit API. The next challenge was discovering the schema and what different fields meant. After several hit-n-trials, I found the Activity Types page. It gave me the required ID for each activity.
Update: 28th July: I stumbled upon the guide to the REST API of Fit.
I hope you will find these links useful.
There are two immediate hurdles I need to cross.
I have to authorize the app every day to call the API. A quick search told me that the access token expires after an hour and that the response carries an expires_in parameter relevant to refreshing it. Unfortunately, I could not figure out how to use it.
Similarly, I have to call the function daily (by pressing a button from the menu). I can automate it through the time-driven (clock) trigger [ref: triggers]. The problem is the token expiration. The trigger will fail the next day because the token is stale.
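The staleness check itself is simple. Here is a hedged Python sketch; the function name and the 60-second safety margin are my assumptions, and the actual refresh would go through Google's OAuth token endpoint with the refresh token.

```python
import time

# Hypothetical check: should we refresh the access token before the next call?
# issued_at is when the token was obtained; expires_in arrives with the token.
def is_token_stale(issued_at, expires_in, now=None, margin=60):
    now = time.time() if now is None else now
    return now >= issued_at + expires_in - margin

# Example: a token issued at t=0 with expires_in=3600 is stale after ~59 minutes.
```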
Time for the final results.
Google Fit gives you a Heart Point (HP) for each minute of activity you do. Here is how mine looks. My heart points mainly include walking, working out, and a little bit of swimming.
The red line is an aggregated line to make it easy to see the trend.
The yellow-shaded regions highlight the days I was workationing. During those days, my heart points rarely reached zero, whereas in the non-shaded periods I frequently hit zero. Those zeroes are my two rest days after five days of working out.
My heart points are also cyclic in nature. Whenever I am home, my only regular activity is exercising with a two-day break every week. Walking becomes an occasional affair.
Let’s look at these heart points more closely.
Can you notice the complementary nature of the two graphs? First, look at the red line and then focus on the blue one.
During the travel period, my HP fluctuated because of the crazy number of daily steps (a proxy for walking) and irregular, short workouts. And when I am at home, I go crazy with my exercise. During May, a few things at home led to more walking and short, irregular workouts.
That’s the end of my DMWL version of the Fitness dashboard.
The first stage of my dashboard is complete. I will iteratively update it to get more out of it. I discussed the tech improvements in the coding section. As a part of my fitness tracking journey, here is what I want to do.
I want to come up with some standards or thresholds for myself. It could mean saying something like reaching 30-40 HP daily or working out for fifty minutes five days a week. I do not know what these standards will look like.
I want to use this dashboard to motivate myself. It can be a tracker for fitness-related habits. It should all be automated. Consequently, I want this dashboard to help me inculcate new fitness-related habits. Towards this vision, the next step is to add more activities to the mix, namely meditation, cycling, and swimming.
I live a healthy lifestyle. Another goal of this dashboard is to observe how different decisions in my life impact my health and productivity. That will help me course correct. Some directions are:
There are more, but these are the important ones in my mind.
Update: 19th July: I added meditation and cycling under the activities section. In the nutrition section, I added calories burnt and water intake. Calories burnt is Fit’s approximation. The water intake is tracked manually in the app like the activities.
Good intentions never work, you need good mechanisms to make anything happen - Jeff Bezos.
A mechanism is a process where
Read more - Building mechanisms.
I have observed that mechanisms work more efficiently than good intentions. Thus, I always try to convert good intentions into mechanisms. Looking at how Data Science projects are not as streamlined as Software Engineering projects, I searched for guidance on managing them better.
The following are my notes from Mechanisms for Effective Machine Learning Projects and related articles. These mechanisms apply to any data science project (work or hobby). Note that these mechanisms are mental tools; for them to work, you have to make them a habit.
Share unusual findings; Discuss ideas; Get help on bugs; Ask for reviews; etc.
Mechanism to execute projects with high confidence.
Classification, regression, or something else?
How was data excluded, preprocessed, and rebalanced? How were labels defined? Was a third, neutral class added? How were labels augmented, perhaps via hard mining?
How was the training and validation set created? What offline evaluation metrics did they use? How did they improve the correlation between offline and online evaluation metrics?
For single paper
For multiple papers from the same domain
Find more here: How Reading Papers Helps You Be a More Effective Data Scientist.
I will keep updating/adding to this list as I read/experiment more.
My first interaction in the city: ₹250 for an auto for 2 KMs. That's just robbery! He came down to ₹100 after I told him off. Then I found another auto guy who said ₹80. (This was also high, but I was tired.)
Belagavi is the (proposed) 2nd capital of Karnataka. It is a controversial city: both Maharashtra (Maha) and Karnataka want it within their borders. I don’t understand this border dispute. 😞
Update: A journalist friend, Sarayu, gave me a few pointers here. Language is the reason for this fight. Most of North Karnataka up to Goa knows Hindi, thanks to Nizam rule and the districts that border MP and Maharashtra. And Hindi becomes more important because Bombay is closer than Bangalore. On top of that, this region is strategically important because it is geographically close to upcoming districts like Karwar. There is also a spurt of educational institutions like IIT Dharwad/Raichur coming up.
Language. Almost everyone understood Hindi. Everyone knew Kannada and Marathi.
Update: Belgaum is not a prosperous region and is politically weak. Sarayu mentioned that since Mumbai is nearby, Belagavis go there for work, so they learn Hindi and Marathi. This also means that the youth leave for work in other cities, and mostly the older generations remain in Belgaum.
Payments. After using cash everywhere in Maha, I was surprised that, barring a few auto drivers, UPI was widely accepted in Belgaum. It looks like Belgaum is on its way to becoming a smart city.
Stay. No hostels. Multiple hotels are available, but all average or below average. I chose a hotel in the market area.
Food. The first thing was Belagavi Biryani. I tried it at a famous chain called Niyaaz Restaurant (Main Branch) 📍. I'd rate it 3.5/5. A friend (I met her later during the week) mentioned that Niyaaz is overrated.
The second thing was a dessert called Belagavi Kunda. I loved it! All the sweet shops sell Kunda. You can buy 100 grams for ₹20 or ₹25.
The last thing was to try the Maharashtrian thali available at indie restaurants.
Tourist Points. There are two points in the Belagavi Fort area: Kamala Basadi 📍 and Safa Masjid 📍.
Story time. After wrapping up work, I left for the Fort area. I reached my destination after walking 3 KMs and found that it was an army area. No one stopped me from going inside. I went to the mosque. Unfortunately for me, it was inside the army-protected area. My phone said 9 PM. The guards at the entry started questioning me. They let me go only when they were satisfied that I wasn't a bad element. Of course, I couldn't see the masjid. Apparently, it is open to the general public only during Bakrid. Afterwards, while coming out of the main area (where no one had stopped me while entering), another army person stopped me. The same barrage of questions came my way. On top of that, he checked my bag and photographed my ID. It turns out that the area becomes a no-movement zone after 9 PM. The next day, I visited Kamala Basadi and ended my tourist mode. 😅
Vibe. Most of my interaction was during transactions: shop owners, hotel folks, or auto drivers. I almost never got a polite response from anyone. They were not rude either. There was no warmth in the interaction. No one smiled. Just plain dry transaction. Belagavi is the first city where I felt like that.
Story time. When I went to Kamala Basadi after the night incident, I went by an auto. I was already apprehensive of auto guys after my first night. This guy turned out to be the opposite. Abdul Rashid Shaikh and I talked about my reasons for coming to the town. The ₹250 story came up. He cursed all these auto guys, and the discussion (mostly me listening) went to good and bad deeds. He then took me to the Basadi and the Masjid (only to be denied entry by the army person), and dropped me back at the hotel. All within ₹200. I would have paid more had I done these trips separately. He also suggested skipping a few cities from my itinerary. If you are in Belagavi, call him on +91-8880866313.
I usually stay at hostels. Staying at a hotel was an experiment. Sadly, it was a failure. As a solo traveler, staying at hostels is more enjoyable than at a hotel. And usually hostels are in the cities with multiple points to explore or with good vibes. Belagavi had none. Thus, I decided to go to the cities with hostels.
Next stop: Hampi.
`tf.Variable` is for dense variables. For sparse variables, the authors created HashTable operations (which, unlike `Variable`, can grow). The `Variable` construct freezes the shape of the matrix throughout the training/serving process; thus, it is a fixed-size embedding. HashTable lookup: O(1); insertion: amortized O(1).
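To make the contrast concrete, here is a toy Python sketch (my illustration, not the paper's TensorFlow implementation) of a hash-table-backed embedding that grows with new IDs, unlike a fixed-shape `Variable`:

```python
import random

class HashEmbedding:
    """Toy dict-backed embedding: O(1) lookup, amortized O(1) insertion."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}  # id -> vector; grows on demand

    def lookup(self, key):
        # Unseen IDs get a lazily initialized vector instead of a fixed slot.
        if key not in self.table:
            self.table[key] = [random.gauss(0.0, 0.01) for _ in range(self.dim)]
        return self.table[key]

emb = HashEmbedding(dim=8)
v = emb.lookup("user_42")  # the table grows as new sparse IDs arrive
```

A fixed-size embedding would need the full ID space allocated up front; here, memory scales with the IDs actually seen.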
(I will skip the experiment setup and jump over to the results.)
Is the 2nd figure correct? The 5hr sync interval model should degrade till the sync happens. After sync, it should have similar AUC as other models. It should then degrade again from that point until the next sync. That is not happening here. What am I missing?
The excerpt below explains the reason:
Overall, this paper adds to my belief that a successful system requires clever engineering.
One does not get mentoring lessons at school. There is not enough time to read books on effective coaching. My only guides have been two specific rules.
First is thinking about how I would have wanted my mentor to coach me. And then coach my mentee the same way.
The second is to observe my current mentors. Notice the techniques that enable my growth. Inculcate these methods in my mental model. Similarly, spot where they are ineffective and learn to avoid them.
Recently, my boss inadvertently showed me a flaw in my first rule.
Good design is thorough down to the last detail. Nothing must be arbitrary or left to chance. Care and accuracy in the design process show respect towards the consumer.
- Dieter Rams' 8th principle of Good Design.
That embodies my personality. I believe that my work should be proper (read, perfect). And, at times, I become rigid about maintaining that standard. Thanks to this, my output is usually of good quality (not bragging 😅). That satisfies me. The gratification keeps me intrinsically motivated. Thus, I continue to work like this.
My rule assumes how I would have wanted my mentor to coach me. That means that I presume my coach to have equally high standards. It is a flaw. Since everyone is different, my way of operating does not work for everyone.
I also have a patience problem. When I think I can do something faster and the other person takes more time, then that annoys me. I have worked on this quite a lot in the last few years. But there is room for improvement.
My overarching goal is to be a good leader. I believe a good leader is also an effective mentor and coach. So, I am actively going to make myself a good mentor.
The following are the next steps for me:
Let’s see how it goes.
Embedding layer: 100 dim

```python
torch.nn.Embedding(num_embeddings=all_hotels, embedding_dim=100)
```

(The hotel embeddings are concatenated with the other features using `torch.cat()`.)

2 BiLSTM layers

```python
torch.nn.LSTM(
    input_size=1530,
    hidden_size=512,
    num_layers=2,
    bidirectional=True)
```

(The LSTM output is flattened using `torch.nn.Flatten()`.)

4 ReLU dense layers

```python
# Each Linear layer is followed by a ReLU activation.
dense = torch.nn.Sequential(
    torch.nn.Linear(512 * 2, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 1), torch.nn.ReLU(),
)
```
Metric formulation
\[\text{Sim Index @ x} = \frac{\sum_{i=1}^{H} \text{sim@x}(\text{top-10 hotels}, i)}{H}\]
\(\text{Precision@k}\) or Hit Ratio: fraction of users for which the booked hotel was among the top-k recommendations.
\[\text{Precision@k} = \frac{U_{hit}^k}{U_{all}}\]
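Precision@k is straightforward to compute; here is a minimal Python sketch (the function name and the toy data are mine):

```python
# Hit Ratio / Precision@k: fraction of users whose booked hotel
# appears in their top-k recommendations.
def precision_at_k(recommendations, booked, k):
    hits = sum(1 for recs, b in zip(recommendations, booked) if b in recs[:k])
    return hits / len(booked)

recs = [[7, 3, 9], [1, 4, 2]]  # ranked hotel IDs per user
booked = [3, 2]                # hotel each user actually booked
score = precision_at_k(recs, booked, k=2)  # user 1 hits, user 2 misses
```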