Playground

Lognormal to Normal Distribution

2024-01-14T00:00:00+00:00

The normal and lognormal distributions are fundamental concepts in statistics. I recently used the relationship between these two distributions in a project. In this blog post, I want to share what I learned.

Outline

Normal & Lognormal Distributions
Lognormal to Normal
Normal to Lognormal
Conclusion

Normal & Lognormal Distributions

The normal distribution is also called the bell curve or Gaussian distribution. The position of the bell's peak marks the mean, and the width of the bell reflects the spread of values (standard deviation). Thus, the shape changes as we change mu (\(\mu\)) and sigma (\(\sigma\)): \(\mu\) is the mean, and \(\sigma\) is the standard deviation. We denote a normal distribution as:

\[{\mathcal {N}}(\mu ,\sigma ^{2})\]

Find more details about the normal distribution on Wikipedia. Here are two ways of defining a normal distribution in Python.

Using python stdlib

from statistics import NormalDist
mu, sigma = 5, .5
norm_dist = NormalDist(mu, sigma)

Using scipy

import scipy.stats as stats
mu, sigma = 5, .5
norm_dist = stats.norm(mu, sigma)

We get a lognormal distribution when we exponentiate a normally distributed variable. The result is a lopsided curve with a longer tail on the right side, where larger values occur. We denote the lognormal distribution as follows:

\[{\displaystyle \ X\sim \operatorname {Lognormal} \left(\ \mu _{x},\sigma _{x}^{2}\ \right)\ }\]

Since the log of the lognormal distribution is a normal distribution, we can denote the relationship as follows:

\[{\displaystyle \ln(X)\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\]
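A quick way to convince yourself of this relationship is to simulate it. The sketch below uses only the stdlib: random.lognormvariate draws lognormal samples, and NormalDist.from_samples fits a normal distribution to their logs. The sample size and the seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

mu, sigma = 5, 0.5
rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw lognormal samples: each one is exp(Z) with Z ~ N(mu, sigma^2).
lognormal_samples = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

# Taking the natural log should bring us back to the normal distribution.
log_samples = [math.log(x) for x in lognormal_samples]
fitted = NormalDist.from_samples(log_samples)

# fitted.mean and fitted.stdev should land close to mu and sigma.
```

With enough samples, the fitted mean and standard deviation should sit close to the mu and sigma we started with.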

Find more details about the lognormal distribution on Wikipedia. The statistics module in the Python stdlib has no lognormal counterpart to NormalDist (random.lognormvariate can only draw samples), so we define a lognormal distribution with scipy as follows.

import numpy as np
import scipy.stats as stats
mu, sigma = 5, .5
lognorm_dist = stats.lognorm(s=sigma, scale=np.exp(mu))

Note: scipy.stats.lognorm is parameterised by the mu and sigma of the underlying normal distribution, that is, the distribution of the log of the values. The shape parameter s is sigma, and the scale parameter is the exponential of mu. I found the documentation inadequate in explaining the parameters. This SO question has answers that discuss the meaning of the parameters.
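One way to check this reading of the parameters is to compare the scipy object against the textbook formulas for the lognormal: the median is exp(mu), the mean is exp(mu + sigma^2 / 2), and the CDF at x equals the CDF of the underlying normal at ln(x). The sketch below is only a sanity check under those formulas; the test points in x are arbitrary.

```python
import numpy as np
import scipy.stats as stats

mu, sigma = 5, 0.5
normal = stats.norm(mu, sigma)
lognormal = stats.lognorm(s=sigma, scale=np.exp(mu))

# The median of the lognormal is exp(mu) ...
median_matches = np.isclose(lognormal.median(), np.exp(mu))
# ... and its mean is exp(mu + sigma^2 / 2).
mean_matches = np.isclose(lognormal.mean(), np.exp(mu + sigma**2 / 2))

# P(X <= x) for the lognormal equals P(Z <= ln x) for the underlying normal.
x = np.array([50.0, 150.0, 400.0])
cdf_matches = np.allclose(lognormal.cdf(x), normal.cdf(np.log(x)))
```

If any of the three flags came out False, s and scale would not mean what the note above says they mean.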

Here is how both the distributions look for the same mu (\(\mu\)) and sigma (\(\sigma\)).

Code to generate the below plot.

Ooty: Friendships, Travel and Painting 📍

2023-12-10T00:00:00+00:00

I spent the last two weeks of August in Ooty📍, a hill station in Tamil Nadu. Transitioning from Chennai’s heat to Ooty’s cold within a day was drastic. My hoodie was happy to be out from the bottom of my bag.

I stayed at Zostel📍. Advice: get the ten-bed dorm. It has access to the balcony. You can see all of Ooty from the balcony. At night, the lit-up Ooty City takes your mind away from everything.

On my workations, my stay in any city has taken one of the following tracks: make friends and go with the flow or explore solo. Ooty was the city of friends.

I reached the hostel early in the morning. While I waited for my dorm bed to get cleaned up, I met Arvind and Sriram. All of us were in the same dorm. Enjoying the Ooty view from the balcony, we had an engaging conversation about introductions, backgrounds, work, and protocols. Introducing protocols is out of scope for this post, but do follow the shared link. The conversation ended as my work meeting drew closer.

There was almost a routine to our days. I would get up in the morning and start my work after breakfast in the Hostel Cafe. These guys would chill or go out to some tourist points. I would have lunch and continue working till the evening. After dinner, all three of us and others (new people came and went every other day) would sit around the bonfire and talk. Sometimes we sang and played music. I would also try to produce some sounds with my Ukulele.

One evening, wrapping up early from work, the three of us went for a movie. On all my travels, I have watched movies on hostel TVs, but never once have I gone to a movie in a theatre. It was a novel experience. Let me describe the whole scene.

The movie in focus was Blue Beetle (judge all you want 😅). The only way to book the ticket was at the theatre called Assembly Rooms📍. We reached there ten minutes early. Including us, there were only five people present. The booking window was not yet open. At the show time, the ticket guy informed us that he needed at least six people to start the show. Our strength was still five. We agreed to buy an extra ticket (each ticket was 180 bucks.) After waiting for five more minutes, he let us watch the movie. The same guy was also the movie operator and started the show. The whole theatre was ours. We watched the senseless movie from the 3rd row and made fun of the various scenes.

I had conversations with both of them on a range of topics. The topic of protocols continued. We also talked about green and sustainable energy (Arvind’s forte), their college life (they know each other from college), Peru (Arvind’s base), Ooty’s past (Arvind used to come here from Coimbatore in his childhood), quantified self, music, learning music, tradition of learning music or dance in Tamil families, and a lot more.

One by one, both of them left. It is always sad when someone leaves after you have spent time with them. It is a reality of life, but it is still sad to be left behind. Despite that, you keep going, and then it becomes normal again. You meet new people, and the cycle continues.

Ooty (and the whole of Tamil Nadu) is hard on solo travellers without transport. In every city, I rent a scooty to travel around. Ooty (read: Tamil Nadu) doesn’t allow renting vehicles. The weekend was here. I wanted to see Ooty. So, I asked for the touring taxis at the reception. The taxi guy quoted 2500 bucks for the whole day. It was time to make new friends.

I met Rajesh, who was also looking to share the taxi. Meghna and Gargee were the final two. And our impromptu travel group was ready. Although the taxi driver increased the rate to 3100 INR, it was still better for all four of us.

Rajesh is a PhD student in Geology. He was studying rocks somewhere near Ooty and decided to spend the weekend here. He is more interested in Climate Science and will join another PhD program in a few weeks.

Meghna and Gargee are childhood friends from Gwalior, Madhya Pradesh. Meghna works at an IT company in Pune. Gargee is an architect in Bangalore. She left her job and was moving to Ahmedabad to become a Landscape Architect. The reason for them coming to Ooty was to enjoy Gargee’s last trip from Bangalore.

We went to the following points on our tour.

It took us till evening to cover all these points. We asked our cab driver to drop us in the city market. It was a full-moon night. We checked out some shops in the market, ate our dinner at Adyar Ananda Bhavan - A2B 🍽️ and called it a night after warming up around the bonfire.

The next morning, I accompanied Meghna and Gargee to the Bus Depot. They were leaving for Wayanad. After seeing them off, I went to explore the city. I saw a watch tower. On my way, I saw scores of people entering and leaving an alleyway. My curiosity made me go into the alley, and I saw a complete change in the landscape. I was standing in a fruit and veggie market. There were multiple alleys leading to different sections of the market: fruits, veggies, meat, dry fruits, flowers, and other grocery stuff. It felt like the Diagon Alley in the Harry Potter universe.

When I came back, Mugdha and Madhvi were painting the wall art in the common area.

A little backstory. My dorm was right next to the common area. On my first day in the hostel, the staff cleared and re-painted a wall in the common room for a new wall art. Mugdha came a few days later to pencil the outline. The outline consisted of the elements of Ooty: coffee plantation, toy train, toda tribe, rose garden, and pine forest. During the weekend, Madhvi arrived. Both of them started colouring the sketch. We became friends, and I started my apprenticeship under them. Soon it became a group of seven: Mugdha, Madhvi, Anshul, Sumit, Gautam, Vani, and me.

So, I started helping them with colouring after returning from the market. The last time I held a paintbrush was in the 9th grade, more than 14 years ago. I coloured different shades of roses, coffee beans, leaves, and grasslands. I enjoyed filling up those shapes using a paintbrush. I first added a colour coat without worrying about the brush strokes. Later, I painted over it to make it consistent. I aligned the strokes with the outline to make it coherent. I experimented with multiple ways of moving my brush. Slowly, I could achieve the same effect in fewer steps and less paint. It was a very calming activity. I used to leave more delicate stuff for my teachers. By the end, both of my teachers were proud of my work. 😁

Along the way, all of us talked extensively. Both Mugdha and Madhvi are from Mumbai and friends from college. Mugdha is into art, and Madhvi likes textile design. I learnt how students learn in art schools in India. Madhvi had been working in Design Thinking for school kids. She is going to get into textile design next. Mugdha is going to experiment more with painting and colours. One morning, Mugdha introduced me to Aahatein by Agnee. (This song was at the top of my Spotify Wrapped this year.) We discussed similar songs. Madhvi showed her skills on my Ukulele. I also got to know about the Kochi-Muziris Biennale. I witnessed the traces of previous iterations of Biennale when I went to Kochi a few weeks later.

Anshul, who works on APIs for clients, was trying to explain to Madhvi and Mugdha what he does. While trying to help him explain, we learnt that many women, like Madhvi and Mugdha, enjoy having nerdy talks and would date such men. Then the topic moved to interesting or weird dates many of us have experienced.

Gautam played a ten-minute movie called Zima Blue from the Love, Death & Robots animation series. Gautam is a volunteer teacher in a nearby village. He is visiting India for a few months, following which he’ll return to the US. Sumit was on a road trip on his KTM and headed back home to Pune from Ooty. We talked about many of his road trips. Vani was on a weekend trip from Bangalore.

I enjoyed painting so much that I extended my stay by a few more days. I, unfortunately, had to say goodbye to everyone before it was complete.

From the hostel, I took the bus to Coonoor. And from Coonoor📍, I took the toy train to Mettupalayam📍.

Fortunately, I got the window seat assigned to me. The train - running on steam - passed through multiple bridges, coffee plantations, and dark tunnels opening up to beautiful valley views. The train stops at two railway stations along the way. Both the stations had an old vibe to them. On the second stop, they also fueled up the engine with water.

At the end of this journey, I took another bus to Coimbatore📍 and reached my hotel (no hostels in Coimbatore ☹️). I met with two people here: Arvind and Guhan. Arvind, whom I had met at the beginning of this post, returned to Coimbatore after leaving Ooty. Guhan is a friend I made in Hampi earlier this year. The story of Coimbatore will continue in another post.

Visualizing a GroupBy (or a Bipartite Graph)

2023-11-21T00:00:00+00:00

Have you ever needed to present the output of a GroupBy or Pivot Table?

Will you display it as a table? Not everyone can grok it. It will also take time to walk people through the table.

You could format your table with colours (conditional formatting) to show peaks and bottoms. That will work. However, it will become dense as the number of rows increases. Furthermore, this workflow involves exporting data from your respective data system (database or data lake) and importing it to Excel/Google Sheets. Thus, it is not feasible in all situations. One of those situations is what I faced.

I was doing an analysis of customer data at work. I wanted to see the distribution of cuisines in two subsequent orders. For example, the customer ordered Chinese food followed by South Indian in the next order. Because sequence matters for my analysis, Chinese to South Indian and South Indian to Chinese would be two separate rows. As you can imagine, a significant part of the GroupBy output contained these redundant pairs. It was difficult to derive any insights from it.
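To make the setup concrete, here is a minimal sketch of how such ordered cuisine pairs could be counted with pandas. The table, the column names (customer, order_ts, cuisine, next_cuisine, n_orders) and the values are made up for illustration; they are not the data from the analysis.

```python
import pandas as pd

# Hypothetical order log: one row per order.
orders = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "order_ts": [1, 2, 3, 1, 2],
    "cuisine": ["Chinese", "South Indian", "Chinese", "Chinese", "South Indian"],
})

# Pair every order with the same customer's next order.
orders = orders.sort_values(["customer", "order_ts"])
orders["next_cuisine"] = orders.groupby("customer")["cuisine"].shift(-1)

# Count ordered pairs; (Chinese, South Indian) and (South Indian, Chinese)
# end up as two separate rows because sequence matters.
pairs = (
    orders.dropna(subset=["next_cuisine"])
    .groupby(["cuisine", "next_cuisine"])
    .size()
    .reset_index(name="n_orders")
)
```

Even in this toy table, the two directions of the same cuisine pair show up as separate rows, which is what makes the real output hard to read.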

Bipartite Graphs to the Rescue

Fortunately for me, I was able to recall bipartite graphs. Bipartite graphs model the relationship between two classes of objects. For example, think about the relationship between owners and their cars. An owner can own one or more cars. An owner cannot own other owners. Similarly, a car cannot own other cars. A bipartite graph will only show a relationship between a vehicle and its owner (two different classes of objects).

It was perfect for my visualisation problem at hand!

However, generating a presentable graph turned out to be slightly roundabout. This article documents the process for my future self.

The Process

As expected, the NetworkX Python library had all the utilities available. The steps are as follows:

Get data
Define a networkx Graph.
Use bipartite_layout() to define the layout for a bipartite graph.
Draw the graph using draw().

There are more minor steps involved that we will cover during the deep dive. Since NetworkX plays well with the Matplotlib library, we have all the Matplotlib utilities available to us.
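Before the real data, here is a minimal sketch of the same steps on the owners-and-cars example from earlier. The names and edges are invented for illustration, and the drawing call is left out so that the snippet runs without Matplotlib.

```python
import networkx as nx

# Toy data: which owner owns which car (two classes of nodes).
edges = [("Asha", "Car 1"), ("Asha", "Car 2"), ("Ravi", "Car 3")]
owners = sorted({owner for owner, _ in edges})
cars = sorted({car for _, car in edges})

# Define the graph and tag each node with its class.
B = nx.Graph()
B.add_nodes_from(owners, bipartite=0)
B.add_nodes_from(cars, bipartite=1)
B.add_edges_from(edges)

# Place the owners in one column and the cars in the other.
pos = nx.bipartite_layout(B, owners)

# Sanity check: no edge connects two nodes of the same class.
is_bipartite = nx.is_bipartite(B)
```

Passing B and pos to nx.draw() would then render the two columns with the edges between them.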

I will visualise the age-wise top causes of death according to WHO.

We start with the necessary imports.

import random
import pandas as pd
import networkx as nx

from matplotlib import pyplot as plt

We have to pre-process the data for the viz.

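# Load the table (rows: causes of death, columns: age groups) and transpose it
data = pd.read_csv("male.csv").set_index("cod").T
# Prefix every cause column so that wide_to_long can find it via the "cod" stub
data.columns = ["cod_"+i for i in data.columns]
data = data.rename_axis('age_group').reset_index(drop=False)
# Reshape from wide (one column per cause) to long (one row per age group and cause)
data = pd.wide_to_long(
    data, stubnames="cod", i=['age_group'], j="cause", sep='_', suffix=r'[\w ,]+'
)
data.columns = ["percent"]
data = data.reset_index(drop=False)
# Strip the trailing character (presumably a "%" sign) and convert to a fraction
data["percent"] = data["percent"].str[:-1].astype(float)/100
# Drop the aggregate row
data = data[data.cause != "All Causes"]
data.head(2)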

The data is ready. I wanted all the edges with the same start in the same colour. So I added an integer corresponding to each class using the below code. We will use this column to get a random colour for each label with a colour map.

# colors
node_dict = {group: i for i, group in enumerate(data['age_group'].unique())}
data["node_color"] = data["age_group"].map(node_dict)
data.head(2)

So far, we have loaded the data and converted it from the wide to the long format for NetworkX. Next, we define our graph using this data.

edges = [tuple(x) for x in data[['age_group', 'cause']].values.tolist()]
B = nx.Graph()
B.add_nodes_from(data['age_group'].unique(), bipartite=0)
B.add_nodes_from(data['cause'].unique(), bipartite=1)
B.add_edges_from(edges)

Below is how we visualise the graph.

# matplotlib variables
fig, ax = plt.subplots()
fig.set_size_inches(9, 6)

# First specify the nodes we want on left or top
# create a bipartite layout
left_or_top = data['age_group'].unique()[::-1]
pos = nx.bipartite_layout(B, left_or_top, scale=10)

# Pass that layout to nx.draw
nx.draw(B, pos, node_color='#A0CBE2', edge_color="white", width=1)

We define the Matplotlib variables, use bipartite_layout to get the required layout, and draw the graph. Note that if we dropped edge_color="white", we could stop at this step: we would get a graph with equal-width, constant-colour edges and nodes. The next few steps will fix the presentation aspect of the plot.

We colour the edges first.

# define random color map - https://stackoverflow.com/a/68459848/2650427
colors_ = lambda n: list(
    map(lambda i: "#" + "%06x" % random.randint(0, 0xFFFFFF), range(n)))
colors = colors_(len(data.age_group.unique()))

# draw each edge
edge_width_dict = (
    data[['age_group', "cause", "percent"]]
    .set_index(['age_group', "cause"])
)
for node in data[['age_group', "node_color"]].drop_duplicates().values:
    edges = B.edges([node[0]])
    color = colors[node[1]]
    edge_widths = [edge_width_dict.loc[i]["percent"] for i in edges]
    nx.draw_networkx_edges(
        B,
        pos,
        edgelist=edges,
        width=edge_widths,
        edge_color=color,
    )

We iterate through all the starting nodes and their corresponding colours. We get each point and its edges and colour them the same but vary their width according to the percent column.

The last piece of configuration is the node labels and their alignment. Without this segment, all the node labels would be centre-aligned, and long labels would get truncated in the viz. I want to point out that neither the documentation nor Stack Overflow could help me here. My saviour was ChatGPT. It gave me a working example using draw_networkx_labels() that I modified as below.

# left node labels alignment
for node_name in data['age_group'].drop_duplicates().values:
    node = {node_name: node_name}
    node_pos = {node_name: pos[node_name]}
    label_pos = nx.draw_networkx_labels(
        B, node_pos, labels=node, font_size=10,
        horizontalalignment='left',
        verticalalignment="bottom"
    )

# right node labels alignment
for node_name in data['cause'].drop_duplicates().values:
    node = {node_name: node_name}
    node_pos = {node_name: pos[node_name]}
    label_pos = nx.draw_networkx_labels(
        B, node_pos, labels=node, font_size=10,
        horizontalalignment='right',
        verticalalignment="bottom"
    )

plt.show()

Our Beautiful Plots

Time to see the results.

Male children mostly die due to Infectious and parasitic diseases, Respiratory infections, Maternal conditions, Neonatal conditions, and Nutritional deficiencies. Most teen and youth deaths (15-29 years in age) happen due to injuries. As men get old, serious ailments (Birth ailments, Cancer, Cardiovascular, Respiratory, and others) become more pronounced causes of death.

Females follow a similar distribution. One notable difference is that relatively few women die due to injuries. Is that the reason women live longer than men?

The plots effectively showed the common diseases for each age group. Of course, this plot only gives a summary. And the summary is what we wanted from this viz.

Shortcomings

The plots were 90% there. Unfortunately, there are a few flaws.

While it provides me with a summary, it does not tell me the strength of the relationship. In that aspect, it is similar to pie charts. And the internet is filled with articles about why pie charts are unhelpful plots.

Another issue is the random colour assigned to each edge, together with its width. An edge may end up yellowish-green in colour. Even if its width is relatively large, the edge will still not be prominent. I re-ran my code until I got a version with acceptable colours. We could solve this by hand-selecting the colours and tuning the edge widths with a constant factor.
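A minimal sketch of that fix, assuming evenly spaced hues are an acceptable stand-in for hand-picked colours. The helper names (distinct_colors, scale_widths) and the default factor and floor are hypothetical and would need tuning against the actual plot.

```python
import colorsys


def distinct_colors(n, saturation=0.75, value=0.8):
    """Return n hex colours with evenly spaced hues and fixed saturation/value."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


def scale_widths(percents, factor=10.0, min_width=0.3):
    """Scale fractional weights to line widths, with a floor so thin edges stay visible."""
    return [max(min_width, p * factor) for p in percents]


# One colour per starting node, and widths derived from the percent column.
colors = distinct_colors(8)
widths = scale_widths([0.0, 0.05, 0.5])
```

Because the colours are deterministic, the plot looks the same on every run, and the fixed saturation keeps pale, washed-out edges away.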

Conclusion

We wanted a summary visualisation of our GroupBy (or pivot table) output. To achieve that, we converted it into a bipartite graph and rendered it using Matplotlib.

There are flaws in this visualisation. The strength of the relationship is not apparent. Additionally, edge colour and widths need tuning to make the strong relationships prominent. Fixing these issues is future work.

[Mini] How to Parse JSON in Spark without Knowing the Schema?

2023-07-08T00:00:00+00:00

Problem Statement

I have a JSON column in my DataFrame.

The JSON is in string format.
It is a nested JSON.
It is a large string.
I do not know the schema and want to avoid defining it manually.
All the JSONs follow the same schema definition.

I need to format it as a JSON object (struct) to extract anything out of it. How do I convert it into a struct?

Solution

Here is the solution if you are short on time. In the next section, I discuss it in more detail.

# Spark 3.2.1 | Scala 2.12
import pyspark.sql.functions as F

# Sample json we will work with.
sample_json = """
{
  "lvl1":  {
    "lvl2a": {
      "lvl3a":   {
        "lvl4a": "random_data",
        "lvl4b": "random_data"
      }
    },
    "lvl2b":   {
      "lvl3a":   {
        "lvl4a": "ramdom_data"
      },
      "lvl3b":  [
        {"lvl4a": "random_data"},
        {"lvl4b": "random_data"}
      ]
    }
  }
}
"""

# Spark dataframe with json column
df = spark.createDataFrame([(sample_json,)]*4, ["json_data"])

# determine the schema
json_schema = F.schema_of_json(df.select(F.col("json_data")).first()[0])

# converting json to struct
df = df.withColumn("json_data_struct", F.from_json("json_data", json_schema))

Details

We will use pyspark.sql.functions.schema_of_json to do our dirty work of determining the schema.

Just like any other column-based function, I expected this function to work on a column. So I tried this as below:

df = df.withColumn("sch", F.schema_of_json(F.col("json_data")))

It threw the below error:

AnalysisException: cannot resolve 'schema_of_json(json_data)' due to data type mismatch: The input json should be a foldable string expression and not null; however, got json_data.;
...

I did not know what a foldable string was. The data type of the json_data column was string. ChatGPT also suggested the same way of using this function. :)

The documentation and multiple Stack Overflow answers [1, 2, 3] helped me reach an explanation.

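schema_of_json needs a single literal string instead of a column: a foldable expression is one that Spark can evaluate to a constant while planning the query, which a column reference is not. So I extracted one JSON string from the column and passed it to the function. This is how I did it: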

json_string = df.select(F.col("json_data")).first()[0]
json_schema = F.schema_of_json(json_string)

The end.

Fitness Dashboard with Google Fit

2023-06-09T00:00:00+00:00

In this post, I will describe my Google Sheets dashboard, where I track all fitness-related aspects.

I like self-tracking. I want to track my productivity and find potential improvements. What gets tracked also gets measured. My interest in the Quantified self has evolved.

I started with Gamification of Life, where I assigned points to everything I did. It got too overwhelming after a year.

I am not good at maintaining relationships with friends and family. For insights, I analysed my chats’ metadata - Chatting Up and Chatting Up - Part II. It was interesting to see how my interactions changed over time with friends. I also wanted to play with the chat content, but NLP capabilities at that time weren’t enough to deal with Hinglish text.

I track my call logs and someday would like to analyse them.

Two years back, I even started making an app to track anything. You can read more about the app here:

Eventually, other commitments and curiosities caught up, and I couldn’t finish it. 😅

I also track my spending and habits. I am sure I have skipped a few more.

Tracking Fitness

Fitness was another frontier where I tried many things. Tracking was always enabled using Google Fit and Maps, but I did not do anything with the data. Google Fit analytics on the app was helpful, but I wanted more. Time was difficult to find between work, travel, and other interests. I wanted something quick and easy to develop/maintain.

Introducing: Do More With Less (DMWL). It is a trend at work where you identify and prioritize the tasks that are quick to execute with good ROI.

I googled previous work in this direction. I found this Medium article that pointed to this more helpful article doing what I wanted - Export Google Fit Daily Steps, Weight and Distance to a Google Sheet [code from the blog is available here: google-fit-to-sheets/Code.gs]. It made almost everything straightforward - setting up the app, auth, pulling and formatting data. ChatGPT made up for my lack of JavaScript knowledge when writing code in Google Sheets.

Update: 19th July

I have had to set up the credentials multiple times now, so I am adding the steps here for quick reference. In-depth instructions are on apps-script-oauth2/README.md.

1. Open the script editor by going to Extensions > Apps Script. It will open a new Apps Script project.
2. Name the project. Click the + in the Libraries section. In the Add a Library dialogue, add 1B7FSrk5Zi6L1rSxxTDgDEUsPzlukDsi4KGuTMorsTQHhGBzBkMun4iDF as the Script ID. This will find the Google OAuth2 Lib. Select the latest version and save.
3. Go to Project Properties from the file menu and make a note of the Script ID. This is the ID for our new project. We will need it later.
4. Open the Google API Console.
5. Create a new project and name it.
6. Go to Enable APIs and Services and find the Fitness API.
7. Go to Keys and create an OAuth Client ID. While creating the consent screen, only add the product name. Select “Web Application” in the application type. In the redirect URL add https://script.google.com/macros/d/{SCRIPTID}/usercallback and replace the {SCRIPTID} with the Script ID copied in step 3. Note down the client id and client secret created at the end. We will use these in our script. [You can read more on Setting up OAuth 2.0.]

Update: 28th July

The script stops working after a few days. This has happened to me twice. I get the following error:

Error: Access not granted or expired.
Service_.getAccessToken
@ Service.gs:518

I still haven’t found the solution, but I have a few leads:

I hypothesise that it is related to oauth2 details being stored in the Properties Service. The token is empty when I print the Properties.
This Properties Store has an expiry (likely 1 hour). I couldn’t find a way to update the oauth2 details before the expiration. Tried multiple ways after deleting and setting. This SO answer didn’t help either.

End of the Update

Code

In this section, I will discuss the coding involved. You can skip to the next section for the dashboard.

Here is the code flow:

Get today’s date.
Get all the specified metrics for the date. I care about the following specific events: step count, weight, heart points, and all logged activities.
- Step counts: com.google.step_count.delta
- Weight: com.google.weight.summary
- Heart Points: com.google.heart_minutes
- All logged activities: com.google.activity.segment
Set the precision of all the numbers and impute null values with zero.
Get the spreadsheet object using getActiveSpreadsheet() and append the data on the last empty row using getLastRow().
Copy the cell formatting of all the cells from the row before using the copyTo(destination, options) function.
Copy the rolling avg. formulae from the row before, again using the copyTo() function.
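The first three steps of this flow can be sketched as follows. This is a Python illustration rather than the Apps Script that runs in the sheet, and the request shape is my understanding of the body expected by the Fit REST API's dataset:aggregate endpoint, so treat the field names as assumptions to verify against the API guide. The helper names (day_bounds_millis, build_aggregate_request, clean_value) are hypothetical, and no network call is made.

```python
from datetime import date, datetime, time, timedelta, timezone

# Data types listed above, keyed by a short metric name.
DATA_TYPES = {
    "steps": "com.google.step_count.delta",
    "weight": "com.google.weight.summary",
    "heart_points": "com.google.heart_minutes",
    "activities": "com.google.activity.segment",
}


def day_bounds_millis(day, tz=timezone.utc):
    """Return the start and end of a calendar day as epoch milliseconds."""
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def build_aggregate_request(day, metric):
    """Build the request body asking for one daily bucket of one metric."""
    start_ms, end_ms = day_bounds_millis(day)
    return {
        "aggregateBy": [{"dataTypeName": DATA_TYPES[metric]}],
        "bucketByTime": {"durationMillis": end_ms - start_ms},
        "startTimeMillis": start_ms,
        "endTimeMillis": end_ms,
    }


def clean_value(value, digits=2):
    """Round a metric and impute a missing value with zero."""
    return 0 if value is None else round(value, digits)


request_body = build_aggregate_request(date(2023, 6, 9), "steps")
```

Bucketing by a full day keeps one row per date in the sheet, which is what the append-to-last-row step relies on.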

Google Fit API

It is dense! Extending it to my signals required scouring the docs and multiple SO answers.

The first helpful link was: Users.dataSources: list. It gave me all the data points I can ask for from the Fit API. The next challenge was discovering the schema and what different fields meant. After several hit-n-trials, I found the Activity Types page. It gave me the required ID for each activity.

Update: 28th July: I stumbled upon the guide to the REST API of Fit.

I hope you will find these links useful.

Next Features

There are two immediate hurdles I need to cross.

I have to authorize the app every day to call the API. A quick search told me that the token expires after an hour. I have to use a parameter called expires_in to refresh the token. Unfortunately, I could not figure out how to use it.

Similarly, I have to call the function daily (by pressing a button from the menu). I can automate it through the time-driven (clock) trigger [ref: triggers]. The problem is the token expiration. The trigger will fail the next day because the token is stale.

Fitness Dashboard

Time for the final results.

Google Fit gives you a Heart Point (HP) for each minute of activity you do. Here is how mine looks. My heart points mainly include walking, working out, and a little bit of swimming.

The red line is an aggregated line to make it easy to see the trend.

The yellow-shaded regions highlight the days I was workationing. During these days, my heart points rarely reached zero. Whereas in the non-shaded periods, I frequently hit zero. Those zeroes are my two rest days after five days of working out.

My heart points are also cyclic in nature. Whenever I am home, my only regular activity is exercising with a two-day break every week. Walking becomes an occasional affair.

Let’s look at these heart points more closely.

Can you notice the complementary nature of the two graphs? First, look at the red line and then focus on blue.

During the travel period, my HP fluctuated because of the crazy number of daily steps (a proxy for walking) and irregular/short workouts. And when I am at home, I go crazy with my exercise. During May, a few things at home led to more walking and small and irregular workouts.

That’s the end of my DMWL version of the Fitness dashboard.

Next Steps

The first stage of my dashboard is complete. I will iteratively update it to get more out of it. I discussed the tech improvements in the coding section. As a part of my fitness tracking journey, here is what I want to do.

I want to come up with some standards or thresholds for myself. It could mean saying something like reaching 30-40 HP daily or working out for fifty minutes five days a week. I do not know what these standards will look like.

I want to use this dashboard to motivate myself. It can be a tracker for fitness-related habits. It should all be automated. Consequently, I want this dashboard to help me inculcate new fitness-related habits. Towards this vision, the next step is to add more activities to the mix, namely meditation, cycling, and swimming.

I live a healthy lifestyle. Another goal of this dashboard is to observe how different decisions in my life impact my health and productivity. That will help me course correct. Some directions are:

What is the impact of my diet on my weight? For example, when does a high-calorie diet typically reflect in my physique? The current heuristic-based answer is two weeks but needs validation from data.
Sleeping hour vs the workout efficiency the next day. Does routine matter? If yes, how and where?
How does my fitness change during my travels?
Does fitness impact my productivity?

There are more, but these are the important ones in my mind.

Update: 19th July: I added meditation and cycling under the activities section. In the nutrition section, I added calories burnt and water intake. Calories burnt is Fit’s approximation. The water intake is tracked manually in the app like the activities.

Mechanisms for Data Science Projects

2023-05-12T00:00:00+00:00

Good intentions never work, you need good mechanisms to make anything happen - Jeff Bezos.

A mechanism is a process where

You create a tool;
Drive adoption of the tool;
Inspect to correct course.

Meta-Checklist for Projects

1-pager describing a map to the destination (1-7 days)
- Intent or why. Quantify the problem.
- Desired outcome. Business metric.
- Deliverable. No need for it to be detailed.
- Constraints. How not to solve the problem.
Timebox the project. Based on the timeline, design a solution that fits. Ref: timebox section.
Literature review
- It does not have to be exhaustive.
- Quickly identify approaches that have worked and build on them.
- Refer to the lit. review section.
Reviews
- Schedule once you have the results from the initial experiments.
- It helps with catching blindspots or critical errors.
- Focus points
  - Input data and features
  - Offline evaluation
  - Room for improvements.
Set up the work environment. Read: How to Set Up a Python Project For Automation and Collaboration
Consistent documentation during the project.
- Document whatever is not in the code.
- Create documentation like an applied research paper: motivation, lit review, data, methodology, results, and next steps.
- It helps with replication.
- Read more here: Why You Need to Follow Up After Your Data Science Project.
Have informal stand-ups with the team:
- Share unusual findings; discuss ideas; get help on bugs; ask for reviews; etc.
Regular stakeholder communication
- Check in regularly with them.
- It ensures that the deliverable aligns with the overall goals.
- It is also a source of feedback and clever suggestions.
Read more here:
- What I Do Before a Data Science Project to Ensure Success
- What I Do During A Data Science Project To Deliver Success

Timeboxing Projects

It makes you focus on the most crucial tasks.
Timebox: stretch goals wrt the project.
Estimate: Upper bound of effort needed.
Rule of thumb to go from timebox to estimate: multiply the timebox by 1.5 to 3.0.
Most aggressive timebox: halve the time spent on a similar project. Create an MVP. Quick iteration cycles. Intense.
Comfortable-yet-challenging timebox: reduce the time by 10-20%. Good default.
Standard timebox: for open-ended projects. 2 weeks lit. review, 4-8 weeks for prototype building, and 3-6 months for production.
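
The multipliers above can be turned into a quick calculation. This is a minimal sketch: the function names and the 40-day reference project are hypothetical, chosen only to illustrate the arithmetic.

```python
def timebox_options(similar_project_days: float) -> dict:
    """Candidate timeboxes (in days) derived from a similar past project."""
    return {
        # Most aggressive: halve the time spent on the similar project.
        "aggressive": similar_project_days * 0.5,
        # Comfortable-yet-challenging: reduce the time by 20% to 10%.
        "comfortable": (similar_project_days * 0.8, similar_project_days * 0.9),
    }


def estimate_range(timebox_days: float) -> tuple:
    """Upper bound of effort: the timebox multiplied by 1.5 to 3.0."""
    return (timebox_days * 1.5, timebox_days * 3.0)


options = timebox_options(40)                  # hypothetical 40-day reference project
print(options["aggressive"])                   # 20.0
print(estimate_range(options["aggressive"]))   # (30.0, 60.0)
```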

Executing Projects

Mechanism to execute projects with high confidence.

Pilot and copilot for each project.
Pilot: main project owner.
- Responsible for success/failure
- Own and delegate as required.
Copilot: helps the pilot stay on track, identify critical flaws, and call out blindspots.
- Periodic check-ins
- Reviews document drafts and prototypes
- Mandatory code reviewer
Copilot has (more) experience in the problem space.
Copilot spends 10% of the pilot’s effort.

Literature Review

Always start the project with a literature review.
Read papers relevant to the problem.
Start with applied research: applied-ml.
Reviewing papers for problem understanding
- Formulation
  
  Classification, regression, or something else?
- Data processing
  
  How was data excluded, preprocessed, and rebalanced? How were labels defined? Was a third neutral class added? How were labels augmented, perhaps via hard mining?
- Evaluation process
  
  How was the training and validation set created? What offline evaluation metrics did they use? How did they improve the correlation between offline and online evaluation metrics?
How to go through each paper is discussed in the next section.

3-Pass Approach for Reading Papers

For single paper

Scan the abstract and conclusion to understand if the paper is useful. If it is, then skim through the headings to identify the problem statement, methods, and results.
In the 2nd pass, highlight the relevant sections. Helps in quickly spotting the important bits later. Take notes. For most of the papers, 2nd pass is enough.
Do a 3rd pass to cement the knowledge.

For multiple papers from the same domain

Do 1st and 2nd passes on each paper.
In the 3rd pass, consolidate common concepts across papers into a single note and compare the pros and cons. Doing this helps identify gaps in my knowledge. If there are gaps, then revisit the paper.

Find more here: How Reading Papers Helps You Be a More Effective Data Scientist.

Collaboration and Standard Practices

Create shared libraries for oft-used data operations.
- It encourages the team to contribute and thus leads to collaboration and code reviews.
- It nudges people towards a team mindset.
Have a single repo with training, evaluation, and inference code in one place.
- Everybody works and reviews the same code.
- It helps in knowledge sharing.
- It also slows down the speed, but the pros outweigh the cons.
Read more: Data scientists work alone and that’s bad.

I will keep updating/adding to this list as I read/experiment more.

Belagavi or Belgaum 📍

2023-01-16T00:00:00+00:00

Entering Belagavi

My first interaction in the city: ₹250 for an auto for 2 KMs. That’s just robbery! He came down to ₹100 after I told him off. Then found another auto guy who said ₹80. (This was also high, but I was tired.)

Belagavi is the (proposed) 2nd capital of Karnataka. It is a controversial city: both Maharashtra (Maha) and Karnataka want it within their borders. I don’t understand this border dispute. 😞

Update: A journalist friend, Sarayu, gave me a few pointers here. Language is the reason for this fight. Most of North Karnataka up to Goa knows Hindi, thanks to Nizam rule and the districts that border MP and Maha. Hindi becomes even more important because Bombay is closer than Bangalore. On top of that, this region is strategically important because it is geographically close to upcoming districts like Karwar. There is also a spurt of educational institutions like IIT Dharwad/Raichur coming up.

Language. Almost everyone understood Hindi. Everyone knew Kannada and Marathi.

Update: Belgaum is not a prosperous region and is politically weak. Sarayu mentioned that since Mumbai is nearby, Belagavis go there for work. So, they learn Hindi and Marathi. This also means that the youth leave for work in other cities and mostly the older generations remain in Belgaum.

Payments. After using cash everywhere in Maha, I was surprised that, barring a few auto drivers, UPI was widely accepted in Belgaum. It looks like Belgaum is on its way to becoming a smart city.

Stay. No hostels. Multiple hotels are available, but all average or below average. I chose a hotel in the market area.

Food. The first thing was Belagavi Biryani. I tried it at a famous chain called Niyaaz Restaurant (Main Branch) 📍. I’d rate it 3.5/5. A friend (met her later during the week) mentioned that Niyaaz is over-rated.

The second thing was a dessert called Belagavi Kunda. I loved it! All the sweet shops sell Kunda. You can buy 100 grams of it for ₹20 or ₹25.

Last thing was to try the Maharashtrian thali available at indie restaurants.

Tourist Points. There are two points in the Belagavi Fort area: Kamala Basadi 📍 and Safa Masjid 📍.

Story time. After wrapping up from work, I left for the Fort area. I reached my destination after walking 3 KMs and found that it was an army area. No one stopped me from going inside. I went to the mosque. Unfortunately for me, it was inside the Army protected area. My phone said 9 PM. The guards at the entry started questioning me. They let me go only when they were satisfied that I wasn’t a bad element. Of course I couldn’t see the masjid. Apparently, it is open to the general public only during Bakrid. Afterwards, while coming out of the main area where no one stopped me while entering, another army personnel stopped me. Same barrage of questions came my way. On top of that, he checked my bag and photographed my ID. It turns out that that area becomes a no-movement zone after 9 PM. Next day I visited the Kamala Basadi and ended my tourist mode. 😅

Vibe. Most of my interaction was during transactions: shop owners, hotel folks, or auto drivers. I almost never got a polite response from anyone. They were not rude either. There was no warmth in the interaction. No one smiled. Just plain dry transaction. Belagavi is the first city where I felt like that.

Story time. When I went to Kamala Basadi after the night incident, I went by an auto. I was already apprehensive of the auto guys from my 1st night. This guy turned out to be the opposite. Abdul Rashid Shaikh and I talked about my reasons for coming to the town. The ₹250 story came up. He cursed all these auto guys. And the discussion (mostly me listening) went to good and bad deeds. He then took me to the Basadi and the Masjid (only to be denied entry by the army person). He then dropped me back at the hotel. All within ₹200. I would have paid more if I had done these trips separately. This guy also suggested skipping a few cities from my itinerary. If you are in Belagavi, call for him on +91-8880866313.

I usually stay at hostels. Staying at a hotel was an experiment. Sadly, it was a failure. As a solo traveler, staying at hostels is more enjoyable than at a hotel. And usually hostels are in the cities with multiple points to explore or with good vibes. Belagavi had none. Thus, I decided to go to the cities with hostels.

Next stop: Hampi.

[Summary] Monolith: Real-Time RecSys With Collisionless Embeddings

2022-10-31T00:00:00+00:00

Paper link: Monolith: Real Time Recommendation System With Collisionless Embedding Table

Abstract

Real-time RecSys are important when customer feedback is time sensitive (eg: TikTok short-video ranking).
The production-scale DL frameworks (PyTorch, TensorFlow) are designed with separate batch-training and model serving stages. This makes online training difficult.
Presenting Monolith for online training:
- Collisionless embeddings with expiry parameter and frequency filtering to reduce memory footprint
- Online training architecture with fault-tolerance in parameter server
Part of BytePlus Recommend.

Data in RecSys

For many businesses driven by RecSys, better CX = real-time RecSys.
Information from a user’s latest interaction becomes the primary input, as it’s the best signal of a user’s future interest and behavior.
DL in RecSys
DL in industry RecSys faces problems because of the real-world data.
Data is different from CV/NLP tasks:
- Features are mostly sparse, categorical, and dynamically changing.
- Concept Drift: Training data distribution is non-stationary. (ref: A survey on concept drift adaptation, 2014)

Sparsity and Dynamism

RecSys data has a lot of categoricals (eg: customer id, item id, item type, etc)
Categorical features are sparse (eg: a user only buys limited items).
Feature engineering of categorical features: map them to a high-dimensional embedding space.
Issues with embeddings for categoricals:
- The number of users and items is orders of magnitude larger than the number of word-piece tokens in LMs. Such an enormous embedding table would hardly fit in memory.
- As more users and items are added, the size would increase further.
Current solution: Low-collision hashing to reduce the memory footprint and to allow the growing of IDs (user or item)
- Assumptions:
  - IDs in the embedding table are distributed evenly in frequency. This is rarely true because only a small group of users or items have high frequency.
  - Collisions are harmless to model output. In practice they are detrimental because organic growth in the number of IDs leads to more collisions.
- Ref: Core Modeling at Instagram
- Ref: Deep Neural Networks for YouTube Recommendation, 2016
Thus, there is a natural and constant demand to elastically adjust the set of users and items that a RecSys keeps track of.
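
To make the collision problem concrete, here is a minimal sketch of hashing IDs into a fixed number of embedding rows. It is not the hashing scheme from the paper or from any framework: the `bucket` helper, the MD5 choice, and the table size are assumptions for illustration only.

```python
import hashlib


def bucket(raw_id: str, table_size: int) -> int:
    """Deterministically hash an ID into one of `table_size` embedding rows."""
    digest = hashlib.md5(raw_id.encode()).hexdigest()
    return int(digest, 16) % table_size


def count_collisions(ids, table_size):
    """Number of IDs that land in a row already taken by an earlier ID."""
    seen = set()
    collisions = 0
    for raw_id in ids:
        row = bucket(raw_id, table_size)
        if row in seen:
            collisions += 1
        seen.add(row)
    return collisions


TABLE_SIZE = 1_000  # fixed number of rows (hypothetical)
small = [f"user_{i}" for i in range(500)]
large = [f"user_{i}" for i in range(5_000)]

print(count_collisions(small, TABLE_SIZE))
print(count_collisions(large, TABLE_SIZE))  # at least 4,000 by the pigeonhole principle

# A key-value table keyed by the raw ID gives every ID its own row instead.
collisionless = {raw_id: row for row, raw_id in enumerate(large)}
print(len(collisionless))  # 5000
```

As the ID set grows past the fixed table size, collisions become unavoidable, whereas the key-value table simply grows with the IDs.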

Concept Drift

Underlying user distribution is non-stationary: user interests change with time (even during sessions).
More recent data is more likely to predict changes in a user’s behavior.
Mitigating concept drift: serving model should be updated as close to real-time as possible to reflect the latest user interests.

Parameter Server (PS)

Worker machines compute the gradients.
PS machines store parameters and update them according to the gradients.
Two kinds: 1) training PS; and 2) serving PS. Training PS holds training parameters. Once training is complete, it is synced to Serving PS.
Two types of parameters:
1. Dense: weights/variables in DNN; and
2. Sparse: embedding tables corresponding to sparse (categorical) features.
Since both dense and sparse features are part of the TensorFlow Graph, Monolith stores them on the PS.
The tf.Variable is for dense variables. For sparse variables, authors created HashTable operations.
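
The worker/PS split can be sketched in a few lines of plain Python. This is a toy, single-process illustration and not Monolith's implementation: the class name, the `pull`/`push` methods, and the plain SGD update are all assumptions.

```python
class ToyParameterServer:
    """Stores dense and sparse parameters and applies gradients sent by workers."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.dense = {}   # variable name -> list of floats (DNN weights)
        self.sparse = {}  # ID -> embedding vector (embedding table rows)

    def pull(self, dense_names, sparse_ids, dim=2):
        """Return current parameters, creating zero vectors for unseen keys."""
        dense = {n: self.dense.setdefault(n, [0.0] * dim) for n in dense_names}
        sparse = {i: self.sparse.setdefault(i, [0.0] * dim) for i in sparse_ids}
        return dense, sparse

    def push(self, dense_grads, sparse_grads):
        """Apply one SGD step per key using the gradients computed by a worker."""
        for store, grads in ((self.dense, dense_grads), (self.sparse, sparse_grads)):
            for key, grad in grads.items():
                store[key] = [w - self.lr * g for w, g in zip(store[key], grad)]


ps = ToyParameterServer(lr=0.1)
dense, sparse = ps.pull(["w"], ["user_42"])                   # worker requests parameters
ps.push({"w": [1.0, -1.0]}, {"user_42": [0.5, 0.5]})          # worker sends made-up gradients
print(ps.dense["w"], ps.sparse["user_42"])
```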

Hash Table

Representation of embeddings in TensorFlow and its limitation
- The tf.Embedding layer uses variables to represent the dense embedding vectors. (the embedding matrix is of type Variable.)
- The Variable construct freezes the shape of the matrix throughout the training/serving process. Thus, it is a fixed-size embedding.
- As IDs increase with time, since the table size is fixed, ID collisions (while updating/using the dense embedding vector) would increase.
Authors implemented a new key-value HashTable.
- Hashing algorithm: Cuckoo hashing[Visualization + Explanation - Youtube]
- Lookup: O(1); Insertion: amortized O(1)
- Implemented as a TensorFlow resource operation (it likely means a TensorFlow custom layer).
- Lookups and updates are implemented as native TF operations.
Naive insertion: insert every new ID in the HashTable. Will deplete memory quickly.
Insertion by frequency
- IDs (user, item, etc) have long-tail distribution.
- Infrequent IDs will have underfit embeddings because of less training data.
- Model quality will not suffer from removal of these IDs.
- Filter by a threshold of occurrences before insertion.
- The threshold is a tunable hyperparameter for each model.
- Also use a probabilistic filter (didn’t expand on it)
Insertion by staleness
- Many IDs are never visited (user inactive, out-of-date item)
- Set an expiry time for each ID.
- The expiry time is tunable for each embedding table: different tables will have different sensitivity to historical information.
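
The two memory-saving rules above (insert only after a threshold of occurrences, evict after an expiry time) can be sketched with a plain dictionary. A Python `dict` stands in for the Cuckoo hash table here, and the class name, the exact counter, and the integer clock are assumptions; the paper's probabilistic filter is not reproduced.

```python
class FilteredEmbeddingTable:
    """Key-value embedding table with frequency filtering and expiry."""

    def __init__(self, dim, min_count, ttl):
        self.dim = dim
        self.min_count = min_count  # occurrences required before insertion
        self.ttl = ttl              # time units an ID may stay unvisited
        self.counts = {}            # ID -> occurrences seen so far
        self.rows = {}              # ID -> embedding vector
        self.last_seen = {}         # ID -> time of the last visit

    def lookup(self, key, now):
        """Return the embedding for `key`, or None while it is still filtered out."""
        if key in self.rows:
            self.last_seen[key] = now
            return self.rows[key]
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] < self.min_count:
            return None
        # Threshold reached: admit the ID into the table.
        del self.counts[key]
        self.rows[key] = [0.0] * self.dim
        self.last_seen[key] = now
        return self.rows[key]

    def evict_stale(self, now):
        """Remove IDs that were not visited within the last `ttl` time units."""
        stale = [k for k, t in self.last_seen.items() if now - t > self.ttl]
        for k in stale:
            del self.rows[k]
            del self.last_seen[k]
        return stale


table = FilteredEmbeddingTable(dim=4, min_count=3, ttl=10)
first = table.lookup("item_1", now=0)    # None: seen once
second = table.lookup("item_1", now=1)   # None: seen twice
third = table.lookup("item_1", now=2)    # admitted on the third occurrence
print(first, second, third)
print(table.evict_stale(now=5))          # []
print(table.evict_stale(now=20))         # ['item_1']
```

Note that the exact counter in this sketch also costs memory for rare IDs, which is presumably why the paper mentions a probabilistic filter as well.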

Model Training

Training Engine in Monolith

Engineering steps
1. User logs (click, like, buy) go to Kafka.
2. Model features are present in another Kafka queue (didn’t discuss what features)
3. Create the training example by joining the features with user logs using a Flink job.
  - First, check for the data in in-memory cache;
  - If not found, then go to on-disk key-value storage (happens in cases when user feedback arrives after days and in-memory cache is cleared to free-up the memory)
4. Push the created training example to a 3rd Kafka queue.
5. Push data from the 3rd queue to HDFS for offline training mode.
6. Trigger online or offline training
7. Push the updated parameters to the Training PS
8. Sync the Serving PS with the Training PS
Batch training stage
- Ordinary TF training loop
  1. Training worker reads a mini-batch from storage.
  2. Request parameters from PS.
  3. Compute a forward and backward pass.
  4. Push the updated parameters to training PS.
- Only train for a single pass over the data. (to mimic the online training phase?)
- Useful when: the model architecture is modified and requires retraining.
Online training stage
- Triggered when the model is online.
- Steps:
  1. Training worker consumes real-time data from a Kafka queue.
  2. Update the parameters in the training PS.
  3. The training PS periodically syncs the updated parameters to the serving PS.
Negative sampling
- To handle the highly skewed negative to positive sample ratio.
- It changes the underlying distribution of the trained model: higher probability of making positive predictions.
- Apply log-odds correction during serving to ensure the online model is an unbiased estimator of the OG distribution. (ref: Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data, 2021)
Parameter sync. between training and serving PS
- Production models are TB in size.
- Replacing all the parameters will take time.
- It will also consume network bandwidth and extra storage (need to store the new parameters before replacing the old ones).
- Solution: incremental periodic parameter sync.
  1. Sparse features (aka embedding tables): Sync the keys whose vectors updated during the last 1 minute.
  2. Dense variables (aka model weights): model weights move much slower because the momentum-based optimisers take more time to build momentum over the big data. Thus the sync frequency is 1-day. The authors found the stale weights tolerable.
Fault tolerance: periodic model snapshots
- Trade-off between: model quality (because of the loss of recent updates) and computation overhead (copy-pasting TB of data)
- Snapshot frequency: 1-day. Experiments revealed that performance degradation was tolerable.
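
The incremental sync of sparse parameters can be sketched as tracking which keys changed since the last sync and shipping only those. This is a minimal in-memory sketch: the class and method names are hypothetical, and the real system moves data over the network on a timer.

```python
class TrainingPS:
    """Holds embedding rows and remembers which ones changed since the last sync."""

    def __init__(self):
        self.sparse = {}      # ID -> embedding vector
        self.touched = set()  # IDs updated since the last sync

    def update(self, key, vector):
        self.sparse[key] = vector
        self.touched.add(key)

    def drain_updates(self):
        """Return only the rows changed since the previous call."""
        delta = {k: self.sparse[k] for k in self.touched}
        self.touched.clear()
        return delta


class ServingPS:
    """Serves embeddings; receives partial updates instead of full copies."""

    def __init__(self):
        self.sparse = {}

    def apply(self, delta):
        self.sparse.update(delta)


training, serving = TrainingPS(), ServingPS()
training.update("u1", [0.1])
training.update("u2", [0.2])
serving.apply(training.drain_updates())   # first sync ships both rows

training.update("u1", [0.3])
delta = training.drain_updates()          # second sync ships only "u1"
serving.apply(delta)
print(delta, serving.sparse)
```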

Evaluation

(I will skip the experiment setup and jump over to the results.)

The Effect of Embedding Collision

Models with collisionless embedding vectors consistently outperform the ones with collisions.
Independent of training epochs and concept drift (non-stationary training data)

Online Training vs Batch Training

Online training vs Batch training on Criteo dataset.

Different sync intervals for online training

Online training has better performance than the batch training.
- AUC of online training models: evaluated by the following shard of data.
- AUC of batch training models: evaluated by each shard of data (?)
- General AUC delta ranged from 0.20 (5hr interval) to 0.40 (30 min interval).
Smaller parameter sync interval (or higher parameter sync freq.) performs better than the larger intervals.
Based on these results, best sync frequency for sparse features that the systems could endure was 1 minute.
- Assuming 100,000 IDs with 1024 vector size are updated each minute: ~400 MB (4 KB * 100,000) network transfer per minute.
Sync frequency for dense features is 1-day (every midnight) as they update slowly.
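
The bandwidth figure above can be checked with a few lines of arithmetic. The sketch assumes 4-byte (float32) values, which is what the 4 KB per 1024-dimensional vector implies; the function name is hypothetical.

```python
def sync_payload_bytes(num_ids: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to ship `num_ids` embedding vectors of size `dim`."""
    return num_ids * dim * bytes_per_value


per_id = sync_payload_bytes(1, 1024)            # 4096 bytes, i.e. 4 KB per ID
per_minute = sync_payload_bytes(100_000, 1024)  # payload of one minute-level sync

print(per_id)                          # 4096
print(per_minute / 1e6)                # 409.6 (MB per minute)
print(per_minute * 60 * 24 / 1e9)      # GB per day if every minute looks like this
```

So the "~400 MB per minute" is 409.6 MB in decimal units, before counting the keys themselves or any protocol overhead.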

Is the 2nd figure correct? The 5hr sync interval model should degrade till the sync happens. After sync, it should have similar AUC as other models. It should then degrade again from that point until the next sync. That is not happening here. What am I missing?

PS Reliability

Hypothesis: minute-level parameter syncing should mean frequent snapshots.
Wrong. Observed no loss in the model quality even with 1-day snapshot interval.
Below excerpt explains the reason:
Lesson: don’t take frequent snapshots and save resources.

Summary Conclusion

The paper proposed the following:
1. Cuckoo HashMap based collisionless embedding tables
2. Online training and parameter sync architecture
With extensive experiments (both offline and online) they showed that:
- Collisionless embedding table has a positive impact on the model quality (AUC gains ranged from 0.20% to 0.40%)
- Online training performs better than batch training in RecSys setting.
- Higher parameter sync freq is better (1 minute for prod systems); and
- It is okay to have a smaller parameter snapshotting frequency (1-day for prod systems)
This paper was a write-up of engineering tricks that ByteDance employed to build their RecSys.
The few nuggets of ML that I noticed:
- Apply log-odds correction to the data in online serving to make up for negative sampling.
- Online real-time model at ByteDance is a multi-tower architecture where each tower is responsible for learning a special kind of user behavior. (Is that an allusion to multi-objective ranking through different towers?)
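
The log-odds correction can be illustrated for the simplest case. The sketch assumes negatives were downsampled uniformly with a known keep rate; the cited reference covers nonuniform sampling, which this does not implement, and the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def corrected_probability(model_logit: float, negative_keep_rate: float) -> float:
    """Undo uniform negative downsampling at serving time.

    Keeping a fraction r of negatives inflates the odds by 1/r,
    so adding log(r) to the logit restores the original odds.
    """
    return sigmoid(model_logit + math.log(negative_keep_rate))


# Hypothetical numbers: true positive rate of 1%, and 10% of negatives kept.
true_p, keep_rate = 0.01, 0.1
sampled_odds = true_p / ((1 - true_p) * keep_rate)
model_logit = math.log(sampled_odds)      # what a perfectly fit model would output

print(sigmoid(model_logit))                           # inflated probability on sampled data
print(corrected_probability(model_logit, keep_rate))  # recovers the true 1% rate
```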

Overall, this paper adds to my belief that a successful system requires clever engineering.

Learning How to be a Mentor

2022-10-13T00:00:00+00:00

Some people are born coaches. I am not one of them.

One does not get mentoring lessons at school. There is not enough time to read books on effective coaching. My only guides have been two specific rules.

First is thinking about how I would have wanted my mentor to coach me. And then coach my mentee the same way.

The second is to observe my current mentors. Notice the techniques that enable my growth. Inculcate these methods in my mental model. Similarly, spot where they are ineffective and learn to avoid them.

The Flaw

Recently, my boss inadvertently showed me a flaw in my first rule.

Good design is thorough down to the last detail. Nothing must be arbitrary or left to chance. Care and accuracy in the design process show respect towards the consumer.

- Dieter Rams’ 8th principle of Good Design.

That embodies my personality. I believe that my work should be proper (read, perfect). And, at times, I become rigid about maintaining that standard. Thanks to this, my output is usually of good quality (not bragging 😅). That satisfies me. The gratification keeps me intrinsically motivated. Thus, I continue to work like this.

My rule assumes how I would have wanted my mentor to coach me. That means that I presume my coach to have equally high standards. It is a flaw. Since everyone is different, my way of operating does not work for everyone.

Patience is a Virtue

I also have a patience problem. When I think I can do something faster and the other person takes more time, then that annoys me. I have worked on this quite a lot in the last few years. But there is room for improvement.

What is Next?

My overarching goal is to be a good leader. I believe a good leader is also an effective mentor and coach. So, I am actively going to make myself a good mentor.

The following are the next steps for me:

Stop judging by the yardstick of “is this how I would have done it?”
It is okay if things are not how I thought they would be. If it is 80% there, it is good enough.
If there are areas of refinement, then definitely point them out.
Stop thinking that I could have done it quickly. Get comfortable with others being slow/fast.

Let’s see how it goes.

[Summary] Deep Recurrent Neural Networks for OYO Hotels Recommendation

2022-10-09T00:00:00+00:00

Paper link: Deep Recurrent Neural Networks for OYO Hotels Recommendation

Abstract

A hybrid model with two parts:
1. Embedding generation: generate implicit embeddings of properties.
2. Deep prediction and ranking model.
The model performed well over the existing collab-filtering model.

Situation/Context

OYO’s current recommendation system
- Graph-based Collaborative filtering model
- Optimised on browsing data as user feedback
- Objective: CTR
DL provides an opportunity to improve the system.

Lit Review

Conventional RecSys algos:
1. Collab-filtering,
2. Content-based, and
3. Hybrid
YouTube’s 2016 paper has demonstrated that DL-based RecSys can give SOTA results on high-volume data.
MF only considers the linear combination of user and item latent vectors, whereas DL can capture non-linear user-item relationships.
DL reduces the feature engineering efforts.
RNNs capture the temporal behaviour of user-item interactions: useful for session-based sequential recommendations. Conventional algos don’t capture this.
Research on modelling user behaviour sequences using LSTM or GRUs
This Airbnb paper (summary) takes the sequence of listing ids clicked by the users and trains a skip-gram word2vec model on it. And then rank using these embeddings.
The authors of this paper mention that they improve it by adding entity features along with click data.

Methodology

Embedding generation: generates embeddings of the hotels (intermediate output of the next step).
Prediction and ranking model: gets top-n recommendations based on the following inputs:
- The sequence of browsed hotels
- Embeddings of the browsed hotels
- Rating tokens of the browsed hotels
- Realisation tokens of the browsed hotels

What was the candidate list of hotels? High-rated hotels?

Embedding Gen

Explicit feedback requires effort from the customers; hence, ratings are sparse.
Browsing data as user’s implicit feedback; thus, no sparsity.
In this work, implicit features were derived using an RNN.
- Embeddings were the intermediate output of the model training process.

Prediction and Ranking Model

Objective: realised bookings (conversion along with the realization of bookings)
Implemented the following four methods: RNN, GRU, LSTM, and BiLSTM.
Training data:
- 1 million users
- Sequences of their clicked hotels within a session
Pre-processing: padded and limited to 15 hotels.
Model objective: the probability of a realised booking by the user at high-rated hotels.
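
The pre-processing step can be sketched as follows. The notes only say that sequences were padded and limited to 15 hotels, so keeping the most recent clicks, padding on the left, and using 0 as the padding ID are all assumptions.

```python
PAD_ID = 0    # assumed padding token; real hotel IDs would start at 1
MAX_LEN = 15  # sequence length from the notes


def pad_or_truncate(session, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return a fixed-length sequence of hotel IDs for one session."""
    recent = session[-max_len:]                        # keep the most recent clicks
    return [pad_id] * (max_len - len(recent)) + recent  # left-pad short sessions


short_session = [3, 7, 9]
long_session = list(range(1, 21))   # 20 clicked hotels

print(pad_or_truncate(short_session))
print(pad_or_truncate(long_session))
```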

Proposed architecture (disclaimer: I couldn’t grok it from the paper)

Embedding layer: 100 dim

 torch.nn.Embedding(num_embeddings=all_hotels, embedding_dim=100)

Embedding concat layer (torch.cat())
- Not sure why they concatenated the embeddings. The embedding tensor should have been the input to the RNN layer. Otherwise, no recurrence will happen.

2 BiLSTM layers

 torch.nn.LSTM(
     input_size=1530,
     hidden_size=512,
     num_layers=2,
     bidirectional=True,
 )

Flatten layer (torch.nn.Flatten())

4 ReLU dense layers

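 # Each dense layer is a Linear transform followed by a ReLU activation.
 l1 = torch.nn.Sequential(torch.nn.Linear(512*2, 512), torch.nn.ReLU())
 l2 = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU())
 l3 = torch.nn.Sequential(torch.nn.Linear(256, 128), torch.nn.ReLU())
 l4 = torch.nn.Sequential(torch.nn.Linear(128, 1), torch.nn.ReLU())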

Softmax layer
Output layer

Embedding Evaluation

Embedding dimension: 100
Get the top 10 similar hotels for all the hotels in the training dataset using cosine similarity.
Four accuracy metrics:
1. Location
2. Distance
3. Price
4. Ratings
Metric formulation
\[\text{Sim Index @ x} = \frac{\sum_{i=1}^{H} \text{sim@x}(\text{top-10 hotels}, i)}{H}\]
- \(x\) can be any of the following: Location, Distance, Price, Ratings
- \(H\) is the number of query hotels;
- \(\text{sim@x}(\text{top-10 hotels}, i)\) is the similarity score for metric \(x\).
- Ranges between 0 and 10.
\(\text{sim@Location}\): fraction of top-10 hotels lying in the same city as the query hotel \(i\).
\(\text{sim@Distance}\): fraction of top-10 hotels that are within a 20km radius of the query hotel \(i\).
\(\text{sim@Price}\): fraction of top-10 hotels that are within +/-15% of the price of the query hotel \(i\).
\(\text{sim@Ratings}\): fraction of top-10 hotels that are +/-1 rating from the query hotel \(i\).
Following are the evaluation results with the winner highlighted:
Qualitative eval also yielded positive results.
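
The embedding evaluation can be sketched end to end on toy data. This uses the fraction form of \(\text{sim@x}\), so scores fall between 0 and 1 (multiply by the number of neighbours for a count). The embeddings, city labels, and function names are made up for illustration, and the sketch assumes at least two hotels.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query, embeddings, k=10):
    """IDs of the k hotels closest to `query` by cosine similarity."""
    scores = [(cosine(embeddings[query], vec), hotel)
              for hotel, vec in embeddings.items() if hotel != query]
    scores.sort(reverse=True)
    return [hotel for _, hotel in scores[:k]]


def sim_index(embeddings, is_similar, k=10):
    """Average over query hotels of the fraction of top-k neighbours passing `is_similar`."""
    total = 0.0
    for query in embeddings:
        neighbours = top_k_similar(query, embeddings, k)
        total += sum(is_similar(query, n) for n in neighbours) / len(neighbours)
    return total / len(embeddings)


# Toy data: two hotels per city, with embeddings that cluster by city.
embeddings = {"h1": [1.0, 0.0], "h2": [0.9, 0.1], "h3": [0.0, 1.0], "h4": [0.1, 0.9]}
city = {"h1": "Goa", "h2": "Goa", "h3": "Hampi", "h4": "Hampi"}


def same_city(a, b):
    return city[a] == city[b]


print(sim_index(embeddings, same_city, k=1))  # 1.0: every nearest neighbour shares the city
print(sim_index(embeddings, same_city, k=2))  # 0.5: the second neighbour is always from the other city
```

The other three metrics follow by swapping `same_city` for a distance, price, or rating predicate.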

Ranking Model Evaluation

Offline evaluation metric: Hit Ratio, MRR
\(\text{Precision@k}\) or Hit Ratio: fraction of users for which the booked hotel was among the top-k recommendations.
\[\text{Precision@k} = \frac{U_{hit}^k}{U_{all}}\]
15 total model variants:
- 3 variants with basic RNN
- 4 variants each with LSTM, GRU, and BiLSTM
Selected one variant from each model type based on validation results.
Created a dataset aligned with the real-time environment. (Session logs?)
Out-of-time validation on this dataset.
The BiLSTM variant was the best-performing model.
Online evaluation metrics:
- Realized bookings at high-rated hotels.
- C*R (multiplication of booking conversion and realization of bookings) at high-rated hotels.
Observed lifts of 3% to 6% in realized hotel bookings across different geographies.
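
The two offline metrics can be sketched on toy data. Hit Ratio follows the formula above; MRR uses the standard mean-reciprocal-rank definition, which the notes name but do not spell out. The users, hotels, and function names are made up.

```python
def hit_ratio_at_k(recommendations, booked, k):
    """Fraction of users whose booked hotel is among their top-k recommendations."""
    hits = sum(1 for user, recs in recommendations.items() if booked[user] in recs[:k])
    return hits / len(recommendations)


def mean_reciprocal_rank(recommendations, booked):
    """Average of 1/rank of the booked hotel; a miss contributes 0."""
    total = 0.0
    for user, recs in recommendations.items():
        if booked[user] in recs:
            total += 1.0 / (recs.index(booked[user]) + 1)
    return total / len(recommendations)


recommendations = {
    "u1": ["a", "b", "c"],
    "u2": ["b", "c", "a"],
    "u3": ["c", "a", "b"],
}
booked = {"u1": "a", "u2": "a", "u3": "z"}  # u3 booked a hotel that was never recommended

print(hit_ratio_at_k(recommendations, booked, k=1))   # 1 of 3 users
print(hit_ratio_at_k(recommendations, booked, k=3))   # 2 of 3 users
print(mean_reciprocal_rank(recommendations, booked))  # (1 + 1/3 + 0) / 3
```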

Review Conclusion

The paper proposed building a DL model with two parts: embedding gen and ranking model.
The embeddings are the intermediate output of the ranking model. Not sure why it is called a separate model in the paper.
The model is an important part of this paper, yet
- It does not discuss the training data construction in detail.
- A few missing details about the architecture made it difficult to comprehend.
- There was no discussion about inferencing and the candidate set of hotels to rank.
The embedding evaluation framework was comprehensive and quantified the effectiveness of the embeddings.
Model evaluation methodology followed the standard process of train-time validation and out-of-time validation steps.
One thing lacking was comparison with tree-based models like gradient boosted trees which have shown good performance in recommendation tasks in both industry and research.