<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://trigonaminima.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://trigonaminima.github.io/" rel="alternate" type="text/html" /><updated>2026-04-23T15:25:21+00:00</updated><id>https://trigonaminima.github.io/feed.xml</id><title type="html">Playground</title><subtitle>Notes on engineering, personal finance, and everyday curiosity — by Shivam Rana</subtitle><author><name>Shivam Rana</name></author><entry><title type="html">Move US Stocks from INDMoney to Interactive Brokers</title><link href="https://trigonaminima.github.io/2026/03/indmoney-to-ibkr/" rel="alternate" type="text/html" title="Move US Stocks from INDMoney to Interactive Brokers" /><published>2026-03-08T00:00:00+00:00</published><updated>2026-03-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2026/03/indmoney-to-ibkr</id><content type="html" xml:base="https://trigonaminima.github.io/2026/03/indmoney-to-ibkr/"><![CDATA[<p>I wanted to move my US stocks (securities) from INDMoney to Interactive Brokers. I didn’t find any clear steps documented anywhere. It took me multiple attempts to finally be able to move over my portfolio.</p>

<p>This post is divided into two parts: steps followed and FAQs.</p>

<h2 id="steps">Steps</h2>

<p>Absolute Pre-requisites</p>

<ol>
  <li>Have an active account on IBKR. That means account verified and ready to buy stocks.</li>
  <li>Make your fractional shares whole on INDMoney. That means going from a quantity of, say, 1.6 to either 1 or 2.</li>
  <li>Keep at least <em>USD 70</em> in the INDMoney wallet.</li>
</ol>

<p>Now, here are the steps to transfer:</p>

<ol>
  <li>Log into IBKR.</li>
  <li>In the menu, go to <code class="language-plaintext highlighter-rouge">Transfers (Deposit, Withdraw and Transfer History)</code> (or whatever they are calling it now).</li>
  <li>Go to <code class="language-plaintext highlighter-rouge">Transfer Positions</code>.</li>
  <li>Select <code class="language-plaintext highlighter-rouge">Incoming</code>.</li>
  <li>Select the region as <code class="language-plaintext highlighter-rouge">United States</code>, because DriveWealth, the broker used by INDMoney, is in the US.</li>
  <li>Select <code class="language-plaintext highlighter-rouge">ACATS</code> as the transfer method. (usually the very first option and the most convenient)</li>
  <li>Choose <code class="language-plaintext highlighter-rouge">DriveWealth</code> in the broker dropdown.</li>
  <li>My account number looked like: <code class="language-plaintext highlighter-rouge">IFSC-XXX-&lt;Account number on the INDMoney app&gt;</code> (17 characters). The account number is the crucial part. One of my requests failed because of this. INDMoney doesn’t make it easy to find your account number. DON’T use the one shown on the app. Follow: <code class="language-plaintext highlighter-rouge">US Stocks Tab</code> → <code class="language-plaintext highlighter-rouge">Manage</code> → <code class="language-plaintext highlighter-rouge">US Stocks Reports</code> → <code class="language-plaintext highlighter-rouge">US Trades Report</code> to find your account number.</li>
  <li>Account Title and Tax Identification Number are pre-selected. Choose your Account Type. Mine was <code class="language-plaintext highlighter-rouge">Individual</code>.</li>
  <li>Decide between a <code class="language-plaintext highlighter-rouge">FULL ACATS</code> and a <code class="language-plaintext highlighter-rouge">PARTIAL ACATS</code> transfer. I did FULL ACATS. You can’t have any fractional shares for a full transfer. Refer to the FAQs for more details.</li>
  <li>Authorise IBKR to take appropriate actions when the positions you are transferring are not products that IBKR supports. IBKR explains what exactly you are authorising. I selected <code class="language-plaintext highlighter-rouge">Yes</code> for everything.</li>
  <li>Verify your details. Sign and submit.</li>
</ol>

<p>IBKR will validate and confirm the request, submit it to DriveWealth, and receive the stocks. You will get an email from IBKR about the completion or failure.</p>

<h2 id="faqs">FAQs</h2>

<p><strong>Q:</strong> What are the tax implications of the transfer process?<br />
<strong>A:</strong> My research told me that transferring stocks incurs <em>no capital gains tax</em>, as it’s a dealer transfer, not a sale. Verify with your CA. Notwithstanding, maintain your transaction records with the earlier platform for tax filing purposes.</p>

<p><strong>Q:</strong> Who is INDMoney’s broker?<br />
<strong>A:</strong> DriveWealth</p>

<p><strong>Q:</strong> Region to select for the transfer.<br />
<strong>A:</strong> DriveWealth is a US broker. So region is United States of America.</p>

<p><strong>Q:</strong> What is incoming ACATS and outgoing ACATS?<br />
<strong>A:</strong> Since I wanted to get the securities out of DriveWealth, it is going to be an <em>outgoing ACATS</em> for DriveWealth and <em>incoming ACATS</em> for Interactive Brokers.</p>

<p><strong>Q:</strong> Who initiates the ACATS (Automated Customer Account Transfer Service) transfer request?<br />
<strong>A:</strong> Always the receiving broker. So, there is no need to interact with INDMoney unless there is some snag in the process and you need help.</p>

<p><strong>Q:</strong> Is DriveWealth (INDMoney’s underlying broker) ACATS transfer enabled?<br />
<strong>A:</strong> Yes, I confirmed this with INDMoney.</p>

<p><strong>Q:</strong> Does DriveWealth charge a fee for the (outbound) transfer?<br />
<strong>A:</strong> Yes, I was charged <em>USD 65</em>. I confirmed this with an INDMoney customer support request. Their explanation:</p>

<blockquote>
  <p>For outgoing ACATS transfers, a fee is typically applied by your U.S. broker. This charge is levied by the broker currently holding your assets to facilitate the transfer out of their system.</p>
</blockquote>

<p><strong>Q:</strong> How do I pay for the DriveWealth fee?<br />
<strong>A:</strong> Keep at least USD 70 in the INDMoney wallet.</p>

<p><strong>Q:</strong> Does IBKR charge a fee for the (inbound) transfer?<br />
<strong>A:</strong> I was <em>not</em> charged anything.</p>

<p><strong>Q:</strong> Is there a penalty for transfer failures due to any errors?<br />
<strong>A:</strong> I was <em>not</em> charged anything. My transfer succeeded on the third attempt.</p>

<p><strong>Q:</strong> How long does it take to complete the transfer after submission?<br />
<strong>A:</strong> It took 4 working days from submission to completion for me. Under standard conditions, ACATS transfers to IBKR complete in about 4–8 business days after submission.</p>

<blockquote>
  <p>Gemini says that some industry sources note 3–5 business days in smooth cases. </p>
</blockquote>

<p><strong>Q:</strong> Can I trade on INDMoney during my transfer process?<br />
<strong>A:</strong> I avoided it after I raised my transfer request. INDMoney’s response:</p>

<blockquote>
  <p>During the outgoing ACATS transfer, trading activity (including deposits, buys, and sells) will be temporarily restricted to ensure the process completes without errors.</p>
</blockquote>

<p><strong>Q:</strong> Can I trade on IBKR during my transfer process?<br />
<strong>A:</strong> Yes.</p>

<p><strong>Q:</strong> Does it work if the residency status has changed, with the INDMoney app having residency A and IBKR having residency B?<br />
<strong>A:</strong> Many people on this <a href="https://www.reddit.com/r/INDmoneyApp/comments/1o1wxdj/how_to_handle_us_stock_investments_after_becoming/">reddit thread</a> seemed to think so. I tried, and it worked for me.</p>

<p><strong>Q:</strong> My address is different on INDMoney and IBKR. Will it work?<br />
<strong>A:</strong> It worked for me. According to Gemini, the ACATS system relies on an <strong>exact match</strong> of key identifying information between the delivering account (INDmoney’s partner broker) and the receiving account (IBKR) to prevent unauthorized transfers. The key matching criteria are:</p>

<ul>
  <li>Account Title/Registration: Your name must be identical on both accounts.</li>
  <li>Account Type: (e.g., Individual, Joint, etc.) must match.</li>
  <li>Tax ID: (e.g., Social Security Number, or other tax identification used) must match.</li>
</ul>

<p>All three were supplied by me. So, I guess that’s enough. I think it would have mattered if my country of residence had been US, because then my Tax ID would have changed.</p>

<p><strong>Q:</strong> How do I track my transfer?<br />
<strong>A:</strong> INDMoney doesn’t tell you anything about the transfer or its status. You will not even get a notification/email after the transfer is complete. The only place to track the transfer is on IBKR. Follow the same steps that you followed to initiate the transfer. IBKR will guide you to the status window.</p>

<p><strong>Q:</strong> How are fractional shares handled?<br />
<strong>A:</strong> This took a lot of research. Everyone online mentioned that you can’t transfer fractional shares. INDMoney also confirmed this:</p>

<blockquote>
  <p>Please note that fractional shares cannot be transferred. These may be liquidated as part of the transfer process.</p>
</blockquote>

<p>So this was clear. What was not clear was: do I need to make them whole, or will it only transfer the complete units and leave the fractional units in INDMoney? The latter part of INDMoney’s response gave the impression that it would handle the liquidation on its own. This assumption was incorrect. My request was rejected because of fractional shares.</p>

<p>So, as I mentioned in the pre-requisites, make your fractional shares whole or exit entirely. That is, turn 3.55 units of NVDA into 3, 4, or 0 units of NVDA.</p>

<p><strong>Q:</strong> Can I use partial transfer to avoid selling my fractional shares?<br />
<strong>A:</strong> I was always trying for full transfer. When the request was rejected due to fractional shares, I contemplated using partial transfer to avoid exiting. I was avoiding selling for two reasons:</p>

<ul>
  <li>I wasn’t sure if selling would lead to any tax implications;</li>
  <li>Some of my stocks were bought at very attractive prices. I wanted to protect my average price.</li>
</ul>

<p>After some thought, I decided to make them whole. Exited some of the fractional shares and bought some of the others (when there was a dip).</p>

<p><strong>Q:</strong> Does the ACATS transfer also take care of transferring my buy/sell history, average price and other associated history?<br />
<strong>A:</strong> I am not sure what is supposed to be transferred. What I got was only my average price. All the past buy/sell history was gone. Since it was gone, my CAGR-type metrics were also gone.</p>

<p><strong>Q:</strong> What resources did you refer to understand the process?<br />
<strong>A:</strong> Here are some links:</p>

<ul>
  <li><a href="https://paasa.com/blog/transfer-indmoney-to-paasa">Paasa Blog</a></li>
  <li><a href="https://support.vestedfinance.com/portal/en/kb/articles/how-do-i-migrate-my-us-brokerage-account-from-indmoney-stockal-to-vested-31-1-2024">Vested support</a></li>
  <li><a href="https://groww.in/blog/how-to-transfer-us-stocks-from-groww-to-external-platform">Groww blog</a></li>
  <li><a href="https://x.com/thetrickytrade/status/1749719803777147123">X/Twitter: 🚛 How to transfer your US Holdings out of Groww?</a></li>
  <li><a href="https://www.reddit.com/r/INDmoneyApp/comments/1meyz4u/has_anyone_successfully_transferred_their/">Reddit Question 1</a></li>
  <li><a href="https://www.reddit.com/r/INDmoneyApp/comments/1o1wxdj/how_to_handle_us_stock_investments_after_becoming/">Reddit discussion: how to handle INDMoney US investments after becoming NRI</a></li>
</ul>

<p><br /></p>]]></content><author><name>Shivam Rana</name></author><category term="Investing" /><summary type="html"><![CDATA[I wanted to move my US stocks (securities) from INDMoney to Interactive Brokers. I didn’t find any clear steps documented anywhere. It took me multiple attempts to finally be able to move over my portfolio.]]></summary></entry><entry><title type="html">PyTorch Fundamentals - Week 5</title><link href="https://trigonaminima.github.io/2025/11/pytorch-fundamentals3/" rel="alternate" type="text/html" title="PyTorch Fundamentals - Week 5" /><published>2025-11-29T00:00:00+00:00</published><updated>2025-11-29T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/11/pytorch-fundamentals3</id><content type="html" xml:base="https://trigonaminima.github.io/2025/11/pytorch-fundamentals3/"><![CDATA[<p>Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along <a href="https://github.com/Exorust/TorchLeet">TorchLeet</a> and go up to <a href="https://github.com/karpathy/nanoGPT">karpathy/nanoGPT</a> or <a href="https://github.com/karpathy/nanochat">karpathy/nanochat</a>. Previously,</p>

<ol>
  <li><a href="/2025/11/pytorch-fundamentals2/">PyTorch Fundamentals - Week 4</a></li>
  <li><a href="/2025/11/pytorch-fundamentals/">PyTorch Fundamentals - Week 1, 2, &amp; 3</a></li>
</ol>

<p>Now, a summary of week 5.</p>

<ul>
  <li>Linear Regression through a custom DNN with a non-linearity
    <ul>
      <li>An <code class="language-plaintext highlighter-rouge">nn.Module</code> subclass that uses an <code class="language-plaintext highlighter-rouge">nn.Sequential</code> to hold the dense linear layers.
        <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">dense</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
      <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">dense_dims</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
      <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
      <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="o">*</span><span class="n">dense_dims</span><span class="p">),</span>
      <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
      <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dense_dims</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">output_dim</span><span class="p">),</span>
  <span class="p">)</span>
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
  <li>Learnt about <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.Sequential.html"><code class="language-plaintext highlighter-rouge">nn.Sequential</code></a> vs <code class="language-plaintext highlighter-rouge">nn.Module</code>.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">nn.Sequential</code> is a subclass of <code class="language-plaintext highlighter-rouge">nn.Module</code>.</li>
      <li>It’s a convenient way to chain modules.</li>
      <li>Think: replacing <code class="language-plaintext highlighter-rouge">self.l1(self.l2(self.l3(x)))</code> with <code class="language-plaintext highlighter-rouge">self.layer123(x)</code> where <code class="language-plaintext highlighter-rouge">layer123</code> is an <code class="language-plaintext highlighter-rouge">nn.Sequential</code> of <code class="language-plaintext highlighter-rouge">l3</code>, <code class="language-plaintext highlighter-rouge">l2</code>, and <code class="language-plaintext highlighter-rouge">l1</code>, applied in that order.</li>
      <li>Also read: <a href="https://stackoverflow.com/q/68606661/2650427">What is difference between nn.Module and nn.Sequential</a></li>
    </ul>
  </li>
  <li>Tried three different dense layer hidden unit configurations with each of the below loss functions:
    <ol>
      <li><code class="language-plaintext highlighter-rouge">nn.MSELoss()</code></li>
      <li><code class="language-plaintext highlighter-rouge">HuberLoss()</code> - created <a href="/2025/11/pytorch-fundamentals2/">last week</a></li>
      <li><code class="language-plaintext highlighter-rouge">nn.L1Loss()</code></li>
    </ol>

    <p>Dense layer hidden unit variants: <code class="language-plaintext highlighter-rouge">(3, 7)</code>, <code class="language-plaintext highlighter-rouge">(4, 8)</code>, <code class="language-plaintext highlighter-rouge">(5, 9)</code>.</p>
  </li>
  <li>Huber Loss blew MSE Loss and L1 Loss out of the water. (Figure from TensorBoard.)
    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2025-11/dnn_loss.png" alt="" style="text-align: center; margin: auto" width="300" />
  <!-- <figcaption style="text-align: center">Figure 1:</figcaption> -->
  </figure>
  </li>
  <li>TensorBoard at work
    <ul>
      <li>I was running a remote training pipeline with TensorBoard logging. This pipeline dumps the logs on S3.</li>
      <li>The pipeline was failing with the following error: <code class="language-plaintext highlighter-rouge">File system scheme 's3' not implemented</code>.</li>
      <li>Solution:
        <ul>
          <li>After multiple loops of [add + commit + push + job run] found the <a href="https://stackoverflow.com/a/71628326/2650427">solution</a>: <code class="language-plaintext highlighter-rouge">pip install tensorflow-io</code>. This is not the end of the solution.</li>
          <li>The <a href="https://github.com/tensorflow/tensorboard/issues/5480#issuecomment-2251363802">very last comment</a> on the <a href="https://github.com/tensorflow/tensorboard/issues/5480">tensorboard github issue</a> gives the final trick: Add <code class="language-plaintext highlighter-rouge">import tensorflow_io</code> to the pipeline even if not using it anywhere.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
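
<p>The chaining idea behind <code class="language-plaintext highlighter-rouge">nn.Sequential</code> can be sketched without PyTorch. The snippet below is only a plain-Python analogy, not how PyTorch implements it: <code class="language-plaintext highlighter-rouge">TinySequential</code> and the toy stages <code class="language-plaintext highlighter-rouge">l1</code>, <code class="language-plaintext highlighter-rouge">l2</code>, and <code class="language-plaintext highlighter-rouge">l3</code> are hypothetical names made up for illustration. It shows that calling the stages in order gives the same result as the nested call.</p>

```python
class TinySequential:
    """Hypothetical plain-Python analogue of nn.Sequential: call each stage in order."""

    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, x):
        for stage in self.stages:
            x = stage(x)
        return x


# Toy stand-ins for layers (assumptions for illustration, not real PyTorch modules).
def l3(x):
    return x + 1


def l2(x):
    return max(0, x)  # ReLU-like clamp


def l1(x):
    return 2 * x


# The stage that runs first is listed first, so l3 comes before l2 and l1.
layer123 = TinySequential(l3, l2, l1)

nested = l1(l2(l3(2)))   # innermost call runs first
chained = layer123(2)    # same order, written left to right
```

<p>Note the ordering: the innermost function of the nested call is the first argument of the chain.</p>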

<p><br /></p>]]></content><author><name>Shivam Rana</name></author><category term="DL" /><summary type="html"><![CDATA[Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along TorchLeet and go up to karpathy/nanoGPT or karpathy/nanochat. Previously,]]></summary></entry><entry><title type="html">PyTorch Fundamentals - Week 4</title><link href="https://trigonaminima.github.io/2025/11/pytorch-fundamentals2/" rel="alternate" type="text/html" title="PyTorch Fundamentals - Week 4" /><published>2025-11-22T00:00:00+00:00</published><updated>2025-11-22T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/11/pytorch-fundamentals2</id><content type="html" xml:base="https://trigonaminima.github.io/2025/11/pytorch-fundamentals2/"><![CDATA[<p>Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along <a href="https://github.com/Exorust/TorchLeet">TorchLeet</a> and go up to <a href="https://github.com/karpathy/nanoGPT">karpathy/nanoGPT</a> or <a href="https://github.com/karpathy/nanochat">karpathy/nanochat</a>. Previously,</p>

<ol>
  <li><a href="/2025/11/pytorch-fundamentals/">PyTorch Fundamentals - Week 1, 2, &amp; 3</a></li>
</ol>

<p>Now, a summary of week 4.</p>

<ul>
  <li>Custom Loss function: Huber Loss
    <ul>
      <li>A <code class="language-plaintext highlighter-rouge">torch.nn.Module</code> subclass with a <code class="language-plaintext highlighter-rouge">forward()</code> method to compute the loss.</li>
      <li>
        <p>Huber Loss is defined as:</p>

\[L_{\delta}(y, \hat{y}) =
      \begin{cases}
      \frac{1}{2}(y - \hat{y})^2 &amp; \text{for } |y - \hat{y}| \leq \delta, \\
      \delta \cdot (|y - \hat{y}| - \frac{1}{2} \delta) &amp; \text{for } |y - \hat{y}| &gt; \delta,
      \end{cases}\]

        <p>where:</p>
        <ul>
          <li>\(y\) is the true value,</li>
          <li>\(\hat{y}\) is the predicted value,</li>
          <li>\(\delta\) is a threshold parameter that controls the transition between L1 and L2 loss.</li>
        </ul>
      </li>
      <li>More details about the Huber Loss</li>
    </ul>
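
    <p>To make the two branches concrete, here is a small plain-Python sketch of the formula above. It does not use PyTorch, and <code class="language-plaintext highlighter-rouge">huber</code> and <code class="language-plaintext highlighter-rouge">mean_huber</code> are hypothetical helper names. With a threshold of 1, an error of 0.5 falls in the quadratic branch and an error of 3 falls in the linear branch.</p>

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a single prediction, following the piecewise definition."""
    error = abs(y - y_hat)
    if error <= delta:
        return 0.5 * error ** 2            # quadratic (L2-like) branch
    return delta * (error - 0.5 * delta)   # linear (L1-like) branch


def mean_huber(ys, y_hats, delta=1.0):
    """Average the per-element losses so the result is a single scalar."""
    losses = [huber(y, y_hat, delta) for y, y_hat in zip(ys, y_hats)]
    return sum(losses) / len(losses)


small = huber(0.0, 0.5)                      # 0.5 * 0.5**2 = 0.125
large = huber(0.0, 3.0)                      # 1.0 * (3.0 - 0.5) = 2.5
batch = mean_huber([0.0, 0.0], [0.5, 3.0])   # (0.125 + 2.5) / 2 = 1.3125
```

    <p>Both branches give 0.5 at an error of exactly 1, so the loss is continuous at the threshold.</p>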
  </li>
  <li>Some custom losses in Keras and PyTorch: <a href="https://www.kaggle.com/code/bigironsphere/loss-function-library-keras-pytorch/notebook">Loss Function Library - Keras &amp; PyTorch</a></li>
  <li>Used the Linear Regression model to test the custom loss.</li>
  <li>Error 1: <code class="language-plaintext highlighter-rouge">RuntimeError: grad can be implicitly created only for scalar outputs</code>
    <ul>
      <li>Reason: the <code class="language-plaintext highlighter-rouge">forward()</code> function was returning a tensor with more than one element. Got the hint from <a href="https://discuss.pytorch.org/t/loss-backward-raises-error-grad-can-be-implicitly-created-only-for-scalar-outputs/12152">PyTorch forums</a>.</li>
      <li>Fix: returned <code class="language-plaintext highlighter-rouge">loss.mean()</code> instead of <code class="language-plaintext highlighter-rouge">loss</code>.</li>
    </ul>
  </li>
  <li>Error 2: All the losses were <code class="language-plaintext highlighter-rouge">nan</code>: this was a genuine bug in my code.</li>
  <li>Implementation approach 1: Use masks (my approach)
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="p">):</span>
      <span class="n">error</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">y_true</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)</span>

      <span class="n">flag1</span> <span class="o">=</span> <span class="n">error</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span>
      <span class="n">flag2</span> <span class="o">=</span> <span class="n">error</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span>

      <span class="n">l2_loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">error</span><span class="o">**</span><span class="mi">2</span> <span class="o">*</span> <span class="n">flag1</span>
      <span class="n">l1_loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span> <span class="o">*</span> <span class="p">(</span><span class="n">error</span> <span class="o">-</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span><span class="p">)</span> <span class="o">*</span> <span class="n">flag2</span>
      <span class="n">loss</span> <span class="o">=</span> <span class="n">l2_loss</span> <span class="o">+</span> <span class="n">l1_loss</span>
      <span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>Implementation approach 2: Use <a href="https://docs.pytorch.org/docs/stable/generated/torch.where.html"><code class="language-plaintext highlighter-rouge">torch.where()</code></a> (solution provided in <a href="https://github.com/Exorust/TorchLeet/blob/main/torch/basic/custom-loss/custom-loss_SOLN.ipynb">TorchLeet</a>)
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="p">):</span>
      <span class="n">error</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">y_true</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)</span>

      <span class="n">condition</span> <span class="o">=</span> <span class="n">error</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span>
      <span class="n">loss</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">condition</span><span class="p">,</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">error</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span> <span class="o">*</span> <span class="p">(</span><span class="n">error</span> <span class="o">-</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">d</span><span class="p">))</span>
      <span class="k">return</span> <span class="n">loss</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
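  <li>Turns out, <code class="language-plaintext highlighter-rouge">torch.where()</code> is the better way of doing this. Both versions are vectorised and GPU-friendly, but <code class="language-plaintext highlighter-rouge">torch.where()</code> selects between the two branches element-wise in a single operation, which also makes for a cleaner implementation of the same logic. Masking requires extra memory for the masked intermediates and extra operations (two multiplications and one addition).</li>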
  <li>Used tensorboard to visualise the training results.</li>
  <li>Read up more on <a href="https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html#torch.optim.Optimizer.zero_grad"><code class="language-plaintext highlighter-rouge">optimizer.zero_grad()</code></a>.
    <ul>
      <li>PyTorch accumulates gradients by default. The <code class="language-plaintext highlighter-rouge">loss.backward()</code> will add to the previous gradients (can be accessed by <code class="language-plaintext highlighter-rouge">weight.grad</code>).</li>
      <li>If we don’t reset the gradients using <code class="language-plaintext highlighter-rouge">zero_grad()</code>, the new gradient will be a combination of the old and the newly-computed gradient. Since the old gradient was already used to update the model in the last iteration, the combined gradient will point in a different direction than the minimum (or maximum.) [<a href="https://stackoverflow.com/q/48001598/2650427">ref</a>]</li>
    </ul>
  </li>
  <li><strong>Q:</strong> When should we skip <code class="language-plaintext highlighter-rouge">zero_grad()</code>? <strong>A:</strong> When we want gradient accumulation on purpose.</li>
  <li>
    <p><strong>Q:</strong> When do we want gradient accumulation on purpose? <strong>A:</strong> In the following scenarios:</p>

    <ol>
      <li>Large batch size with limited GPU memory. Split the batch into mini-batches. Accumulate gradients for all the mini-batches and then run <code class="language-plaintext highlighter-rouge">optimizer.step()</code>. Used during training on smaller GPUs.</li>
      <li>Multiple loss components before a single update. Useful for multi-task learning. Losses that require multiple passes.</li>
      <li>Parallel training. When model is split across devices -&gt; accumulate the gradients across the micro-batches and then update parameters once.</li>
      <li>Training with noisy gradients. Accumulate over multiple steps with noisy gradients to smooth the gradients before updating.</li>
    </ol>
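
    <p>Scenario 1 can be checked with plain arithmetic. The sketch below uses no PyTorch: <code class="language-plaintext highlighter-rouge">mse_grad</code> is a hypothetical helper, and the variable <code class="language-plaintext highlighter-rouge">grad</code> stands in for <code class="language-plaintext highlighter-rouge">weight.grad</code>. It assumes equal-sized micro-batches and scales each micro-batch gradient by <code class="language-plaintext highlighter-rouge">1 / accum_steps</code>, so that the accumulated value matches the full-batch gradient before a single update.</p>

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Reference: gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

# Accumulate over two micro-batches, then update once.
accum_steps = 2
micro = len(xs) // accum_steps
grad = 0.0  # plays the role of weight.grad right after zero_grad()
for i in range(accum_steps):
    start, stop = i * micro, (i + 1) * micro
    # Adding to grad mimics what loss.backward() does to weight.grad.
    grad += mse_grad(w, xs[start:stop], ys[start:stop]) / accum_steps

lr = 0.01
w_new = w - lr * grad  # single update, analogous to optimizer.step()
```

    <p>Without the <code class="language-plaintext highlighter-rouge">1 / accum_steps</code> scaling, the accumulated gradient would be the sum of the micro-batch gradients, which behaves like a larger learning rate.</p>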
  </li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="DL" /><summary type="html"><![CDATA[Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along TorchLeet and go up to karpathy/nanoGPT or karpathy/nanochat. Previously,]]></summary></entry><entry><title type="html">PyTorch Fundamentals - Week 1, 2, &amp;amp; 3</title><link href="https://trigonaminima.github.io/2025/11/pytorch-fundamentals/" rel="alternate" type="text/html" title="PyTorch Fundamentals - Week 1, 2, &amp;amp; 3" /><published>2025-11-10T00:00:00+00:00</published><updated>2025-11-10T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/11/pytorch-fundamentals</id><content type="html" xml:base="https://trigonaminima.github.io/2025/11/pytorch-fundamentals/"><![CDATA[<p>Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along <a href="https://github.com/Exorust/TorchLeet">TorchLeet</a> and go up to <a href="https://github.com/karpathy/nanoGPT">karpathy/nanoGPT</a> or <a href="https://github.com/karpathy/nanochat">karpathy/nanochat</a>. Summary of the 1st three weeks.</p>

<h3 id="week-1">Week 1</h3>

<ul>
  <li>Create a Linear Regression model.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">torch.nn.Linear</code> to define a learnable model.</li>
      <li><code class="language-plaintext highlighter-rouge">forward()</code> for forward pass.</li>
      <li><code class="language-plaintext highlighter-rouge">model.parameters()</code> returns all the learnable weights; it is also what gets passed to the optimizer (<code class="language-plaintext highlighter-rouge">SGD</code>, <code class="language-plaintext highlighter-rouge">Adam</code>, etc.)</li>
      <li>Use <code class="language-plaintext highlighter-rouge">torch.no_grad()</code> during inference.</li>
    </ul>
  </li>
  <li>Log the training logs to TensorBoard. (not a part of the TorchLeet repo)
    <ul>
      <li><code class="language-plaintext highlighter-rouge">SummaryWriter</code> from <code class="language-plaintext highlighter-rouge">torch.utils.tensorboard</code>. TensorFlow can directly use a callback inside the fit function to push all the relevant logs. The <code class="language-plaintext highlighter-rouge">SummaryWriter</code> gives fine-grained control to log anything.</li>
      <li><code class="language-plaintext highlighter-rouge">add_scalar()</code> to log the training loss.</li>
      <li><code class="language-plaintext highlighter-rouge">add_graph()</code> to log the graph itself.</li>
      <li>
        <p>Load the Jupyter tensorboard extension so that we don’t have to leave the notebook to look at the logs and pretty plots.</p>

        <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  %load_ext tensorboard
</code></pre></div>        </div>
      </li>
      <li>
        <p>Load the tensorboard UI inside the notebook.</p>

        <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  %tensorboard <span class="nt">--logdir</span> PATH_TO_LOG_DIR
</code></pre></div>        </div>
      </li>
    </ul>
  </li>
</ul>

<h3 id="week-2">Week 2</h3>

<ul>
  <li>Create a Dataset
    <ul>
      <li><code class="language-plaintext highlighter-rouge">Dataset</code> class from <code class="language-plaintext highlighter-rouge">torch.utils.data</code>.</li>
      <li>Create a subclass of <code class="language-plaintext highlighter-rouge">Dataset</code> for my specific dataset. Added <code class="language-plaintext highlighter-rouge">data</code>, <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">y</code> attributes to the class.</li>
      <li>Since we will iterate through the rows of this dataset, defined <code class="language-plaintext highlighter-rouge">__len__</code> and <code class="language-plaintext highlighter-rouge">__getitem__</code> functions. These overloaded functions enable code like <code class="language-plaintext highlighter-rouge">len(dataset)</code> and <code class="language-plaintext highlighter-rouge">dataset[i]</code>, respectively.</li>
    </ul>
  </li>
  <li>Dataloader
    <ul>
      <li>
        <p><code class="language-plaintext highlighter-rouge">Dataset</code> only defines the dataset. <code class="language-plaintext highlighter-rouge">DataLoader</code> from <code class="language-plaintext highlighter-rouge">torch.utils.data</code> creates an iterator over it. It also brings other capabilities like batching and shuffling. Eg:</p>

        <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>        </div>
      </li>
      <li>
        <p>We can run a <code class="language-plaintext highlighter-rouge">for</code> loop on this <code class="language-plaintext highlighter-rouge">dataloader</code> now.</p>
      </li>
    </ul>
  </li>
  <li>Good intro to the topic from PyTorch - <a href="https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html">Datasets and DataLoaders</a>.</li>
  <li>Trained the Linear Regression model using the dataloader. Faced some issues due to <code class="language-plaintext highlighter-rouge">dtype</code> mismatches - used <code class="language-plaintext highlighter-rouge">torch.float32</code> everywhere to fix it.</li>
  <li>The exercise only asked for a single column dataset. Played around with a dataset with multiple columns.</li>
  <li>Use tensorboard for all the logging.</li>
</ul>
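<p>Putting the Week 2 pieces together, a rough sketch of such a dataset could look like this. The class name <code class="language-plaintext highlighter-rouge">TabularDataset</code>, the toy rows and the "last column is the target" layout are my assumptions for the example, not the exact code from the exercise.</p>

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TabularDataset(Dataset):
    """Hypothetical dataset: every row is [features..., target]."""

    def __init__(self, rows):
        # float32 everywhere avoids dtype mismatches with nn.Linear weights.
        self.data = torch.tensor(rows, dtype=torch.float32)
        self.X = self.data[:, :-1]
        self.y = self.data[:, -1:]

    def __len__(self):
        # Enables len(dataset).
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Enables dataset[i]; returns one (features, target) pair.
        return self.X[idx], self.y[idx]


# Toy data following y = 2x + 1.
rows = [[float(i), 2.0 * i + 1.0] for i in range(10)]
dataset = TabularDataset(rows)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for X_batch, y_batch in dataloader:
    batch_shapes.append((tuple(X_batch.shape), tuple(y_batch.shape)))
```

<p>With 10 rows and a batch size of 4, the loop yields three batches, the last one holding the remaining 2 rows.</p>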

<h3 id="week-3">Week 3</h3>

<ul>
  <li>Two types of activation functions – with learnable parameters and without.</li>
  <li>An activation function with learnable parameters requires an <code class="language-plaintext highlighter-rouge">nn.Module</code> subclass. The subclass registers the parameters so that they receive gradients through the <code class="language-plaintext highlighter-rouge">forward</code> and <code class="language-plaintext highlighter-rouge">backward</code> passes and end up as trained weights.</li>
  <li>Created a custom activation <em>without</em> learnable parameters: \(\text{tanh}(x) + x\).</li>
  <li>
    <p>Updated the Linear Regression class to have the final output go through \(\text{tanh}(x) + x\) using <code class="language-plaintext highlighter-rouge">torch.tanh()</code>.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">custom_activation</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">linear</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div>    </div>
  </li>
  <li>This <a href="https://stackoverflow.com/a/57013056/2650427">SO answer</a> talks about how to write a custom activation function in different scenarios: non-learnable, learnable, learnable with PyTorch functions, and  learnable without PyTorch functions.</li>
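  <li>Also learned about <code class="language-plaintext highlighter-rouge">torch.nn.Parameter</code> and <code class="language-plaintext highlighter-rouge">torch.autograd.Variable</code> (the latter is deprecated; plain tensors with <code class="language-plaintext highlighter-rouge">requires_grad=True</code> replace it).</li>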
  <li>Kept using Tensorboard for all the logging.</li>
</ul>
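<p>For reference, a minimal sketch of how the non-learnable activation can be wired into the model. The class name and the <code class="language-plaintext highlighter-rouge">in_features</code> argument are assumptions for illustration; only the <code class="language-plaintext highlighter-rouge">tanh(x) + x</code> activation and the <code class="language-plaintext highlighter-rouge">forward</code> return line come from the notes above.</p>

```python
import torch
from torch import nn


class LinearRegressionWithActivation(nn.Module):
    """Hypothetical linear regression whose output passes through tanh(x) + x."""

    def __init__(self, in_features=1):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)

    @staticmethod
    def custom_activation(x):
        # No learnable parameters, so a plain function is enough;
        # autograd differentiates through torch.tanh on its own.
        return torch.tanh(x) + x

    def forward(self, x):
        return self.custom_activation(self.linear(x))


model = LinearRegressionWithActivation(in_features=1)
x = torch.zeros(3, 1)
out = model(x)
```

<p>The model still has only the weight and bias of the linear layer as parameters, since the activation adds none.</p>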

<p><br /></p>

<p>Next 2 weeks: Custom Loss Function (Huber Loss) and Deep Neural Network</p>

<hr />]]></content><author><name>Shivam Rana</name></author><category term="DL" /><summary type="html"><![CDATA[Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along TorchLeet and go up to karpathy/nanoGPT or karpathy/nanochat. Summary of the 1st three weeks.]]></summary></entry><entry><title type="html">Life Logging: Calls</title><link href="https://trigonaminima.github.io/2025/09/life-logging-calls-tasker/" rel="alternate" type="text/html" title="Life Logging: Calls" /><published>2025-09-28T00:00:00+00:00</published><updated>2025-09-28T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/09/life-logging-calls-tasker</id><content type="html" xml:base="https://trigonaminima.github.io/2025/09/life-logging-calls-tasker/"><![CDATA[<p>I am building a comprehensive set of tools to do life logging. General idea is:</p>

<ul>
  <li>Push everything to a sink; and</li>
  <li>Visualise the data in this sink.</li>
</ul>

<p>Objective is to do weekly reviews and take interventions if things are not BAU. Long term vision is to eventually have enough signals to give me a comprehensive understanding of myself (physical, mental, social, financial, etc).</p>

<p>This post is about logging calls using <a href="https://tasker.joaoapps.com/">Tasker for Android</a>.</p>

<h2 id="logging-phone-calls">Logging Phone Calls</h2>

<p>Steps:</p>

<ol>
  <li>Trigger on the event “Phone Idle”, which fires whenever the phone returns to an idle state (after an incoming, missed, or outgoing call).</li>
  <li>Read the data provider <code class="language-plaintext highlighter-rouge">content://call_log/calls</code> to get the most recent call details</li>
  <li>Format the details for my use</li>
  <li>Push to an endpoint that saves this data in a table.</li>
</ol>

<p>This is what the logs look like:</p>

<blockquote>
  <p>#call(40) +type[miss] @num[Friend Number] @name[Friend Name] +mode[phone] +add[My Location]</p>
</blockquote>

<blockquote>
  <p>#call(40) +type[in] @num[Friend Number] @name[Friend Name] +mode[phone] +add[My Location]</p>
</blockquote>

<blockquote>
  <p>#call(84) +type[out] @num[Unsaved Caller’s Number] @name[&lt;null&gt;] +mode[phone] +add[My Location]</p>
</blockquote>

<details closed="">
<summary>Tasker profile to log phone calls</summary>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"><table class="rouge-table"><tbody><tr><td class="code"><pre>Profile: Log Phone Calls
        Event: Phone Idle

    Enter Task: Call Logs

    A1: Variable Set <span class="o">[</span>
        Name: %call_log_cols
        To: <span class="nb">date</span>, geocoded_location, countryiso, <span class="nb">type</span>, number, name, duration
        Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

    A2: SQL Query <span class="o">[</span>
        Mode: URI Formatted
        File: content://call_log/calls
        Columns: %call_log_cols
        Order By: <span class="nb">date </span>desc
        Output Column Divider: ,
        Variable Array: %call_logs <span class="o">]</span>

    A3: Multiple Variables Set <span class="o">[</span>
        Names: %qs_ts, %call_geocoded_location, %call_countryiso, %call_type, %call_number, %caller_name, %call_duration
        Variable Names Splitter: ,
        Values: %call_logs<span class="o">(</span>1<span class="o">)</span>
        Values Splitter: ,
        Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On
        Use Global Namespace: On <span class="o">]</span>

    A4: Variable Clear <span class="o">[</span>
        Name: %call_logs <span class="o">]</span>

    A5: If <span class="o">[</span> %qs_ts neq %LAST_CALLTS <span class="o">]</span>

        A15: If <span class="o">[</span> %call_type eq 1 <span class="o">]</span>

            A16: Variable Set <span class="o">[</span>
                Name: %call_type
                To: <span class="k">in
                </span>Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A17: Else
            If  <span class="o">[</span> %call_type eq 2 <span class="o">]</span>

            A18: Variable Set <span class="o">[</span>
                Name: %call_type
                To: out
                Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A19: Else
            If  <span class="o">[</span> %call_type eq 3 <span class="o">]</span>

            A20: Variable Set <span class="o">[</span>
                Name: %call_type
                To: miss
                Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A21: End If

        A22: Variable Set <span class="o">[</span>
            Name: %qs_note
            To: <span class="c">#call(%call_duration) +type[%call_type] @num[%call_number] @name[%caller_name] +mode[phone]</span>
            Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A23: Perform Task <span class="o">[</span>
            Name: Commons: POST Note &amp; Location
            Priority: %priority
            Local Variable Passthrough: On
            Limit Passthrough To: %qs_note, %qs_ts
            Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On
            Continue Task After Error:On <span class="o">]</span>

        A24: Variable Set <span class="o">[</span>
            Name: %LAST_CALLTS
            To: %qs_ts
            Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

    A25: End If
</pre></td></tr></tbody></table></code></pre></figure>

</details>
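<p>The formatting step of the profile can also be sketched in Python for readers who don’t use Tasker. This is only an illustration: the function name and arguments are made up, and the <code class="language-plaintext highlighter-rouge">+add[...]</code> location suffix is left out because I am assuming it gets appended later by the separate “POST Note and Location” task.</p>

```python
# Android call-log type codes handled by the profile: 1 = in, 2 = out, 3 = miss.
CALL_TYPES = {"1": "in", "2": "out", "3": "miss"}


def format_call_note(call_type, number, name, duration, mode="phone"):
    """Build the note string; unknown type codes pass through unchanged."""
    label = CALL_TYPES.get(str(call_type), str(call_type))
    return (
        f"#call({duration}) +type[{label}] "
        f"@num[{number}] @name[{name}] +mode[{mode}]"
    )


# Hypothetical missed call lasting 40 seconds in the call log.
note = format_call_note(3, "Friend Number", "Friend Name", 40)
```

<p>Like the Tasker version, any type code other than 1, 2 or 3 is written out as-is instead of being dropped.</p>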

<!-- <br> -->

<h2 id="logging-whatsapp-calls">Logging WhatsApp Calls</h2>

<p>You can’t read WhatsApp calls from a data provider the way you can for normal phone calls. WhatsApp also doesn’t support exporting the call logs, and these calls don’t show up in the normal phone call logs either. The only method I found was to read WhatsApp notifications to get the details. Here are the steps:</p>

<ol>
  <li>Every time WhatsApp gives a notification, run the next set of steps.</li>
  <li>If the notification is for an (audio/video) call, extract the details into the relevant variables.</li>
  <li>Push to an endpoint that saves this data in a table.</li>
</ol>

<p>Caveats:</p>

<ol>
  <li>WhatsApp generates separate call-related notifications:
    <ul>
      <li>An incoming audio/video call.</li>
      <li>A missed audio/video call (after an incoming call is missed).</li>
      <li>If multiple calls have piled up, a separate notification (“2+ missed calls from …”).</li>
      <li>An outgoing call just says “calling…”, so there is no audio/video label.</li>
    </ul>
  </li>
  <li>Since this is just a call notification (incoming, outgoing), there is no call duration available.</li>
</ol>

<p>This is what the logs look like:</p>

<blockquote>
  <p>#call(-1) +type[miss] @num[null] @name[Friend Name] +mode[whatsapp-video] +add[My Location]</p>
</blockquote>

<blockquote>
  <p>#call(-1) +type[in] @num[null] @name[Friend Name] +mode[whatsapp-video] +add[My Location]</p>
</blockquote>

<blockquote>
  <p>#call(-1) +type[out] @num[null] @name[Friend Name] +mode[whatsapp-any] +add[My Location]</p>
</blockquote>

<details closed="">
<summary>Tasker profile to log WA calls</summary>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"><table class="rouge-table"><tbody><tr><td class="code"><pre>Profile: Log WhatsApp Calls
    	Event: Notification <span class="o">[</span> Owner Application:WhatsApp Title:<span class="k">*</span> Text:<span class="k">*</span> Subtext:<span class="k">*</span> Messages:<span class="k">*</span> Other Text:<span class="k">*</span> Cat:<span class="k">*</span> New Only:On <span class="o">]</span>

    Enter Task: WhatsApp Call Logs

    A3: If <span class="o">[</span> %evtprm7 eq call <span class="o">]</span>

        A4: Multiple Variables Set <span class="o">[</span>
             Names: %qs_ts, %call_geocoded_location, %call_countryiso, %call_type_str, %call_number, %caller_name, %call_duration, %call_type,%call_mode
             Variable Names Splitter: ,
             Values: %TIMEMS,,,%evtprm3,null,%evtprm2,-1,null,any
             Values Splitter: ,
             Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A5: If <span class="o">[</span> %call_type_str ~R .<span class="k">*</span>Calling.<span class="k">*</span> <span class="o">]</span>

            A6: Variable Set <span class="o">[</span>
                 Name: %call_type
                 To: out
                 Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A7: Else
            If  <span class="o">[</span> %call_type_str ~R .<span class="k">*</span>Incoming.<span class="k">*</span> <span class="o">]</span>

            A8: Variable Set <span class="o">[</span>
                 Name: %call_type
                 To: <span class="k">in
                 </span>Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A9: Else
            If  <span class="o">[</span> %call_type_str ~R .<span class="k">*</span>Missed.<span class="k">*</span> <span class="o">]</span>

            A10: Variable Set <span class="o">[</span>
                  Name: %call_type
                  To: miss
                  Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A11: End If

        A12: If <span class="o">[</span> %call_type_str ~R .<span class="k">*</span>voice.<span class="k">*</span> <span class="o">]</span>

            A13: Variable Set <span class="o">[</span>
                  Name: %call_mode
                  To: voice
                  Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A14: Else
            If  <span class="o">[</span> %call_type_str ~R .<span class="k">*</span>video.<span class="k">*</span> <span class="o">]</span>

            A15: Variable Set <span class="o">[</span>
                  Name: %call_mode
                  To: video
                  Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A16: End If

        A17: Variable Set <span class="o">[</span>
              Name: %qs_note
              To: <span class="c">#call(%call_duration) +type[%call_type] @num[%call_number] @name[%caller_name] +mode[whatsapp-%call_mode]</span>
              Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On <span class="o">]</span>

        A18: Flash <span class="o">[</span>
              Text: %qs_note
              Continue Task Immediately: On
              Dismiss On Click: On <span class="o">]</span>

        A19: Perform Task <span class="o">[</span>
              Name: Commons: POST Note &amp; Location
              Priority: %priority
              Local Variable Passthrough: On
              Limit Passthrough To: %qs_note, %qs_ts
              Structure Output <span class="o">(</span>JSON, etc<span class="o">)</span>: On
              Continue Task After Error:On <span class="o">]</span>

        A20: Write File <span class="o">[</span>
              File: Download/wa_calls.txt
              Text: %qs_ts, %call_geocoded_location, %call_countryiso, %call_type_str, %call_number, %caller_name, %call_duration, %call_type,%call_mode
             %qs_note

              Append: On
              Add Newline: On <span class="o">]</span>

    A21: End If
</pre></td></tr></tbody></table></code></pre></figure>

</details>
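<p>The matching logic in actions A5 to A16 boils down to two independent lookups on the notification text. Here is a rough Python equivalent; the function name and the sample notification strings are hypothetical, and the matching is case-sensitive as written here.</p>

```python
import re


def parse_whatsapp_call(text):
    """Map a notification text to (call_type, call_mode)."""
    # Defaults mirror the profile: type "null" and mode "any" until matched.
    call_type = "null"
    for pattern, label in (("Calling", "out"), ("Incoming", "in"), ("Missed", "miss")):
        if re.search(pattern, text):
            call_type = label
            break

    call_mode = "any"
    for pattern in ("voice", "video"):
        if re.search(pattern, text):
            call_mode = pattern
            break

    return call_type, call_mode


# Hypothetical notification texts, not captured from a real device.
incoming = parse_whatsapp_call("Incoming video call")
outgoing = parse_whatsapp_call("Calling...")
```

<p>The outgoing case keeps the default mode, which is why those logs end up with <code class="language-plaintext highlighter-rouge">+mode[whatsapp-any]</code>.</p>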

<p><br /></p>

<p>Bye.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-self" /><summary type="html"><![CDATA[I am building a comprehensive set of tools to do life logging. General idea is:]]></summary></entry><entry><title type="html">[Mini] Life Logging</title><link href="https://trigonaminima.github.io/2025/09/life-logging/" rel="alternate" type="text/html" title="[Mini] Life Logging" /><published>2025-09-27T00:00:00+00:00</published><updated>2025-09-27T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/09/life-logging</id><content type="html" xml:base="https://trigonaminima.github.io/2025/09/life-logging/"><![CDATA[<p>I am building a comprehensive set of tools to do life logging. General idea is:</p>

<ul>
  <li>Push everything to a sink; and</li>
  <li>Visualise the data in this sink.</li>
</ul>

<p>Objective is to do weekly reviews and take interventions if things are not BAU. Long term vision is to eventually have enough signals to give me a comprehensive understanding of myself (physical, mental, social, financial, etc).</p>

<p>Current progress in reverse chronology:</p>

<ul>
  <li>2025: <a href="/2025/09/life-logging-calls-tasker/">Logging phone calls</a></li>
</ul>

<p>I have tried this multiple times in various formats over the years. Here are my previous efforts in reverse chronology:</p>

<ul>
  <li>2023: <a href="/2023/06/google-fit-data/">Google Fit data sync and analysis</a> – my most successful attempt and still in-use. Pushed me to keep things simple.</li>
  <li>2021: <a href="/2021/08/flutter_app_3/">Flutter app to do the logging 3</a> – couldn’t manage building this along with work.</li>
  <li>2021: <a href="/2021/08/flutter_app_2/">Flutter app to do the logging 2</a></li>
  <li>2021: <a href="/2021/07/flutter_app_1/">Flutter app to do the logging 1</a></li>
  <li>2018: <a href="/2018/04/chatting-up-2/">Analysis of my WA chats 2</a> – good analysis on my chatting habits, but nothing new. Led me to work on some solo research projects.</li>
  <li>2016: <a href="/2016/06/chatting-up/">Analysis of my WA chats</a></li>
  <li>2014: <a href="/2014/11/gamification-of-life/">Gamification of Life</a> – too much information to manage, eventually started feeling like a chore.</li>
</ul>

<p>Bye.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-self" /><summary type="html"><![CDATA[I am building a comprehensive set of tools to do life logging. General idea is:]]></summary></entry><entry><title type="html">Consolidated Recommendation Systems</title><link href="https://trigonaminima.github.io/2025/02/consolidated-recsys/" rel="alternate" type="text/html" title="Consolidated Recommendation Systems" /><published>2025-02-13T00:00:00+00:00</published><updated>2025-02-13T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/02/consolidated-recsys</id><content type="html" xml:base="https://trigonaminima.github.io/2025/02/consolidated-recsys/"><![CDATA[<p>This post is a quick summary of <a href="https://netflixtechblog.medium.com/lessons-learnt-from-consolidating-ml-models-in-a-large-scale-recommendation-system-870c5ea5eb4a">Lessons Learnt From Consolidating ML Models in a Large Scale Recommendation System</a>. I have also added a few questions I got while reading it. I end the post with what we do at work to deal with this.</p>

<h2 id="summary">Summary</h2>

<ul>
  <li>Recommendation System: candidate gen + ranking.</li>
  <li>
    <p>A typical ranking model pipeline:</p>

    <ol>
      <li>Label prep</li>
      <li>Feature prep</li>
      <li>Model training</li>
      <li>Model evaluation</li>
      <li>Model deployment (with inference contract)</li>
    </ol>
  </li>
  <li>Each recommendation use case (e.g.: discover page, notifications, related items, category exploration, search) will have a version of the above pipeline.</li>
  <li>
    <p>As use cases increase, the team will need to maintain multiple such pipelines. It is time-consuming to maintain multiple pipelines and increases points of failure.</p>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2025-02/consolidated_recsys_neflix_1.webp" alt="" style="text-align: center; margin: auto" />
  <figcaption style="text-align: center">Figure 1: Figure from the Netflix blog linked at the start.</figcaption>
  </figure>
  </li>
  <li>Since the pipelines have the same components, we can consolidate them.</li>
  <li>
    <p>Consolidated pipeline:</p>

    <ol>
      <li>Label prep for each use case separately</li>
      <li>Stratified union of all the prepared labels</li>
      <li>Feature prep (separate categorical feature representing the use case)</li>
      <li>Model training</li>
      <li>Model evaluation</li>
      <li>Model deployment (with inference contract)</li>
    </ol>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2025-02/consolidated_recsys_neflix_2.webp" alt="" style="text-align: center; margin: auto" width="100" />
  <figcaption style="text-align: center">Figure 2: Figure from the Netflix blog linked at the start.</figcaption>
  </figure>
  </li>
  <li>
    <p>Label prep for each use case separately</p>

    <ol>
      <li>Each use case will have different ways of generating the labels.</li>
      <li>Use case context details are added as separate features.
        <ul>
          <li>Search context: search query, region</li>
          <li>Similar items context: source item</li>
        </ul>
      </li>
      <li>When the use case is search, context features specific to the similar item use case will be filled with default values.</li>
    </ol>
  </li>
  <li>
    <p>Union of all the prepared labels</p>

    <ol>
      <li>Final labelled set: a% samples from use case-1 labels + b% samples from use case-2 labels + … + z% samples from use case-n labels</li>
      <li>The proportions [a, b, …, z] come from stratification</li>
      <li>Q: How is this stratification done? Platform traffic across different use cases?</li>
      <li>Q: What are the results when these proportions are business-driven? Eg: contribution to revenue.</li>
    </ol>
  </li>
  <li>
    <p>Feature prep</p>

    <ol>
      <li>All use case specific features added to the data.</li>
      <li>If a feature is only used for use case 1 then it will contain default value for all the other use cases.</li>
      <li>Add a new categorical feature task_type to the features to inform the model about the target reco task.</li>
    </ol>
  </li>
  <li>Model training happens as usual: feature vector and labels. Architecture remains the same. Optimisation remains the same.</li>
  <li>
    <p>Model evaluation</p>

    <ol>
      <li>Check the appropriate eval metrics to check the model.</li>
      <li>Q: How do we judge if the model performed well for all the use cases?</li>
      <li>Q: Will it require a separate evaluation set for each use case?</li>
      <li>Q: Can there be a 2nd order Simpson’s paradox here: the consolidated model performs well, but when tried for individual use cases, its performance is low? My hunch: no.</li>
    </ol>
  </li>
  <li>
    <p>Model deployment (with inference contract)</p>

    <ol>
      <li>Deploy the same model in the respective environment made for each use case. That env will have all the specific network-related knobs: batch size, throughput, latency, caching policy, parallelism, etc.</li>
      <li>Generic API contract to support the heterogeneous context (search query for search, source item for the related items use case).</li>
    </ol>
  </li>
  <li>
    <p>Caveats</p>

    <ol>
      <li>The consolidated use cases should be related (eg: ranking for movies in the search and discover page)</li>
      <li>One definition of related can be: ranking the same entities.</li>
    </ol>
  </li>
  <li>
    <p>Advantages</p>

    <ol>
      <li>Reduces maintenance costs (less code; fewer deployments)</li>
      <li>Quick model iterations to all the use cases
        <ul>
          <li>Updates (new features, architecture, etc) for one use case can be applied to other use cases.</li>
          <li>If consolidated tasks are related, then new features don’t cause regression in practice.</li>
        </ul>
      </li>
      <li>Can be extended to any related use case from offline and online POV.</li>
      <li>Cross-learning: the model potentially gains more (hidden) learning from the other tasks. Eg: having search data gives more data to the model learning for related-items task.
        <ul>
          <li>Q: Is this happening? How can we verify this? One way: Train an independent model on the use-case specific data and compare its performance with the consolidated model’s performance on the same task.</li>
        </ul>
      </li>
    </ol>
  </li>
  <li>I was confused about what to call this learning paradigm. <a href="https://en.wikipedia.org/wiki/Multi-task_learning">Wikipedia</a> says that it is multi-task learning.</li>
</ul>
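<p>To make the label union and feature prep steps concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the feature names, the default values, the function name <code class="language-plaintext highlighter-rouge">consolidate</code>, and the sampling proportions, which are simply passed in because the Netflix post does not say how they are chosen.</p>

```python
import random

# Hypothetical defaults for context features that a use case does not have.
DEFAULTS = {"search_query": "", "region": "", "source_item": -1}


def consolidate(labels_by_use_case, proportions, seed=0):
    """Sample each use case's labels and merge them into one training set."""
    rng = random.Random(seed)
    rows = []
    for task_type, examples in labels_by_use_case.items():
        # Take the configured fraction of this use case's labelled examples.
        k = round(len(examples) * proportions[task_type])
        for example in rng.sample(examples, k):
            row = dict(DEFAULTS)           # fill missing context with defaults
            row.update(example)            # keep the use-case specific values
            row["task_type"] = task_type   # tell the model which task this is
            rows.append(row)
    return rows


labels = {
    "search": [{"search_query": f"q{i}", "region": "IN", "label": i % 2} for i in range(10)],
    "similar_items": [{"source_item": i, "label": i % 2} for i in range(20)],
}
training_rows = consolidate(labels, {"search": 0.5, "similar_items": 0.25})
```

<p>Each output row has the same columns regardless of its origin, so a single model can be trained on the merged set with <code class="language-plaintext highlighter-rouge">task_type</code> as a categorical feature.</p>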

<h2 id="practice-at-my-work">Practice at my work</h2>

<ul>
  <li>The models are not merged across different tasks like relevance and search.</li>
  <li>Within relevance ranking tasks (discover, similar items, category exploration), have a common base ranker model.</li>
  <li>On top of that, we have different heuristics to make it better for that particular section.</li>
  <li>Advantages:
    <ul>
      <li>There is only one main model for all related tasks.</li>
      <li>Keeps the heuristics logic simple and, thus, easy to maintain.</li>
    </ul>
  </li>
  <li>Challenges
    <ul>
      <li>Heuristics are crude/manual/semi-automated → we may be leaving some gains on the table. There are bandit-based approaches to automating it, though.</li>
      <li>It loses out on cross-learning opportunities.</li>
    </ul>
  </li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="RecSys" /><summary type="html"><![CDATA[This post is a quick summary of Lessons Learnt From Consolidating ML Models in a Large Scale Recommendation System. I have also added a few questions I got while reading it. I end the post with what we do at work to deal with this.]]></summary></entry><entry><title type="html">Document Your Progress at Work</title><link href="https://trigonaminima.github.io/2025/01/document-your-progress/" rel="alternate" type="text/html" title="Document Your Progress at Work" /><published>2025-01-13T00:00:00+00:00</published><updated>2025-01-13T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2025/01/document-your-progress</id><content type="html" xml:base="https://trigonaminima.github.io/2025/01/document-your-progress/"><![CDATA[<p><strong>How can you ensure that your contributions are also recognized?</strong></p>

<p>A common challenge, especially in larger organizations, is that your manager may not always be fully aware of the specifics of your work, and your manager’s manager likely has even less visibility. It isn’t due to a lack of interest but rather the sheer volume of responsibilities and information they handle. Additionally, even for you, it’s hard to remember all the details beyond the highlights. I find a proactive strategy essential for such scenarios: sending <strong>regular progress digests.</strong></p>

<p>These digests are concise, structured email updates that you send periodically to both your direct manager and their manager. The aim is to offer a clear snapshot of your activities, their impact, and your forthcoming plans. See it as a method to keep your supervisors well-informed, especially when you lack regular direct interactions.</p>

<p>That’s it. That is the idea. You can be creative and apply it however you want. However you decide to do it, you will see gains.</p>

<p>In the next section, I list the <strong>key points</strong> I usually consider in my snapshots.</p>

<h2 id="key-elements-of-an-effective-progress-digest">Key Elements of an Effective Progress Digest</h2>

<p>To ensure your digests are both informative and impactful, here’s what you can include:</p>

<ul>
  <li><strong>Specific Task Details</strong>: Provide project specifics and relevant links to the completed/picked coding tasks. It entails a 1-sentence project description, PR links, JIRA tickets and other code artefacts.</li>
  <li><strong>Data Science Related</strong>: If applicable, detail the models you’ve trained and deployed. Any A/B experiments launched and test results of the ones that concluded. Also, share the project solutioning doc here.</li>
  <li><strong>Documentation Efforts</strong>: Highlight any documentation you’ve created or maintained. You can also merge this with other points.</li>
  <li><strong>Impact and Results</strong>: Clearly articulate the outcomes of your tasks and their value to the team and company.</li>
  <li><strong>Initiatives and Discussions</strong>: Share any new ideas you’ve put forward or discussions you’ve initiated.</li>
  <li><strong>Future Plans</strong>: Outline your planned next steps.</li>
</ul>

<h2 id="benefits">Benefits</h2>

<p>The effort invested in creating these digests yields substantial career benefits:</p>

<ul>
  <li><strong>Enhances Diligence</strong>: Summarizing your work makes you more conscious of your efforts.</li>
  <li><strong>Boosts Positive Perception</strong>: You are perceived as a proactive and accomplished individual.</li>
  <li><strong>Creates a Performance Record</strong>: These digests serve as documentation of your work, which is especially useful during performance reviews.</li>
  <li><strong>Ensures Visibility</strong>: Even if managers don’t respond directly to each email, they will read them, which ensures they are aware of your work and its progress.</li>
  <li><strong>Effective at Any Stage:</strong> While this practice is advantageous when starting a new job (or joining a new team), I have found it beneficial at any stage.</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>Actively managing your visibility is key to long-term career growth. Sending out regular progress digests ensures that your work is recognized. You also establish a record of your accomplishments and demonstrate your value. This practice requires regular work but has good returns.</p>

<p>PS: I learned this trick on a tech podcast many years ago. If anyone knows which podcast or episode, please share it with me, and I will link it here.</p>

<p><strong>Update: 14th Jan</strong></p>

<p>PS: A related idea of <a href="https://jvns.ca/blog/brag-documents/">brag documents</a> explained beautifully by <a href="https://jvns.ca/">Julia Evans</a>. Shared on this <a href="https://news.ycombinator.com/item?id=42695837">HN comment</a>.</p>]]></content><author><name>Shivam Rana</name></author><summary type="html"><![CDATA[How can you ensure that your contributions are also recognized?]]></summary></entry><entry><title type="html">Confidence Intervals and Coverage</title><link href="https://trigonaminima.github.io/2024/09/confidence-intervals-and-coverage/" rel="alternate" type="text/html" title="Confidence Intervals and Coverage" /><published>2024-09-15T00:00:00+00:00</published><updated>2024-09-15T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2024/09/confidence-intervals-and-coverage</id><content type="html" xml:base="https://trigonaminima.github.io/2024/09/confidence-intervals-and-coverage/"><![CDATA[<h2 id="confidence-interval-ci">Confidence Interval (CI)</h2>

<ul>
  <li>CI is an interval.</li>
  <li>An interval which is expected to contain the parameter being estimated (eg: population mean.)</li>
  <li>Typical confidence levels are 95% and 99%.</li>
  <li>The confidence level of a confidence interval is called Nominal coverage (probability.)</li>
  <li>CI with 95% confidence: random interval which contains the parameter to be estimated 95% of the time.</li>
  <li>Two ways to mention a confidence level of 95%:
    <ul>
      <li>Confidence interval with \(\gamma = 0.95\); 95% confidence</li>
      <li>Confidence interval with \(\alpha = 0.05\); 95% confidence: \(1-\alpha = 0.95\)</li>
    </ul>
  </li>
  <li>
    <p>Mathematical representation</p>

\[P(u(X)&lt;\theta &lt;v(X))=\gamma\]

    <ul>
      <li>\(\theta\) is the parameter to be estimated (eg: population mean or median).</li>
      <li>\(X\) is a random variable from a probability distribution with parameter \(\theta\)</li>
      <li>\(u(X)\) and \(v(X)\) are random variables that form the lower and upper bounds of the interval, which contains the parameter \(\theta\) with probability \(\gamma\)</li>
      <li>Confidence level \(\gamma\) &lt; 1 (but close to 1). eg: 0.95</li>
    </ul>
  </li>
  <li>
    <p>Mathematical representation in case of normal distribution:</p>

\[\text{CI} = \bar{x} \pm z^* \left(\frac{\sigma}{\sqrt{n}}\right)\]

    <p>Where:</p>
    <ul>
      <li>\(\bar{x}\) is the sample mean.</li>
      <li>\(z^*\) is the critical value corresponding to the desired confidence level</li>
      <li>\(\sigma\) is the population standard deviation.</li>
      <li>\(n\) is the sample size.</li>
      <li>The quantity \(\displaystyle {\sigma }_{\bar {x}}={\frac {\sigma }{\sqrt {n}}}\) is also called the <a href="https://en.wikipedia.org/wiki/Standard_error">standard error of the mean</a>.</li>
      <li>For a 95% confidence level, \(z^*\) is the 97.5th percentile of the distribution. Reason: the probability of \(\theta\) lying outside the 95% confidence interval is 5%, which gives 2.5% on each side (if symmetric). So the range runs from the 2.5th percentile to the (95+2.5) = 97.5th percentile.</li>
    </ul>
  </li>
  <li>We can calculate the critical value \(z^*\) as follows:
    <ul>
      <li>If the sample size is small (&lt; 30) or we do not know the std dev, then we use the t-statistic (Student’s t-distribution.) The t-distribution is wider and has heavier tails than the normal distribution, reflecting the increased uncertainty in small samples. Thus, it accounts for the extra variability.</li>
      <li>If the sample size is large enough to make CLT valid, we use normal distribution (Z-distribution) –&gt; z-score. Eg: a z-score of 1.96 for a 95% confidence level.</li>
      <li>As the sample size increases, both methods converge.</li>
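      <li>When in doubt, it is better to use the t-statistic: it accounts for the extra uncertainty in small samples and converges to the z-score for large ones.</li>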
    </ul>
  </li>
  <li>Ref:
    <ul>
      <li><a href="https://en.wikipedia.org/wiki/Confidence_interval">Confidence interval</a> wiki</li>
      <li><a href="https://en.wikipedia.org/wiki/Confidence_interval#Interpretation">Interpretation</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Confidence_interval#Common_misunderstandings">Common misunderstandings</a></li>
    </ul>
  </li>
</ul>
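
<p>To make the 97.5th percentile reasoning concrete, here is a minimal sketch that maps a confidence level to its two-sided critical value. The helper name <code class="language-plaintext highlighter-rouge">z_critical</code> is ours (hypothetical), and it assumes a symmetric interval under the standard normal distribution.</p>

```python
import scipy.stats as st


def z_critical(confidence):
    """Two-sided critical value z* for a given confidence level."""
    # Split the leftover probability (alpha) equally between both tails,
    # so the upper cut-off sits at the (1 + confidence) / 2 quantile.
    upper_quantile = (1 + confidence) / 2
    return st.norm.ppf(upper_quantile)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_critical(level):.3f}")
```

<p>For the 95% level, this gives the familiar z-score of about 1.96.</p>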

<h2 id="ci-width">CI Width</h2>

<ul>
  <li>The narrower the width, the more precise the estimate (at a given confidence level).</li>
  <li>Factors that impact the width of CI are sample size, variance/standard deviation, and confidence level.
    <ul>
      <li>Sample size high –&gt; narrow CI</li>
      <li>High variance/standard dev –&gt; wider CI</li>
      <li>Higher confidence level –&gt; wider CI (the interval must cover more of the sampling distribution to reach a higher confidence level)</li>
    </ul>
  </li>
</ul>
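
<p>These three effects follow directly from the normal-distribution formula above, where the width is \(2 z^* \sigma / \sqrt{n}\). Below is a small sketch, assuming a known population standard deviation; the function name <code class="language-plaintext highlighter-rouge">ci_width</code> and the example values are ours, chosen only for illustration.</p>

```python
import numpy as np
import scipy.stats as st


def ci_width(sigma, n, confidence=0.95):
    """Width of the z-based CI: 2 * z* * sigma / sqrt(n)."""
    z_star = st.norm.ppf((1 + confidence) / 2)
    return 2 * z_star * sigma / np.sqrt(n)


base = ci_width(sigma=1.0, n=100)
print("baseline:             ", round(base, 4))
print("4x sample size:       ", round(ci_width(sigma=1.0, n=400), 4))
print("2x standard deviation:", round(ci_width(sigma=2.0, n=100), 4))
print("99% confidence:       ", round(ci_width(sigma=1.0, n=100, confidence=0.99), 4))
```

<p>Quadrupling the sample size halves the width, doubling the standard deviation doubles it, and raising the confidence level widens it.</p>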

<h2 id="coverage">Coverage</h2>

<ul>
  <li>Coverage (probability): the probability that a confidence interval will include the true value (eg: population mean.)</li>
  <li>The proportion of CIs (at a particular confidence level) that contain the true value (eg: population mean.)</li>
  <li>95% CI coverage: For example, if you calculate a 95% confidence interval for a population mean, you are saying that if you were to take many samples and calculate a confidence interval from each one, approximately 95% of those intervals would contain the true population mean.</li>
  <li>Probability matching: if coverage probability is the same as nominal coverage probability.
<img src="https://trigonaminima.github.io/assets/2024-09/coverage_probability.png" alt="" width="500" style="text-align: center; margin: auto" />
    <ul>
      <li>Nominal coverage = 50%</li>
      <li>Coverage = 10/20 = 50% (blue CIs contain the true mean)</li>
      <li>Probability matching since coverage is the same as nominal coverage.</li>
      <li><a href="https://en.wikipedia.org/wiki/File:Normal_distribution_50%25_CI_illustration.svg">Image ref</a></li>
    </ul>
  </li>
  <li>Ref:
    <ul>
      <li><a href="https://en.wikipedia.org/wiki/Coverage_probability">Coverage probability</a> wiki</li>
      <li><a href="https://en.wikipedia.org/wiki/Neyman_construction">Confidence interval construction</a></li>
    </ul>
  </li>
</ul>
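
<p>Coverage can also be estimated by simulation. The sketch below is one way to do it, under assumptions of our own: a standard normal population (so the true mean is known to be 0), a sample size of 30, 2000 trials, and a fixed seed. The function name <code class="language-plaintext highlighter-rouge">estimate_coverage</code> is hypothetical.</p>

```python
import numpy as np
import scipy.stats as st


def estimate_coverage(sample_size=30, confidence=0.95, n_trials=2000, seed=0):
    """Estimate the coverage of the t-based CI for a standard normal population."""
    true_mean = 0.0  # known here only because we simulate the population
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        sample = rng.standard_normal(sample_size)
        lower, upper = st.t.interval(
            confidence=confidence,
            df=sample_size - 1,
            loc=np.mean(sample),
            scale=st.sem(sample),
        )
        # A "hit" is an interval that contains the true mean
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_trials


print("nominal coverage:  ", 0.95)
print("estimated coverage:", estimate_coverage())
```

<p>If the estimated coverage lands close to the nominal 95%, the interval is (approximately) probability matching for this setup.</p>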

<h2 id="implementation-and-explorations">Implementation and Explorations</h2>

<p>Now, we will go through the above concepts in code.</p>

<details close="">
<summary>Common imports</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">st</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>
<p><br /></p>

<h3 id="compute-ci">Compute CI</h3>

<p>We implement two versions of the CI calculation: one gets the critical value from the t-distribution and the other from the standard normal distribution.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">confidence_interval_t</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span> <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">):</span>
    <span class="s">"""
    Calculate the confidence interval for the mean of a sample using the t-distribution.

    This function is appropriate when the population standard deviation is unknown and
    the sample size is small (n &lt; 30), although it works for any sample size.

    Parameters:
    sample (numpy.ndarray): The sample data as a NumPy array.
    confidence (float): The desired confidence level (default
    is 0.95 for a 95% confidence interval).

    Returns:
    tuple: Lower and upper bounds of the confidence interval.
    """</span>
    <span class="c1"># Ensure the sample is a NumPy array
</span>    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>

    <span class="n">sample_mean</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="c1"># Use Bessel's correction (ddof=1) for sample standard deviation
</span>    <span class="n">sample_std</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">sample_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="n">standard_error</span> <span class="o">=</span> <span class="n">sample_std</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">sample_size</span><span class="p">)</span>

    <span class="c1"># Determine the critical value for the specified confidence level
</span>    <span class="n">critical_value</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">ppf</span><span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="n">confidence</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="n">sample_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">margin_of_error</span> <span class="o">=</span> <span class="n">critical_value</span> <span class="o">*</span> <span class="n">standard_error</span>

    <span class="n">lower_bound</span> <span class="o">=</span> <span class="n">sample_mean</span> <span class="o">-</span> <span class="n">margin_of_error</span>
    <span class="n">upper_bound</span> <span class="o">=</span> <span class="n">sample_mean</span> <span class="o">+</span> <span class="n">margin_of_error</span>

    <span class="k">return</span> <span class="n">lower_bound</span><span class="p">,</span> <span class="n">upper_bound</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>We use <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html"><code class="language-plaintext highlighter-rouge">st.t.ppf</code></a> to get the critical value using the t-distribution. We can replace that with <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html"><code class="language-plaintext highlighter-rouge">st.norm.ppf</code></a> for the z-score.</p>

<details close="">
<summary>Confidence interval using standard normal distribution</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">confidence_interval_norm</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span> <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">):</span>
    <span class="s">"""
    Calculate the confidence interval for the mean of a sample using the normal
    distribution (Z-distribution).

    This function is appropriate when the population standard deviation is
    known, or when the sample size is large (n &gt;= 30), allowing the
    Central Limit Theorem to approximate the sample mean's distribution as normal.

    Parameters:
    sample (numpy.ndarray): The sample data as a NumPy array.
    confidence (float): The desired confidence level (default
    is 0.95 for a 95% confidence interval).

    Returns:
    tuple: Lower and upper bounds of the confidence interval.
    """</span>
    <span class="c1"># Ensure the sample is a NumPy array
</span>    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>

    <span class="n">sample_mean</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="c1"># Use Bessel's correction (ddof=1) for sample standard deviation
</span>    <span class="n">sample_std</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">sample_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="n">standard_error</span> <span class="o">=</span> <span class="n">sample_std</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">sample_size</span><span class="p">)</span>

    <span class="c1"># Determine the critical value for the specified confidence level
</span>    <span class="n">critical_value</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">ppf</span><span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="n">confidence</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">margin_of_error</span> <span class="o">=</span> <span class="n">critical_value</span> <span class="o">*</span> <span class="n">standard_error</span>

    <span class="n">lower_bound</span> <span class="o">=</span> <span class="n">sample_mean</span> <span class="o">-</span> <span class="n">margin_of_error</span>
    <span class="n">upper_bound</span> <span class="o">=</span> <span class="n">sample_mean</span> <span class="o">+</span> <span class="n">margin_of_error</span>

    <span class="k">return</span> <span class="n">lower_bound</span><span class="p">,</span> <span class="n">upper_bound</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>
<p><br />
Let’s compare the results with scipy implementations.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random_sample</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"defined functions:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"tstat:</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">confidence_interval_t</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"norm:</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">confidence_interval_norm</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span>

<span class="k">print</span><span class="p">(</span><span class="s">"scipy functions:"</span><span class="p">)</span>
<span class="n">interval</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
    <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"tstat:</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">interval</span><span class="p">)</span>

<span class="n">interval</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span><span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"norm:</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">interval</span><span class="p">)</span>

<span class="c1"># defined functions:
# tstat: (0.2756144976802315, 0.7458592632198344)
# norm:	 (0.3070236240157737, 0.7144501368842922)
# scipy functions:
# tstat: (0.2756144976802315, 0.7458592632198344)
# norm:	 (0.3070236240157737, 0.7144501368842922)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>The results match. In the following sections, we will use the scipy functions.</p>

<h3 id="ci-t-distribution-vs-ci-z-distribution">CI T-Distribution vs. CI Z-Distribution</h3>

<p>We will verify whether the confidence intervals from both methods converge as the sample size increases. We will try both on samples generated using the following sampling methods:</p>

<ul>
  <li>Uniform</li>
  <li>Standard normal</li>
  <li>Poisson</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">sample_sizes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">sample_means</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">t_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">norm_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">norm_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">sample_size</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">5</span><span class="p">):</span>
    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random_sample</span><span class="p">(</span><span class="n">sample_size</span><span class="p">)</span>
    <span class="n">t_interval_95</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">norm_interval_95</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">sample_sizes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">sample_size</span><span class="p">)</span>
    <span class="n">sample_means</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span>
    <span class="n">t_interval_95_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">t_interval_95_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">norm_interval_95_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">norm_interval_95</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">norm_interval_95_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">norm_interval_95</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="c1"># print(sample_size, t_interval_95, norm_interval_95)
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span>
    <span class="n">sample_sizes</span><span class="p">,</span> <span class="n">t_interval_95_l</span><span class="p">,</span> <span class="n">t_interval_95_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"b"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span>
    <span class="n">sample_sizes</span><span class="p">,</span> <span class="n">norm_interval_95_l</span><span class="p">,</span> <span class="n">norm_interval_95_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sample size"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CI upper and lower bounds"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Uniform Distribution"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Replace <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html"><code class="language-plaintext highlighter-rouge">np.random.random_sample</code></a> with <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_normal.html"><code class="language-plaintext highlighter-rouge">np.random.standard_normal</code></a> and <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.poisson.html"><code class="language-plaintext highlighter-rouge">np.random.poisson</code></a> to get standard normal and the Poisson random samples.</p>
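
<p>One thing to watch when swapping samplers: <code class="language-plaintext highlighter-rouge">np.random.poisson</code> takes the rate <code class="language-plaintext highlighter-rouge">lam</code> as its first argument, so the sample size has to be passed as <code class="language-plaintext highlighter-rouge">size</code>. Here is a rough sketch of how the three samplers could be wrapped behind the same call. The <code class="language-plaintext highlighter-rouge">samplers</code> dictionary and the choice of <code class="language-plaintext highlighter-rouge">lam=1.0</code> are our assumptions, not values taken from the plots.</p>

```python
import numpy as np

# Hypothetical helper mapping; lam=1.0 for the Poisson case is an assumed value.
samplers = {
    "Uniform": lambda n: np.random.random_sample(n),
    "Standard normal": lambda n: np.random.standard_normal(n),
    # np.random.poisson takes the rate first, so pass the size by keyword
    "Poisson": lambda n: np.random.poisson(lam=1.0, size=n),
}

for name, draw in samplers.items():
    sample = draw(10)
    print(name, sample.shape, round(float(np.mean(sample)), 3))
```

<p>Inside the loop above, <code class="language-plaintext highlighter-rouge">sample = draw(sample_size)</code> would then replace the <code class="language-plaintext highlighter-rouge">np.random.random_sample</code> call.</p>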

<p>Here are the results:</p>

<figure class="third t_vs_std_norm gallery-popup">
  
  
  <a href="/assets/2024-09/t_norm_ci_uniform_sample.png" title="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a uniform distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough." data-count="3" aria-label="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a uniform distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough.">
    <img src="/assets/2024-09/t_norm_ci_uniform_sample.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2024-09/t_norm_ci_normal_sample.png" title="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a normal distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough." aria-label="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a normal distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough.">
    <img src="/assets/2024-09/t_norm_ci_normal_sample.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2024-09/t_norm_ci_poisson_sample.png" title="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a Poisson distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough." aria-label="Blue part is t-distribution. Red part is standard normal distribution. For each sample drawn from a Poisson distribution, the CI bounds are plotted. Blue is wider than red and then they merge as sample size gets large enough.">
    <img src="/assets/2024-09/t_norm_ci_poisson_sample.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>

<p>In each figure (click to zoom), the blue part corresponds to the t-distribution-based CI, and the red part corresponds to the standard-normal-based CI. We can observe that:</p>

<ol>
  <li>t-distribution-based CIs are wider than standard-normal-based CIs.</li>
  <li>As the sample size increases, both converge.</li>
</ol>
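
<p>Both observations can also be checked numerically. For the same sample, the two intervals share the mean and the standard error, so the ratio of their widths is just the ratio of the critical values. A minimal sketch, with sample sizes picked by us for illustration:</p>

```python
import scipy.stats as st

confidence = 0.95
upper_quantile = (1 + confidence) / 2
z_star = st.norm.ppf(upper_quantile)

# The ratio t* / z* equals the ratio of the two CI widths,
# because both intervals share the same mean and standard error.
ratios = {}
for n in (5, 10, 30, 100, 1000):
    t_star = st.t.ppf(upper_quantile, df=n - 1)
    ratios[n] = t_star / z_star
    print(f"n={n}: t*={t_star:.3f}, z*={z_star:.3f}, ratio={ratios[n]:.3f}")
```

<p>The ratio stays above 1 (t-based CIs are wider) and approaches 1 as the sample size grows.</p>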

<h3 id="ci-width-simulations">CI Width Simulations</h3>

<p>Let’s visualise how the CI width changes with different factors: confidence level, sample size, and standard deviation or variance.</p>

<h4 id="confidence-level---gamma">Confidence Level - \(\gamma\)</h4>

<p>The width of the CI increases as the confidence level increases. Intuition: to be more confident that the interval contains the parameter, we have to push the lower and upper limits further apart.</p>

<details collapse="">
<summary>Code</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="n">sample_mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
<span class="n">sample_sem</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>

<span class="n">cis</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_r</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ci</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">5</span><span class="p">):</span>
    <span class="n">ci</span> <span class="o">=</span> <span class="n">ci</span> <span class="o">*</span> <span class="mf">0.01</span>
    <span class="n">t_interval</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="n">ci</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">sample_mean</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">sample_sem</span>
    <span class="p">)</span>
    <span class="n">cis</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">ci</span><span class="p">)</span>
    <span class="n">t_interval_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">t_interval_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">cis</span><span class="p">,</span> <span class="n">t_interval_l</span><span class="p">,</span> <span class="n">t_interval_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"g"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"CI level"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CI upper and lower bounds"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class=" ci_width_ci gallery-popup">
  
  
  <a href="/assets/2024-09/ci_level_with_ci.png" title="As the confidence level increases from 0 to 100%, the confidence interval widens. When the confidence level is 0, there will not be any CI. When the confidence level is 100%, the CI will contain all the data." aria-label="As the confidence level increases from 0 to 100%, the confidence interval widens. When the confidence level is 0, there will not be any CI. When the confidence level is 100%, the CI will contain all the data.">
    <img src="/assets/2024-09/ci_level_with_ci.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>

<h4 id="sample-size">Sample Size</h4>

<p>The width of the CI shrinks as the sample size increases. Intuition: as the sample size increases, our estimate of the normal distribution's parameters becomes more precise (the standard error of the mean falls in proportion to \(1/\sqrt{n}\)), and thus, the confidence interval narrows.</p>

<details collapse="">
<summary>Code</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">sample_sizes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">sample_means</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">t_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">norm_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">norm_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">sample_size</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10000</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">)</span>
    <span class="n">t_interval_95</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">sample_sizes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">sample_size</span><span class="p">)</span>
    <span class="n">sample_means</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span>
    <span class="n">t_interval_95_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">t_interval_95_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span>
    <span class="n">sample_sizes</span><span class="p">,</span> <span class="n">t_interval_95_l</span><span class="p">,</span> <span class="n">t_interval_95_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sample size"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CI upper and lower bounds"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class=" ci_width_sample_size gallery-popup">
  
  
  <a href="/assets/2024-09/ci_vs_sample_size.png" title="As the sample size increases, the confidence interval narrows and concentrates around the true mean, 0.0." aria-label="As the sample size increases, the confidence interval narrows and concentrates around the true mean, 0.0.">
    <img src="/assets/2024-09/ci_vs_sample_size.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>
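
<p>To make the intuition concrete, here is a minimal sketch of the half-width of a t-based interval, \(t^{*} \cdot s / \sqrt{n}\), for a fixed standard deviation. The helper name <code class="language-plaintext highlighter-rouge">t_half_width</code> and the fixed value of 1.0 for the standard deviation are my own assumptions for illustration; they are not part of the simulation above.</p>

```python
import math

import scipy.stats as st


def t_half_width(std, n, confidence=0.95):
    """Half-width of a t-based CI for the mean: t_crit * std / sqrt(n)."""
    # Two-sided critical value of Student's t with n - 1 degrees of freedom
    t_crit = st.t.ppf((1 + confidence) / 2, df=n - 1)
    return t_crit * std / math.sqrt(n)


# Assumption: the sample standard deviation stays fixed at 1.0
for n in (10, 100, 1000, 10000):
    print(n, round(t_half_width(1.0, n), 4))

# Quadrupling the sample size should roughly halve the half-width
ratio = t_half_width(1.0, 100) / t_half_width(1.0, 400)
print(ratio)
```

<p>The ratio is not exactly 2 because the t critical value also shrinks slightly as the degrees of freedom grow, but the \(1/\sqrt{n}\) term dominates.</p>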

<h4 id="standard-deviation-variance">Standard Deviation (Variance)</h4>

<p>As the variance increases, we have more dispersion in the data. As with a higher confidence level, that leads to a wider CI.</p>

<details collapse="">
<summary>Code</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">stds</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">sample_means</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">t_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">norm_interval_95_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">norm_interval_95_r</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">std</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">1</span><span class="p">):</span>
    <span class="n">std</span> <span class="o">=</span> <span class="n">std</span> <span class="o">*</span> <span class="mf">0.01</span>
    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">std</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
    <span class="n">t_interval_95</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span> <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">stds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">std</span><span class="p">)</span>
    <span class="n">t_interval_95_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">t_interval_95_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval_95</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">stds</span><span class="p">,</span> <span class="n">t_interval_95_l</span><span class="p">,</span> <span class="n">t_interval_95_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Standard Deviation"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CI upper and lower bounds"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class=" ci_width_std gallery-popup">
  
  
  <a href="/assets/2024-09/ci_vs_stddev.png" title="The x-axis shows the standard deviation of a normal distribution going from 0 to 2. As the standard deviation increases (meaning more variance in the data), the 95% confidence interval on the sample widens, and the estimate becomes less precise." aria-label="The x-axis shows the standard deviation of a normal distribution going from 0 to 2. As the standard deviation increases (meaning more variance in the data), the 95% confidence interval on the sample widens, and the estimate becomes less precise.">
    <img src="/assets/2024-09/ci_vs_stddev.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>
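
<p>One way to see that the width scales with the spread: take a single sample, double every value, and compare the two intervals. This is only a sketch; the helper <code class="language-plaintext highlighter-rouge">t_interval_width</code> and the seed are assumptions of mine, not something used elsewhere in this post.</p>

```python
import numpy as np
import scipy.stats as st


def t_interval_width(sample, confidence=0.95):
    """Width of the t-based CI for the mean of `sample`."""
    low, high = st.t.interval(
        confidence, df=len(sample) - 1, loc=np.mean(sample), scale=st.sem(sample)
    )
    return high - low


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
base = rng.standard_normal(1000)  # standard deviation close to 1

width_1x = t_interval_width(base)
width_2x = t_interval_width(2 * base)  # same sample with its spread doubled
print(width_1x, width_2x, width_2x / width_1x)
```

<p>Because the standard error is proportional to the sample standard deviation, doubling the spread doubles the width when the sample size and confidence level stay the same.</p>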

<h3 id="probability-matching">Probability Matching</h3>

<p>The actual coverage probability will not always be the same as the nominal coverage probability. When the two match, we get probability matching. In the figure below, out of 100 confidence intervals, 7 CIs do not contain the true mean (black line). Thus, we get a coverage of 93%, which is not the same as the nominal 95%; hence, no probability matching in this run.</p>

<details collapse="">
<summary>Code</summary>


<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">population</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">100000</span><span class="p">)</span>

<span class="n">ci_ids</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_l</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t_interval_r</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">ci_contains_true_value</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">ci_level</span> <span class="o">=</span> <span class="mf">0.95</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">population</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">t_interval</span> <span class="o">=</span> <span class="n">st</span><span class="p">.</span><span class="n">t</span><span class="p">.</span><span class="n">interval</span><span class="p">(</span>
        <span class="n">confidence</span><span class="o">=</span><span class="n">ci_level</span><span class="p">,</span>
        <span class="n">df</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span>
        <span class="n">loc</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span>
        <span class="n">scale</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">sem</span><span class="p">(</span><span class="n">sample</span><span class="p">),</span>
    <span class="p">)</span>
    <span class="n">ci_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">t_interval_l</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">t_interval_r</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t_interval</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>

    <span class="k">if</span> <span class="n">t_interval</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">t_interval</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
        <span class="n">ci_contains_true_value</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">ci_contains_true_value</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">"g"</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">1</span> <span class="k">else</span> <span class="s">"red"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ci_contains_true_value</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">ci_ids</span><span class="p">,</span> <span class="n">t_interval_l</span><span class="p">,</span> <span class="n">t_interval_r</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CI upper and lower bounds"</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">([])</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
    <span class="sa">f</span><span class="s">"Coverage = </span><span class="si">{</span><span class="nb">sum</span><span class="p">(</span><span class="n">ci_contains_true_value</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">ci_contains_true_value</span><span class="p">)</span><span class="si">}</span><span class="s">%"</span>
    <span class="sa">f</span><span class="s">" (Nominal Coverage = </span><span class="si">{</span><span class="n">ci_level</span><span class="o">*</span><span class="mi">100</span><span class="si">}</span><span class="s">%)"</span>
<span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class=" coverage_probab gallery-popup">
  
  
  <a href="/assets/2024-09/coverage_probability_2.png" title="Each line represents a 95% CI on a random sample. Out of 100 confidence intervals, 7 CIs (marked in red) do not contain the true mean (black line). Thus, we get a coverage of 93%, which is not the same as 95%, hence no probability matching." aria-label="Each line represents a 95% CI on a random sample. Out of 100 confidence intervals, 7 CIs (marked in red) do not contain the true mean (black line). Thus, we get a coverage of 93%, which is not the same as 95%, hence no probability matching.">
    <img src="/assets/2024-09/coverage_probability_2.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>
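
<p>With only 100 intervals, the observed coverage is itself a noisy estimate. As a rough sketch (the seed and the simulation sizes below are arbitrary choices of mine, and I draw directly from \(\mathcal{N}(0, 1)\) instead of a finite population), repeating the experiment with many more intervals shows how close the empirical coverage gets to the nominal 95%.</p>

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)  # arbitrary seed
ci_level = 0.95
n_intervals, sample_size = 5000, 50  # hypothetical simulation sizes

# Each row is one independent sample from N(0, 1), so the true mean is 0
samples = rng.standard_normal(size=(n_intervals, sample_size))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(sample_size)
t_crit = st.t.ppf((1 + ci_level) / 2, df=sample_size - 1)

# An interval contains 0 when its half-width is at least |sample mean|
contains_true_mean = t_crit * sems >= np.abs(means)
coverage = contains_true_mean.mean()

# Standard error of a coverage estimate that uses only 100 intervals
se_100 = np.sqrt(ci_level * (1 - ci_level) / 100)
print(f"coverage over {n_intervals} intervals: {coverage:.3f}")
print(f"one standard error with 100 intervals: {se_100:.3f}")
```

<p>The standard error with 100 intervals is about 2 percentage points, so a run that lands on 93% is within the noise we should expect from such a small experiment.</p>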

<h3 id="conclusion">Conclusion</h3>

<p>A confidence interval is an interval computed from a sample that is expected to contain the distribution parameter we are trying to estimate (e.g., the mean). That means that not all CIs will contain the true mean. Relative to the population mean, a sample can fall into one of the following scenarios:</p>

<ol>
  <li>It does not contain the mean</li>
  <li>It contains the mean somewhere in the middle</li>
  <li>It contains the mean but as an outlier</li>
</ol>

<p>Since the confidence interval is built from the sample using a normal distribution, it may not contain the population mean in the 1st or the 3rd scenario. That is why we take a confidence level of 95% or more: the wider interval handles the 3rd scenario (demonstrated in the simulation section).</p>

<p>Since narrower confidence intervals are more informative at a given confidence level, we should try to</p>

<ul>
  <li>pick a high confidence level (95% or more), keeping in mind that a higher level widens the interval;</li>
  <li>have a bigger sample size; and</li>
  <li>have less variance in the data.</li>
</ul>

<p><br /></p>]]></content><author><name>Shivam Rana</name></author><category term="ML" /><summary type="html"><![CDATA[Confidence Interval (CI)]]></summary></entry><entry><title type="html">Lognormal to Normal Distribution</title><link href="https://trigonaminima.github.io/2024/01/lognormal-to-normal/" rel="alternate" type="text/html" title="Lognormal to Normal Distribution" /><published>2024-01-14T00:00:00+00:00</published><updated>2024-01-14T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2024/01/lognormal-to-normal</id><content type="html" xml:base="https://trigonaminima.github.io/2024/01/lognormal-to-normal/"><![CDATA[<p>The Normal and lognormal distributions are fundamental concepts in statistics. I recently used the relationship between these two distributions in a project. In this blog post, I want to share what I learned.</p>

<p>Outline</p>

<ol>
  <li><a href="#dist">Normal &amp; Lognormal Distributions</a></li>
  <li><a href="#log2normal">Lognormal to Normal</a></li>
  <li><a href="#normal2log">Normal to Lognormal</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ol>

<h2 id="normal--lognormal-distributions">Normal &amp; Lognormal Distributions<a name="dist"></a></h2>

<p>The normal distribution is also called the bell curve or Gaussian distribution. The position of the bell's peak represents the mean, and the width of the bell represents the spread of values (standard deviation). Thus, the shape changes as we change mu (\(\mu\)) and sigma (\(\sigma\)). Here, \(\mu\) is the mean of the distribution, and \(\sigma\) is its standard deviation. We denote a normal distribution as:</p>

\[{\mathcal {N}}(\mu ,\sigma ^{2})\]

<p>Find more details about the normal distribution on <a href="https://en.wikipedia.org/wiki/Normal_distribution">Wikipedia</a>. Here are two ways of defining a normal distribution in Python.</p>

<ul>
  <li>Using python stdlib</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">from</span> <span class="nn">statistics</span> <span class="kn">import</span> <span class="n">NormalDist</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">norm_dist</span> <span class="o">=</span> <span class="n">NormalDist</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<ul>
  <li>Using scipy</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">norm_dist</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
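
<p>Both objects describe the same distribution, so they should agree on the density and the cumulative probability at any point. Here is a small check, assuming the same mu and sigma as above; the variable names <code class="language-plaintext highlighter-rouge">std_dist</code> and <code class="language-plaintext highlighter-rouge">scipy_dist</code> and the evaluation point are mine.</p>

```python
from statistics import NormalDist

import scipy.stats as stats

mu, sigma = 5, 0.5
std_dist = NormalDist(mu, sigma)    # Python stdlib
scipy_dist = stats.norm(mu, sigma)  # scipy "frozen" distribution

x = 5.5  # one standard deviation above the mean
pdf_std, pdf_scipy = std_dist.pdf(x), scipy_dist.pdf(x)
cdf_std, cdf_scipy = std_dist.cdf(x), scipy_dist.cdf(x)
print(pdf_std, pdf_scipy)
print(cdf_std, cdf_scipy)
```

<p>The stdlib version works on one value at a time, while the scipy version also accepts NumPy arrays, which is handy for plotting.</p>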

<p><br /></p>

<p>We get a lognormal distribution when we exponentiate a normally distributed variable. The result is a lopsided (right-skewed) curve. It means that there is a longer tail on the right side, where larger values occur. We denote the lognormal distribution as follows:</p>

\[{\displaystyle \ X\sim \operatorname {Lognormal} \left(\ \mu _{x},\sigma _{x}^{2}\ \right)\ }\]

<p>Since the log of the lognormal distribution is a normal distribution, we can denote the relationship as follows:</p>

\[{\displaystyle \ln(X)\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\]

<p>Find more details about the lognormal distribution on <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">Wikipedia</a>. We define a lognormal distribution in Python as follows. The Python stdlib does not have a lognormal implementation.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">lognorm_dist</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Note: the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html"><code class="language-plaintext highlighter-rouge">scipy.stats.lognorm</code></a> is parameterized by the mu and sigma of the underlying <em>normal distribution</em> from which we derive the lognormal distribution. Sigma goes in as the shape parameter <code class="language-plaintext highlighter-rouge">s</code>, and for the <code class="language-plaintext highlighter-rouge">scale</code> parameter, we pass the exponential of the mean of the normal distribution. I found the documentation inadequate in explaining the parameters. This <a href="https://stackoverflow.com/q/8870982/2650427">SO question</a> has answers that discuss the meaning of the parameters.</p>
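
<p>A quick way to convince yourself of this parameterization is to compare the frozen lognormal against known closed forms: the median should be \(e^{\mu}\), the mean should be \(e^{\mu + \sigma^{2}/2}\), and its CDF at \(x\) should equal the normal CDF at \(\ln(x)\). This is a sketch under the same mu and sigma as above; the evaluation point <code class="language-plaintext highlighter-rouge">x = 200.0</code> is an arbitrary choice of mine.</p>

```python
import numpy as np
import scipy.stats as stats

mu, sigma = 5, 0.5
lognorm_dist = stats.lognorm(s=sigma, scale=np.exp(mu))
norm_dist = stats.norm(mu, sigma)

# Closed-form properties of a lognormal built from N(mu, sigma^2)
median_ok = np.isclose(lognorm_dist.median(), np.exp(mu))
mean_ok = np.isclose(lognorm_dist.mean(), np.exp(mu + sigma**2 / 2))

# P(X at most x) for the lognormal equals P(ln X at most ln x) for the normal
x = 200.0  # arbitrary evaluation point
cdf_ok = np.isclose(lognorm_dist.cdf(x), norm_dist.cdf(np.log(x)))
print(median_ok, mean_ok, cdf_ok)
```

<p>If <code class="language-plaintext highlighter-rouge">scale</code> were set to mu directly instead of its exponential, the median check would be the first one to fail.</p>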

<p><br />
Here is how both the distributions look for the same mu (\(\mu\)) and sigma (\(\sigma\)).</p>

<details close="">
<summary>Code to generate the below plot.</summary>


<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">from</span> <span class="nn">statistics</span> <span class="kn">import</span> <span class="n">NormalDist</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>


<span class="c1"># all distributions
</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">norm_d1</span> <span class="o">=</span> <span class="n">NormalDist</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">lognorm_d1</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
<span class="n">lognorm_d1</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="n">lognorm_d1</span><span class="p">.</span><span class="n">sigma</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span>

<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span>
<span class="n">norm_d2</span> <span class="o">=</span> <span class="n">NormalDist</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">lognorm_d2</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
<span class="n">lognorm_d2</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="n">lognorm_d2</span><span class="p">.</span><span class="n">sigma</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span>

<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">4</span><span class="p">,</span> <span class="mf">0.3</span>
<span class="n">norm_d3</span> <span class="o">=</span> <span class="n">NormalDist</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">lognorm_d3</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
<span class="n">lognorm_d3</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="n">lognorm_d3</span><span class="p">.</span><span class="n">sigma</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span>

<span class="c1"># norm y
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">norm_y1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">norm_d1</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>
<span class="n">norm_y2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">norm_d2</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>
<span class="n">norm_y3</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">norm_d3</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>

<span class="c1"># lognorm y
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">800</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">lognorm_y1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">lognorm_d1</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>
<span class="n">lognorm_y2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">lognorm_d2</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>
<span class="n">lognorm_y3</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">lognorm_d3</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>


<span class="n">x_norm</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>  <span class="c1"># grid of the normal pdfs; x now holds the lognormal grid
</span><span class="n">fig1</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_norm</span><span class="p">,</span> <span class="n">norm_y1</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">norm_d1</span><span class="p">.</span><span class="n">mean</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">norm_d1</span><span class="p">.</span><span class="n">stdev</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_norm</span><span class="p">,</span> <span class="n">norm_y2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">norm_d2</span><span class="p">.</span><span class="n">mean</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">norm_d2</span><span class="p">.</span><span class="n">stdev</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_norm</span><span class="p">,</span> <span class="n">norm_y3</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">norm_d3</span><span class="p">.</span><span class="n">mean</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">norm_d3</span><span class="p">.</span><span class="n">stdev</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">fig2</span><span class="p">,</span> <span class="n">ax2</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lognorm_y1</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">lognorm_d1</span><span class="p">.</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">lognorm_d1</span><span class="p">.</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lognorm_y2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">lognorm_d2</span><span class="p">.</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">lognorm_d2</span><span class="p">.</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lognorm_y3</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">lognorm_d3</span><span class="p">.</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">lognorm_d3</span><span class="p">.</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

<span class="n">fig1</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'norm_dist.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
<span class="n">fig2</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'lognorm_dist.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>


<p>For the normal distribution, instead of evaluating <code>NormalDist.pdf()</code>, we can draw a sample with <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html"><code>numpy.random.Generator.normal</code></a> and plot its histogram. Similarly, for the lognormal distribution, we can draw a sample with <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.lognormal.html"><code>numpy.random.Generator.lognormal</code></a> instead of evaluating <code>stats.lognorm.pdf()</code>.</p>
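<p>A minimal sketch of that sampling approach follows. The seed, the sample size and the bin count are arbitrary choices made for illustration, and <code>np.histogram</code> stands in for the plotted histogram so that the numbers can be inspected without drawing anything.</p>

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

mu, sigma = 5, 1
norm_samples = rng.normal(mu, sigma, 100_000)
lognorm_samples = rng.lognormal(mu, sigma, 100_000)

# density=True rescales the bin counts so the histogram integrates to 1,
# which makes its heights comparable to a pdf curve
heights, edges = np.histogram(norm_samples, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# compare the histogram heights with the exact normal pdf at the bin centres
pdf_at_centers = np.array([NormalDist(mu, sigma).pdf(c) for c in centers])
max_gap = np.abs(heights - pdf_at_centers).max()
print(max_gap)

# the same call works for the lognormal sample
ln_heights, ln_edges = np.histogram(lognorm_samples, bins=50, density=True)
ln_area = (ln_heights * np.diff(ln_edges)).sum()
print(ln_area)
```

<p>With a large enough sample the histogram heights should stay close to the pdf curve, so either approach gives the same picture. Passing the samples to <code>ax.hist(..., density=True)</code> draws the corresponding plot.</p>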

</details>

<figure class="half distributions gallery-popup">
  
  
  <a href="/assets/2024-01/norm_dist.svg" data-count="2" aria-label="Open image 1 of 2">
    <img src="/assets/2024-01/norm_dist.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2024-01/lognorm_dist.svg" aria-label="Open image 2 of 2">
    <img src="/assets/2024-01/lognorm_dist.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Normal and lognormal distributions with different mu and sigma.</figcaption>
  
</figure>

<h2 id="lognormal-to-normal">Lognormal to Normal<a name="log2normal"></a></h2>

<p>As mentioned in the previous section, the logarithm of a lognormally distributed variable is normally distributed. So, if \({\displaystyle X\sim \operatorname {Lognormal} \left(\mu ,\sigma ^{2}\right)}\), then \({\displaystyle \ln(X)\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\). Here \(\mu\) and \(\sigma\) are the mean and standard deviation of \(\ln(X)\), not of \(X\) itself.</p>

<p>Let us verify this with code.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">()</span>

<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">lognorm_samples</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">lognormal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="mi">10000</span><span class="p">)</span>
<span class="c1"># take the log of lognorm samples to derive the normal dist.
</span><span class="n">norm_samples</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">lognorm_samples</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">norm_samples</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="n">norm_samples</span><span class="p">.</span><span class="n">std</span><span class="p">())</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5.005339216906491 0.4934326302969564
</code></pre></div></div>

<p>The sample mean and standard deviation of the derived normal distribution (line 8) closely match the parameters we provided to the lognormal distribution (line 5). The small differences are sampling error.</p>
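<p>One point that is easy to miss: mu and sigma describe the logarithm of the samples, not the samples themselves. The sketch below uses only the standard library to illustrate this. The seeded <code>random.Random</code> generator and the sample size are choices made for illustration and are not part of the snippet above. The raw lognormal samples should have a mean near exp(mu + sigma²/2) and a median near exp(mu).</p>

```python
import math
import random
import statistics

gen = random.Random(0)  # seeded stdlib generator, used only for repeatability

mu, sigma = 5, 0.5
lognorm_samples = [gen.lognormvariate(mu, sigma) for _ in range(100_000)]

# mean and median of the raw (untransformed) lognormal samples
raw_mean = statistics.fmean(lognorm_samples)
raw_median = statistics.median(lognorm_samples)

# closed-form values for a lognormal with log-space parameters mu and sigma
expected_mean = math.exp(mu + sigma**2 / 2)
expected_median = math.exp(mu)

print(raw_mean, expected_mean)
print(raw_median, expected_median)
```

<p>The mean sits above the median because the lognormal distribution is right-skewed. Neither of them equals mu, which only appears once we take the log.</p>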

<details close="">
<summary>Code to generate the below plots</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
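23
24
25
26
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>

<span class="c1"># log normal dist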
</span><span class="n">fig1</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">lognorm_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"orange"</span><span class="p">)</span>

<span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">800</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">lognorm_d</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
<span class="n">lognorm_y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">lognorm_d</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x1</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">lognorm_y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="c1"># normal dist
</span><span class="n">fig2</span><span class="p">,</span> <span class="n">ax2</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">norm_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"orange"</span><span class="p">)</span>

<span class="n">x2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">norm_d</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">norm_y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">norm_d</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x2</span><span class="p">])</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="n">norm_y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">fig1</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'lognorm_dist2.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
<span class="n">fig2</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'norm_dist2.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class="half lognormal_to_normal gallery-popup">
  
  
  <a href="/assets/2024-01/lognorm_dist2.svg" data-count="2" aria-label="Open image 1 of 2">
    <img src="/assets/2024-01/lognorm_dist2.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2024-01/norm_dist2.svg" aria-label="Open image 2 of 2">
    <img src="/assets/2024-01/norm_dist2.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Lognormal to Normal conversion.</figcaption>
  
</figure>

<p>Conclusion: to convert from a lognormal to normal, take the logarithm of the lognormal sample.</p>

<h2 id="normal-to-lognormal">Normal to Lognormal<a name="normal2log"></a></h2>

<p>If the logarithm of a lognormally distributed variable is normally distributed, then the reverse is also true: the exponential of a normally distributed variable follows a lognormal distribution. In notation, if \({\displaystyle Y\sim {\mathcal {N}}(\mu ,\sigma ^{2})}\), then \({\displaystyle \exp(Y)\sim \operatorname {Lognormal} \left(\mu ,\sigma ^{2}\right)}\).</p>
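<p>A short derivation sketch shows why. Write \(X = \exp(Y)\) with \(Y\sim {\mathcal {N}}(\mu ,\sigma ^{2})\), and let \(\Phi\) denote the standard normal CDF (a symbol introduced here only for this derivation). For \(x &gt; 0\):</p>

<p>\({\displaystyle P(X\leq x)=P(Y\leq \ln x)=\Phi \left({\frac {\ln x-\mu }{\sigma }}\right)}\)</p>

<p>Differentiating with respect to \(x\) gives the density:</p>

<p>\({\displaystyle f_{X}(x)={\frac {1}{x\,\sigma {\sqrt {2\pi }}}}\exp \left(-{\frac {(\ln x-\mu )^{2}}{2\sigma ^{2}}}\right)}\)</p>

<p>This is the lognormal pdf with the same \(\mu\) and \(\sigma\) as the normal distribution we started from.</p>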

<p>Let’s again understand this through code.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>

<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">()</span>

<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="p">.</span><span class="mi">5</span>
<span class="n">norm_samples</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="mi">10000</span><span class="p">)</span>

<span class="c1"># take the exp of norm samples to derive the lognormal dist.
</span><span class="n">lognorm_samples</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">norm_samples</span><span class="p">)</span>

<span class="c1"># fit a lognorm distribution to get the mean and std dev
</span><span class="n">shape</span><span class="p">,</span> <span class="n">loc</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">lognorm_samples</span><span class="p">)</span>
<span class="n">mean</span><span class="p">,</span> <span class="n">stddev</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">scale</span><span class="p">),</span> <span class="n">shape</span>
<span class="k">print</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">stddev</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4.984256782660331 0.5067622675605842
</code></pre></div></div>

<p>The parameters recovered from the derived lognormal distribution (line 10) closely match the parameters we provided to the normal distribution (line 6). Note that we used the <code class="language-plaintext highlighter-rouge">scipy.stats.lognorm.fit</code> method to fit a lognormal distribution to the data. It returns three parameters, in this order: <code class="language-plaintext highlighter-rouge">shape</code>, <code class="language-plaintext highlighter-rouge">loc</code> and <code class="language-plaintext highlighter-rouge">scale</code>. The <code class="language-plaintext highlighter-rouge">shape</code> is the standard deviation of the underlying normal distribution. To get its mean, we take the logarithm of the <code class="language-plaintext highlighter-rouge">scale</code>. We did not need a fit when we converted the lognormal to a normal distribution (previous section) because the mean and standard deviation of the normal samples can be computed directly. Read this <a href="https://stackoverflow.com/a/8748722/2650427">SO answer</a> for more details.</p>
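<p>When the location is known to be zero, which holds here because the samples are just the exponential of normal samples, the parameters can also be recovered without a numerical fit. The sketch below uses only the standard library and a seeded generator chosen for illustration. It estimates mu and sigma as the mean and population standard deviation of the log-values. As far as I know, passing <code>floc=0</code> to <code>scipy.stats.lognorm.fit</code> holds the location fixed and should lead to nearly the same numbers.</p>

```python
import math
import random
import statistics

gen = random.Random(0)  # seeded stdlib generator, used only for repeatability

mu, sigma = 5, 0.5
norm_samples = [gen.gauss(mu, sigma) for _ in range(10_000)]
lognorm_samples = [math.exp(y) for y in norm_samples]

# closed-form estimates for a lognormal whose location is fixed at 0:
# the mean and the population standard deviation of the log-values
log_values = [math.log(x) for x in lognorm_samples]
mu_hat = statistics.fmean(log_values)
sigma_hat = statistics.pstdev(log_values)

print(mu_hat, sigma_hat)
```

<p>This is the previous section run in reverse: taking the log brings us back to the normal samples, so their mean and standard deviation are the lognormal parameters.</p>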

<details close="">
<summary>Code to generate the below plots</summary>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
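23
24
25
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="c1"># normal dist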
</span><span class="n">fig1</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">norm_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"orange"</span><span class="p">)</span>

<span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">norm_d</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">norm_y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">norm_d</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x1</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">norm_y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="c1"># lognormal dist
</span><span class="n">fig2</span><span class="p">,</span> <span class="n">ax2</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">lognorm_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"orange"</span><span class="p">)</span>

<span class="n">x2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">800</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="n">lognorm_d</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">lognorm</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">))</span>
<span class="n">lognorm_y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">lognorm_d</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x2</span><span class="p">])</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="n">lognorm_y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"mu = </span><span class="si">{</span><span class="n">mu</span><span class="si">}</span><span class="s">; sigma = </span><span class="si">{</span><span class="n">sigma</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">fig1</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'norm_dist3.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
<span class="n">fig2</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'lognorm_dist3.svg'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'svg'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

</details>

<figure class="half normal_to_lognormal gallery-popup">
  
  
  <a href="/assets/2024-01/norm_dist3.svg" data-count="2" aria-label="Open image 1 of 2">
    <img src="/assets/2024-01/norm_dist3.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2024-01/lognorm_dist3.svg" aria-label="Open image 2 of 2">
    <img src="/assets/2024-01/lognorm_dist3.svg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Normal to Lognormal conversion.</figcaption>
  
</figure>

<p>Conclusion: to convert a normal sample to a lognormal one, take the exponential of each value in the normal sample.</p>
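<p>As a quick sanity check of that conclusion, here is a minimal, dependency-free sketch using only the Python standard library. The values of <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code> and the sample size are illustrative assumptions, not the ones used in the plots above.</p>

```python
import math
import random
import statistics

random.seed(0)          # fixed seed so repeated runs agree
mu, sigma = 0.5, 0.25   # illustrative parameters (assumed, not from the post)

# Draw from a normal distribution, then exponentiate to get lognormal values.
normal_sample = [random.gauss(mu, sigma) for _ in range(50_000)]
lognormal_sample = [math.exp(v) for v in normal_sample]

# Taking the log undoes the conversion, so the normal parameters come back.
recovered = [math.log(v) for v in lognormal_sample]
print(statistics.mean(recovered), statistics.stdev(recovered))
```

<p>The printed mean and standard deviation should land close to <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code>, and every lognormal value is positive.</p>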

<h2 id="conclusion">Conclusion<a name="conclusion"></a></h2>

<p>We started with the Normal and Lognormal distributions and how they are defined in Python. We then converted each distribution into the other. It took me some effort to figure out how to do the conversion. With this post, I have tried to clear up the confusion.</p>

<p>If you are interested in how other distributions look, your search is over. This <a href="https://stackoverflow.com/q/37559470/2650427">SO answer</a> has visualisations of all the distributions available in <a href="http://docs.scipy.org/doc/scipy/reference/stats.html">scipy.stats</a>.</p>

<p><strong>Update: 18th Jan</strong>: Someone asked me the following question on reddit.</p>

<blockquote>
  <p>For what purpose are you converting between normal and lognormal? The two functions share the same parameters but thats about it. ln(data) is a non-destructive transformation but the process can obscure patterns just as often as it reveals them. Certain advanced statistical tests that require a normal distribution cannot necessarily have the results applied to the lognormal data.</p>
</blockquote>

<p>This stranger is correct that patterns are obscured, or rather, that some other patterns come up after the log transformation. In my case, though, it did not matter.</p>

<p>I wanted to match customers with items that fall within their spending range. The formulation was that, given the customer and outlet distributions, I could compute the overlap between them to get the <em>match percentage</em>. This match percentage would then be used on top of the relevance scores.</p>
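<p>To make the idea of a match percentage concrete, here is a rough sketch of one way to compute such an overlap. It is not the production code. It assumes that spends are roughly lognormal, so their logs can be summarised as normal distributions, and it uses the standard library's <code class="language-plaintext highlighter-rouge">statistics.NormalDist.overlap()</code>. The spend amounts and the helper <code class="language-plaintext highlighter-rouge">fit_log_normal</code> are hypothetical.</p>

```python
import math
from statistics import NormalDist

# Hypothetical spend amounts; real data would come from order history.
customer_spend = [120, 150, 180, 200, 260, 310, 400]
outlet_spend = [200, 240, 300, 350, 420, 500, 650]


def fit_log_normal(amounts):
    """Fit a normal distribution to the log of positive amounts."""
    return NormalDist.from_samples([math.log(a) for a in amounts])


customer_dist = fit_log_normal(customer_spend)
outlet_dist = fit_log_normal(outlet_spend)

# Overlapping coefficient: shared area under the two density curves (0 to 1).
match_percentage = customer_dist.overlap(outlet_dist)
print(f"match: {match_percentage:.0%}")
```

<p>Two identical distributions give an overlap of 1, and the value shrinks towards 0 as the two spending ranges drift apart.</p>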

<p>Looking at the customer’s spend history, I saw that the spend was lognormally distributed. A similar trend was observed in the restaurant’s order history. Since computing the overlap in the production environment was easier with normal distributions, I was okay with the conversion. I will cover this in more detail in a future post.</p>]]></content><author><name>Shivam Rana</name></author><category term="ML" /><summary type="html"><![CDATA[The Normal and lognormal distributions are fundamental concepts in statistics. I recently used the relationship between these two distributions in a project. In this blog post, I want to share what I learned.]]></summary></entry><entry><title type="html">Ooty: Friendships, Travel and Painting 📍</title><link href="https://trigonaminima.github.io/2023/12/ooty/" rel="alternate" type="text/html" title="Ooty: Friendships, Travel and Painting 📍" /><published>2023-12-10T00:00:00+00:00</published><updated>2023-12-10T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/12/ooty</id><content type="html" xml:base="https://trigonaminima.github.io/2023/12/ooty/"><![CDATA[<p>I spent the last two weeks of August in <a href="https://en.wikipedia.org/wiki/Ooty">Ooty📍</a>, a hill station in Tamil Nadu. Transitioning from Chennai’s heat to Ooty’s cold within a day was drastic. My hoodie was happy to be out from the bottom of my bag.</p>

<p>I stayed at <a href="https://www.zostel.com/zostel/ooty/">Zostel📍</a>. Advice: get the ten-bed dorm. It has access to the balcony. You can see all of Ooty from the balcony. At night, the lit-up Ooty City takes your mind away from everything.</p>

<figure class="third zostel_ooty_views gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_zostel_day.jpg" data-count="3" aria-label="Open image 1 of 3">
    <img src="/assets/2023-12/ooty_zostel_day.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_zostel_evening.jpg" aria-label="Open image 2 of 3">
    <img src="/assets/2023-12/ooty_zostel_evening.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_zostel_night.jpg" aria-label="Open image 3 of 3">
    <img src="/assets/2023-12/ooty_zostel_night.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Views of Ooty from the Zostel Ooty balcony during the day, evening, and night.</figcaption>
  
</figure>

<p>On my workations, my stay in any city has taken one of the following tracks: make friends and go with the flow or explore solo. Ooty was the city of friends.</p>

<p>I reached the hostel early in the morning. While I waited for my dorm bed to get cleaned up, I met Arvind and Sriram. All of us were in the same dorm. Enjoying the Ooty view from the balcony, we had an engaging conversation. It covered introductions, backgrounds, work, and <a href="https://venkatesh-rao.gitbook.io/summer-of-protocols/">protocols</a>. Introducing protocols is out of scope for this post, but do follow the shared link. The conversation ended as my work meeting approached.</p>

<p>There was almost a routine to our days. I would get up in the morning and start work after breakfast in the Hostel Cafe. These guys would chill or go out to some tourist points. I would have lunch and continue working till the evening. After dinner, all three of us and others (new people came and went every other day) would sit around the bonfire and talk. Sometimes we sang and played music. I would also try to produce some sounds with my Ukulele.</p>

<p>One evening, after wrapping up work early, the three of us went to a movie. On all my travels, I had watched movies on hostel TVs but had never once gone to a theatre. It was a novel experience. Let me describe the whole scene.</p>

<p>The movie in focus was <a href="https://en.wikipedia.org/wiki/Blue_Beetle_(film)">Blue Beetle</a> (judge all you want 😅). The only way to book the ticket was at the theatre called <a href="https://maps.app.goo.gl/FgCCGuVZ5BBJksM18">Assembly Rooms📍</a>. We reached there ten minutes early. Including us, there were only five people present. The booking window was not yet open. At showtime, the ticket guy informed us that he needed at least six people to start the show. Our strength was still five. We agreed to buy an extra ticket (each ticket was 180 bucks). After waiting for five more minutes, he let us watch the movie. The same guy was also the movie operator and started the show. The whole theatre was ours. We watched the senseless movie from the 3rd row and made fun of the various scenes.</p>

<figure class=" theatre gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_theatre.jpg" aria-label="Open image 1 of 1">
    <img src="/assets/2023-12/ooty_theatre.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Assembly Rooms theatre during intermission.</figcaption>
  
</figure>

<p>I had conversations with both of them on a range of topics. The topic of protocols continued. We also talked about green and sustainable energy (Arvind’s forte), their college life (they know each other from college), Peru (Arvind’s base), Ooty’s past (Arvind used to come here from Coimbatore in his childhood), <a href="https://en.wikipedia.org/wiki/Quantified_self">quantified self</a>, music, learning music, tradition of learning music or dance in Tamil families, and a lot more.</p>

<p>One by one, both of them left. It is always sad when someone leaves after you have spent time with them. It is a reality of life, but it is still sad to be left behind. Despite that, you keep going, and then it becomes normal again. You meet new people, and the cycle continues.</p>

<p>Ooty (and the whole of Tamil Nadu) is hard on solo travellers without transport. In every city, I rent a scooty to travel around. Ooty (read: Tamil Nadu) doesn’t allow renting vehicles. The weekend was here. I wanted to see Ooty. So, I asked for the touring taxis at the reception. The taxi guy quoted 2500 bucks for the whole day. It was time to make new friends.</p>

<p>I met Rajesh, who was also looking to share the taxi. Meghna and Gargee were the final two. And our impromptu travel group was ready. Although the taxi driver increased the rate to 3100 INR, it was still better for all four of us.</p>

<p>Rajesh is a PhD student in Geology. He was studying rocks somewhere near Ooty and decided to spend the weekend here. He is more interested in Climate Science and will join another PhD program in a few weeks.</p>

<p>Meghna and Gargee are childhood friends from Gwalior, Madhya Pradesh. Meghna works at an IT company in Pune. Gargee is an architect in Bangalore. She had left her job and was moving to Ahmedabad to become a Landscape Architect. They came to Ooty to enjoy Gargee’s last trip from Bangalore.</p>

<p>We went to the following points on our tour.</p>

<ol>
  <li><a href="https://maps.app.goo.gl/BMFaeQzN6RSDq5iX8">Pine Forest📍</a></li>
  <li><a href="https://maps.app.goo.gl/KkzXaSwT9HanFibF8">Lake - Sandynulla📍</a></li>
  <li><a href="https://maps.app.goo.gl/kA6En6csuu4D55Z68">Pykara Lake📍</a></li>
  <li><a href="https://maps.app.goo.gl/vDTucdo5PYwevxg89">Pykara waterfalls📍</a></li>
  <li><a href="https://maps.app.goo.gl/qLXRLytqGQJQ5BHa7">9th-mile shooting point📍</a></li>
</ol>

<figure class="third ooty_tour gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_tour_pine.jpg" title="We first went to the Pine Forest📍. Beware of the monkeys if you are eating something in front of them." data-count="6" aria-label="We first went to the Pine Forest📍. Beware of the monkeys if you are eating something in front of them.">
    <img src="/assets/2023-12/ooty_tour_pine.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_tour_sandynulla.jpg" title="We walked down through the pine trees to the adjacent lake called Sandynulla📍" aria-label="We walked down through the pine trees to the adjacent lake called Sandynulla📍">
    <img src="/assets/2023-12/ooty_tour_sandynulla.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_tour_pykara_lake.jpg" title="Pykara lake where we also did boating." aria-label="Pykara lake where we also did boating.">
    <img src="/assets/2023-12/ooty_tour_pykara_lake.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_tour_pykara_waterfall.jpg" title="Pykara waterfalls" aria-label="Pykara waterfalls">
    <img src="/assets/2023-12/ooty_tour_pykara_waterfall.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_tour_9th_mile.jpg" title="9th mile shooting point" aria-label="9th mile shooting point">
    <img src="/assets/2023-12/ooty_tour_9th_mile.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_tour_all_4_of_us.jpg" title="All four of us at the 9th mile shooting point" aria-label="All four of us at the 9th mile shooting point">
    <img src="/assets/2023-12/ooty_tour_all_4_of_us.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Tour of Ooty.</figcaption>
  
</figure>

<p>It took us till evening to cover all these points. We asked our cab driver to drop us in the city market. It was a full-moon night. We checked out some shops in the market, ate our dinner at <a href="https://maps.app.goo.gl/7CFVgPrGADG1daaKA">Adyar Ananda Bhavan - A2B 🍽️</a> and called it a night after warming up around the bonfire.</p>

<p>The next morning, I accompanied Meghna and Gargee to the Bus Depot. They were leaving for Wayanad. After seeing them off, I went to explore the city. I saw a watch tower. On my way, I saw scores of people entering and leaving an alleyway. My curiosity made me go into the alley, and I saw a complete change in the landscape. I was standing in a fruit and veggie market. There were multiple alleys leading to different sections of the market: fruits, veggies, meat, dry fruits, flowers, and other grocery stuff. It felt like <a href="https://harrypotter.fandom.com/wiki/Diagon_Alley">Diagon Alley</a> in the Harry Potter universe.</p>

<figure class="half ooty_market_area gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_watch_tower.jpg" title="Watch tower on my way to the Ooty municipal market." data-count="2" aria-label="Watch tower on my way to the Ooty municipal market.">
    <img src="/assets/2023-12/ooty_watch_tower.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_market2.jpg" title="One of the streets in the Ooty municipal market." aria-label="One of the streets in the Ooty municipal market.">
    <img src="/assets/2023-12/ooty_market2.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>

<figure class="half ooty_market_map gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_market_map.png" title="The Ooty municipal market area on the map." data-count="2" aria-label="The Ooty municipal market area on the map.">
    <img src="/assets/2023-12/ooty_market_map.png" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_market1.jpg" title="One of the many flower shops in the Ooty municipal market." aria-label="One of the many flower shops in the Ooty municipal market.">
    <img src="/assets/2023-12/ooty_market1.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>

<p>When I came back, Mugdha and Madhvi were painting the wall art in the common area.</p>

<p>A little backstory. My dorm was right next to the common area. On my first day in the hostel, the staff cleared and re-painted a wall in the common room for a new wall art. Mugdha came a few days later to pencil the outline. The outline consisted of the elements of Ooty: coffee plantation, toy train, <a href="https://en.wikipedia.org/wiki/Toda_people">Toda tribe</a>, rose garden, and pine forest. During the weekend, Madhvi arrived. Both of them started colouring the sketch. We became friends, and I started my <em>apprenticeship</em> under them. Soon it became a group of seven: Mugdha, Madhvi, Anshul, Sumit, Gautam, Vani, and me.</p>

<p>So, I started helping them with colouring after returning from the market. The last time I held a paintbrush was in the 9th grade, more than 14 years ago. I coloured different shades of roses, coffee beans, leaves, and grasslands. I enjoyed filling up those shapes using a paintbrush. I first added a colour coat without worrying about the brush strokes. Later, I painted over it to make it consistent. I aligned the strokes with the outline to make it coherent. I experimented with multiple ways of moving my brush. Slowly, I could achieve the same effect in fewer steps and less paint. It was a very calming activity. I used to leave more delicate stuff for my teachers. By the end, both of my teachers were proud of my work. 😁</p>

<p>Along the way, all of us talked extensively. Both Mugdha and Madhvi are from Mumbai and friends from college. Mugdha is into art, and Madhvi likes textile design. I learnt how students learn in art schools in India. Madhvi had been working in Design Thinking for school kids. She is going to get into textile design next. Mugdha is going to experiment more with painting and colours. One morning, Mugdha introduced me to <a href="https://www.youtube.com/watch?v=rLJlppzru4Q">Aahatein by Agnee</a>. (This song was at the top of my Spotify Wrapped this year.) We discussed similar songs. Madhvi showed her skills on my Ukulele. I also got to know about the <a href="https://www.biennialfoundation.org/biennials/kochi-muziris-biennale-india/">Kochi-Muziris Biennale</a>. I witnessed the traces of previous iterations of Biennale when I went to Kochi a few weeks later.</p>

<p>Anshul, who works on APIs for clients, was trying to explain to Madhvi and Mugdha what he does. While trying to help him explain, we learnt that many women, like Madhvi and Mugdha, enjoy having nerdy talks and would date such men. Then the topic moved to interesting or weird dates many of us have experienced.</p>

<p>Gautam played a ten-minute movie called Zima Blue from the <a href="https://en.wikipedia.org/wiki/Love,_Death_%26_Robots?useskin=vector#Episodes">Love, Death &amp; Robots</a> animation series. Gautam is a volunteer teacher in a nearby village. He is visiting India for a few months, following which he’ll return to the US. Sumit was on a road trip on his KTM and headed back home to Pune from Ooty. We talked about many of his road trips. Vani was on a weekend trip from Bangalore.</p>

<figure class="third ooty_zostel_painting gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_artist1.jpg" title="Mugdha colouring the train with my initials - SR. SR stands for both Shivam Rana and the lesser-known Southern Railways. 😛 Thanks for the tribute, guys. Sorry Madhvi, I don’t have a pic with both of you working on it together." data-count="5" aria-label="Mugdha colouring the train with my initials - SR. SR stands for both Shivam Rana and the lesser-known Southern Railways. 😛 Thanks for the tribute, guys. Sorry Madhvi, I don’t have a pic with both of you working on it together.">
    <img src="/assets/2023-12/ooty_artist1.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_artists2.jpg" title="Both the artists at work on leaving me a souvenir on my ukulele bag. Thank you, guys." aria-label="Both the artists at work on leaving me a souvenir on my ukulele bag. Thank you, guys.">
    <img src="/assets/2023-12/ooty_artists2.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_colors.jpg" title="All the colours being used in the artwork." aria-label="All the colours being used in the artwork.">
    <img src="/assets/2023-12/ooty_colors.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_painting_group1.jpg" title="Picture taken at Gautam’s (second from the left) goodbye." aria-label="Picture taken at Gautam’s (second from the left) goodbye.">
    <img src="/assets/2023-12/ooty_painting_group1.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_painting_group2.jpg" title="Picture taken at my goodbye." aria-label="Picture taken at my goodbye.">
    <img src="/assets/2023-12/ooty_painting_group2.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
</figure>

<p>I enjoyed painting so much that I extended my stay by a few more days. I, unfortunately, had to say goodbye to everyone before it was complete.</p>

<p>From the hostel, I took the bus to Coonoor📍, and from there I took the <a href="https://ootytourism.co.in/ooty-toy-train-mountain-railway">toy train</a> to Mettupalayam📍.</p>

<p>Fortunately, I got the window seat assigned to me. The train, running on steam, crossed multiple bridges and passed through coffee plantations and dark tunnels opening up to beautiful valley views. It stopped at two railway stations along the way. Both the stations had an <em>old</em> vibe to them. At the second stop, they also refilled the engine with water.</p>

<figure class="third ooty_toy_train gallery-popup">
  
  
  <a href="/assets/2023-12/ooty_toy_train.jpg" title="Ooty Toy Train" data-count="3" aria-label="Ooty Toy Train">
    <img src="/assets/2023-12/ooty_toy_train.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_toy_train2.jpg" title="Ooty Toy Train" aria-label="Ooty Toy Train">
    <img src="/assets/2023-12/ooty_toy_train2.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <a href="/assets/2023-12/ooty_toy_train_engine.jpg" title="Ooty Toy Train engine" aria-label="Ooty Toy Train engine">
    <img src="/assets/2023-12/ooty_toy_train_engine.jpg" alt="" loading="lazy" decoding="async" />
  </a>
  
  
  <figcaption>Ooty Toy Train</figcaption>
  
</figure>

<p>At the end of this journey, I took another bus to Coimbatore📍 and reached my hotel (no hostels in Coimbatore ☹️). I met with two people here: Arvind and Guhan. Arvind, whom I had met at the beginning of this post, returned to Coimbatore after leaving Ooty. Guhan is a friend I made in Hampi earlier this year. The story of Coimbatore will continue in another post.</p>]]></content><author><name>Shivam Rana</name></author><category term="Travel" /><summary type="html"><![CDATA[I spent the last two weeks of August in Ooty📍, A hill station in Tamil Nadu. Transitioning from Chennai’s heat to Ooty’s cold within a day was drastic. My hoodie was happy to be out from the bottom of my bag.]]></summary></entry><entry><title type="html">Visualizing a GroupBy (or a Bipartite Graph)</title><link href="https://trigonaminima.github.io/2023/11/bipartite-viz/" rel="alternate" type="text/html" title="Visualizing a GroupBy (or a Bipartite Graph)" /><published>2023-11-21T00:00:00+00:00</published><updated>2023-11-21T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/11/bipartite-viz</id><content type="html" xml:base="https://trigonaminima.github.io/2023/11/bipartite-viz/"><![CDATA[<p>Have you ever needed to present the output of a <a href="https://en.wikipedia.org/wiki/Group_by_(SQL)">GroupBy</a> or <a href="https://en.wikipedia.org/wiki/Pivot_table">Pivot Table</a>?</p>

<p>Will you display it as a table? Not everyone can grok it. It will also take time to walk people through the table.</p>

<p>You could format your table with colours (<a href="https://support.microsoft.com/en-au/office/highlight-patterns-and-trends-with-conditional-formatting-eea152f5-2a7d-4c1a-a2da-c5f893adb621">conditional</a> <a href="https://support.google.com/docs/answer/78413?hl=en&amp;co=GENIE.Platform%3DDesktop">formatting</a>) to show peaks and bottoms. That will work. However, it will become dense as the number of rows increases. Furthermore, this workflow involves exporting data from your respective data system (database or data lake) and importing it into Excel/Google Sheets. Thus, it is not feasible in all situations. One of those situations is what I faced.</p>

<p>I was doing an analysis of customer data at work. I wanted to see the distribution of cuisines in two consecutive orders. For example, the customer ordered Chinese food followed by South Indian in the next order. Because sequence matters for my analysis, Chinese to South Indian and South Indian to Chinese would be two separate rows. As you can imagine, a significant part of the GroupBy output contained these seemingly redundant pairs. It was difficult to derive any insights from it.</p>
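<p>To illustrate why the direction of a pair matters, here is a small sketch that counts consecutive cuisine pairs for a single customer. The order history is made up, and the real analysis ran as a GroupBy over the data system rather than in plain Python.</p>

```python
from collections import Counter

# Hypothetical order history for one customer, oldest order first.
orders = ["Chinese", "South Indian", "Chinese", "South Indian", "Pizza"]

# Pair each order with the one that follows it; direction is preserved,
# so (Chinese, South Indian) and (South Indian, Chinese) are counted apart.
transitions = Counter(zip(orders, orders[1:]))

for (first, second), count in transitions.items():
    print(f"{first} -> {second}: {count}")
```

<p>Every ordered pair becomes its own row, which is exactly what makes the tabular output long and hard to scan.</p>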

<h2 id="bipartite-graphs-to-the-rescue">Bipartite Graphs to the Rescue</h2>

<p>Fortunately for me, I was able to recall <a href="https://en.wikipedia.org/wiki/Bipartite_graph?useskin=vector">bipartite graphs</a>. Bipartite graphs model the relationship between two classes of objects. For example, think about the relationship between owners and their cars. An owner can own one or more cars. An owner cannot own other owners. Similarly, a car cannot own other cars. A bipartite graph will only show a relationship between a vehicle and its owner (two different classes of objects).</p>
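<p>The owner and car example can be written down as a tiny check in plain Python. This is only a sketch of the bipartite property: the names and the helper <code class="language-plaintext highlighter-rouge">is_valid_bipartite</code> are hypothetical, and it does not use NetworkX.</p>

```python
# Hypothetical ownership data: each edge joins an owner to a car.
owners = {"Asha", "Ravi"}
cars = {"Swift", "Nexon", "Alto"}
edges = [("Asha", "Swift"), ("Asha", "Nexon"), ("Ravi", "Alto")]


def is_valid_bipartite(edges, left, right):
    """Return True if every edge joins one node from each class."""
    if left & right:  # the two classes must not share any node
        return False
    return all(
        (a in left and b in right) or (a in right and b in left)
        for a, b in edges
    )


print(is_valid_bipartite(edges, owners, cars))                       # True
print(is_valid_bipartite(edges + [("Asha", "Ravi")], owners, cars))  # False
```

<p>The second call fails because an owner-to-owner edge stays inside one class, which a bipartite graph does not allow.</p>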

<p>It was perfect for my visualisation problem at hand!</p>

<p>However, generating a presentable graph turned out to be slightly roundabout. This article documents the process for my future self.</p>

<h2 id="the-process">The Process</h2>

<p>As expected, the <a href="https://networkx.org/">NetworkX</a> Python library had all the utilities available. The steps are as follows:</p>

<ol>
  <li>Get the data.</li>
  <li>Define a <a href="https://networkx.org/documentation/stable/reference/classes/graph.html#networkx.Graph"><code class="language-plaintext highlighter-rouge">networkx Graph</code></a>.</li>
  <li>Use <a href="https://networkx.org/documentation/stable/reference/generated/networkx.drawing.layout.bipartite_layout.html"><code class="language-plaintext highlighter-rouge">bipartite_layout()</code></a> to define the layout for a bipartite graph.</li>
  <li>Draw the graph using <a href="https://networkx.org/documentation/latest/reference/generated/networkx.drawing.nx_pylab.draw.html#networkx.drawing.nx_pylab.draw"><code class="language-plaintext highlighter-rouge">draw()</code></a>.</li>
</ol>

<p>There are more minor steps involved that we will cover during the deep dive. Since NetworkX plays well with the <a href="https://matplotlib.org/">Matplotlib</a> library, we have all the Matplotlib utilities available to us.</p>

<p>I will visualise the <a href="https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-leading-causes-of-death">age-wise top causes of death</a> according to WHO.</p>

<p>We start with the necessary imports.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="n">nx</span>

<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>We have to pre-process the data for the viz.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"male.csv"</span><span class="p">).</span><span class="n">set_index</span><span class="p">(</span><span class="s">"cod"</span><span class="p">).</span><span class="n">T</span>
<span class="n">data</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"cod_"</span><span class="o">+</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'age_group'</span><span class="p">).</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">wide_to_long</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span> <span class="n">stubnames</span><span class="o">=</span><span class="s">"cod"</span><span class="p">,</span> <span class="n">i</span><span class="o">=</span><span class="p">[</span><span class="s">'age_group'</span><span class="p">],</span> <span class="n">j</span><span class="o">=</span><span class="s">"cause"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">'_'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="sa">r</span><span class="s">'[\w ,]+'</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"percent"</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="s">"percent"</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">"percent"</span><span class="p">].</span><span class="nb">str</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span><span class="o">/</span><span class="mi">100</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">.</span><span class="n">cause</span> <span class="o">!=</span> <span class="s">"All Causes"</span><span class="p">]</span>
<span class="n">data</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>The data is ready. I wanted all the edges that start from the same node to share a colour. So I added an integer corresponding to each age group using the code below. We will use this column to get a random colour for each label with a <a href="https://stackoverflow.com/q/14720331/2650427">colour map</a>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="c1"># colors
</span><span class="n">node_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">([(</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'age_group'</span><span class="p">].</span><span class="n">unique</span><span class="p">())])</span>
<span class="n">data</span><span class="p">[</span><span class="s">"node_color"</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">"age_group"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">node_dict</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
<span class="n">data</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<figure class="single-image-popup">
  <a href="/assets/2023-11/groupby1.png" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-11/groupby1.png" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>So far, I have loaded the data and converted it from the <a href="https://en.wikipedia.org/wiki/Wide_and_narrow_data">wide to the long format</a> for NetworkX. Next, we define our graph using this data.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">edges</span> <span class="o">=</span> <span class="p">[</span><span class="nb">tuple</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[[</span><span class="s">'age_group'</span><span class="p">,</span> <span class="s">'cause'</span><span class="p">]].</span><span class="n">values</span><span class="p">.</span><span class="n">tolist</span><span class="p">()]</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="n">B</span><span class="p">.</span><span class="n">add_nodes_from</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'age_group'</span><span class="p">].</span><span class="n">unique</span><span class="p">(),</span> <span class="n">bipartite</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">B</span><span class="p">.</span><span class="n">add_nodes_from</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'cause'</span><span class="p">].</span><span class="n">unique</span><span class="p">(),</span> <span class="n">bipartite</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">B</span><span class="p">.</span><span class="n">add_edges_from</span><span class="p">(</span><span class="n">edges</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Below is how we visualise the graph.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1"># matplotlib variables
</span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">fig</span><span class="p">.</span><span class="n">set_size_inches</span><span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>

<span class="c1"># First specify the nodes we want on left or top
# create a bipartite layout
</span><span class="n">left_or_top</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'age_group'</span><span class="p">].</span><span class="n">unique</span><span class="p">()[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">pos</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">bipartite_layout</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">left_or_top</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

<span class="c1"># Pass that layout to nx.draw
</span><span class="n">nx</span><span class="p">.</span><span class="n">draw</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">pos</span><span class="p">,</span> <span class="n">node_color</span><span class="o">=</span><span class="s">'#A0CBE2'</span><span class="p">,</span> <span class="n">edge_color</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>We define the Matplotlib variables, use <code class="language-plaintext highlighter-rouge">bipartite_layout</code> to get the required layout, and draw the graph. Note that, without <code class="language-plaintext highlighter-rouge">edge_color="white"</code>, we could <a href="https://stackoverflow.com/a/54549650/2650427">stop at this step</a>: we would get a plot with equal-width, constant-colour edges and nodes. The next few steps fix the presentation of the plot.</p>

<p>We colour the edges first.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="c1"># define random color map - https://stackoverflow.com/a/68459848/2650427
</span><span class="n">colors_</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">n</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span>
    <span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">i</span><span class="p">:</span> <span class="s">"#"</span> <span class="o">+</span> <span class="s">"%06x"</span> <span class="o">%</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mh">0xFFFFFF</span><span class="p">),</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="n">colors</span> <span class="o">=</span> <span class="n">colors_</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">age_group</span><span class="p">.</span><span class="n">unique</span><span class="p">()))</span>

<span class="c1"># draw each edge
</span><span class="n">edge_width_dict</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">data</span><span class="p">[[</span><span class="s">'age_group'</span><span class="p">,</span> <span class="s">"cause"</span><span class="p">,</span> <span class="s">"percent"</span><span class="p">]]</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'age_group'</span><span class="p">,</span> <span class="s">"cause"</span><span class="p">])</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[[</span><span class="s">'age_group'</span><span class="p">,</span> <span class="s">"node_color"</span><span class="p">]].</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">values</span><span class="p">:</span>
    <span class="n">edges</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">edges</span><span class="p">([</span><span class="n">node</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
    <span class="n">color</span> <span class="o">=</span> <span class="n">colors</span><span class="p">[</span><span class="n">node</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span>
    <span class="n">edge_widths</span> <span class="o">=</span> <span class="p">[</span><span class="n">edge_width_dict</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"percent"</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">edges</span><span class="p">]</span>
    <span class="n">nx</span><span class="p">.</span><span class="n">draw_networkx_edges</span><span class="p">(</span>
        <span class="n">B</span><span class="p">,</span>
        <span class="n">pos</span><span class="p">,</span>
        <span class="n">edgelist</span><span class="o">=</span><span class="n">edges</span><span class="p">,</span>
        <span class="n">width</span><span class="o">=</span><span class="n">edge_widths</span><span class="p">,</span>
        <span class="n">edge_color</span><span class="o">=</span><span class="n">color</span><span class="p">,</span>
    <span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>We iterate through all the starting nodes and their corresponding colours. For each starting node, we fetch its edges, draw them all in that node's colour, and vary their widths according to the <code class="language-plaintext highlighter-rouge">percent</code> column.</p>

<p>The last piece of configuration is the node labels and their alignment. Without this segment, all the node labels would be centre-aligned, and a long label gets truncated in the viz. I want to point out that neither the documentation nor Stack Overflow could help me here. My saviour was ChatGPT. It gave me a working example using <code class="language-plaintext highlighter-rouge">draw_networkx_labels()</code> that I modified as below.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="c1"># left node labels alignment
</span><span class="k">for</span> <span class="n">node_name</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[</span><span class="s">'age_group'</span><span class="p">].</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">values</span><span class="p">:</span>
    <span class="n">node</span> <span class="o">=</span> <span class="p">{</span><span class="n">node_name</span><span class="p">:</span> <span class="n">node_name</span><span class="p">}</span>
    <span class="n">node_pos</span> <span class="o">=</span> <span class="p">{</span><span class="n">node_name</span><span class="p">:</span> <span class="n">pos</span><span class="p">[</span><span class="n">node_name</span><span class="p">]}</span>
    <span class="n">label_pos</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">draw_networkx_labels</span><span class="p">(</span>
        <span class="n">B</span><span class="p">,</span> <span class="n">node_pos</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">node</span><span class="p">,</span> <span class="n">font_size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">horizontalalignment</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span>
        <span class="n">verticalalignment</span><span class="o">=</span><span class="s">"bottom"</span>
    <span class="p">)</span>

<span class="c1"># right node labels alignment
</span><span class="k">for</span> <span class="n">node_name</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[</span><span class="s">'cause'</span><span class="p">].</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">values</span><span class="p">:</span>
    <span class="n">node</span> <span class="o">=</span> <span class="p">{</span><span class="n">node_name</span><span class="p">:</span> <span class="n">node_name</span><span class="p">}</span>
    <span class="n">node_pos</span> <span class="o">=</span> <span class="p">{</span><span class="n">node_name</span><span class="p">:</span> <span class="n">pos</span><span class="p">[</span><span class="n">node_name</span><span class="p">]}</span>
    <span class="n">label_pos</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">draw_networkx_labels</span><span class="p">(</span>
        <span class="n">B</span><span class="p">,</span> <span class="n">node_pos</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">node</span><span class="p">,</span> <span class="n">font_size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">horizontalalignment</span><span class="o">=</span><span class="s">'right'</span><span class="p">,</span>
        <span class="n">verticalalignment</span><span class="o">=</span><span class="s">"bottom"</span>
    <span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></figure>

<h2 id="our-beautiful-plots">Our Beautiful Plots</h2>

<p>Time to see the results.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-11/groupby_male.svg" style="text-align: center; margin: auto" title="Age-wise causes of death in males" aria-label="Age-wise causes of death in males">
    <img src="/assets/2023-11/groupby_male.svg" loading="lazy" decoding="async" />
  </a>
  
  <figcaption style="text-align: center; margin: auto">Age-wise causes of death in males</figcaption>
  
</figure>

<p>Male children mostly die due to Infectious and parasitic diseases, Respiratory infections, Maternal conditions, Neonatal conditions, and Nutritional deficiencies. Most teen and youth deaths (15-29 years in age) happen due to injuries. As men get old, serious ailments (Birth ailments, Cancer, Cardiovascular, Respiratory, and others) become more pronounced causes of death.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-11/groupby_female.svg" style="text-align: center; margin: auto" title="Age-wise causes of death in females" aria-label="Age-wise causes of death in females">
    <img src="/assets/2023-11/groupby_female.svg" loading="lazy" decoding="async" />
  </a>
  
  <figcaption style="text-align: center; margin: auto">Age-wise causes of death in females</figcaption>
  
</figure>

<p>Females follow a similar distribution. One notable difference is that relatively few women die due to injuries. Is that the reason women live longer than men?</p>

<p>The plots effectively showed the common diseases for each age group. Of course, this plot only gives a summary. And the summary is what we wanted from this viz.</p>

<h2 id="shortcomings">Shortcomings</h2>

<p>The plots were 90% there. Unfortunately, there are a few flaws.</p>

<p>While it provides me with a summary, it does not tell me the strength of the relationship. In that aspect, it is similar to pie charts. And the internet is filled with articles about why pie charts are unhelpful plots.</p>

<p>Another issue is the random colour and the edge width assigned to each edge. A node may get a yellowish-green colour, for example; even if its edges are relatively wide, they will still not be prominent. I re-ran my code to get a version with suitable colours. We could solve this by hand-selecting the colours and tuning the edge widths with a constant factor.</p>
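<p>A minimal sketch of that fix, assuming a hand-picked palette and a constant width factor. The palette values, the helper names <code class="language-plaintext highlighter-rouge">assign_colors</code> and <code class="language-plaintext highlighter-rouge">scale_widths</code>, the factor, and the minimum width are all illustrative choices and not part of the code above.</p>

```python
# Hypothetical hand-picked palette; any set of clearly distinct colours works.
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e", "#e6ab02"]


def assign_colors(groups, palette=PALETTE):
    """Map each group to a fixed colour, cycling if groups outnumber colours."""
    return {group: palette[i % len(palette)] for i, group in enumerate(groups)}


def scale_widths(percents, factor=0.2, min_width=0.5):
    """Scale percentages by a constant factor, with a floor so thin edges stay visible."""
    return [max(min_width, p * factor) for p in percents]
```

<p>The returned dictionary could stand in for the random <code class="language-plaintext highlighter-rouge">colors</code> list, and <code class="language-plaintext highlighter-rouge">scale_widths</code> could be applied to <code class="language-plaintext highlighter-rouge">edge_widths</code> before they are passed to <code class="language-plaintext highlighter-rouge">draw_networkx_edges</code>.</p>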

<h2 id="conclusion">Conclusion</h2>

<p>We wanted a summary visualisation of our GroupBy (or pivot table) output. To achieve that, we converted it into a bipartite graph and rendered it using Matplotlib.</p>

<p>There are flaws in this visualisation. The strength of the relationship is not apparent. Additionally, edge colour and widths need tuning to make the strong relationships prominent. Fixing these issues is future work.</p>]]></content><author><name>Shivam Rana</name></author><category term="Viz" /><summary type="html"><![CDATA[Have you ever needed to present the output of a GroupBy or Pivot Table?]]></summary></entry><entry><title type="html">[Mini] How to Parse JSON in Spark without Knowing the Schema?</title><link href="https://trigonaminima.github.io/2023/07/pyspark-json-parse/" rel="alternate" type="text/html" title="[Mini] How to Parse JSON in Spark without Knowing the Schema?" /><published>2023-07-08T00:00:00+00:00</published><updated>2023-07-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/07/pyspark-json-parse</id><content type="html" xml:base="https://trigonaminima.github.io/2023/07/pyspark-json-parse/"><![CDATA[<h2 id="problem-statement">Problem Statement</h2>

<p>I have a JSON column in my DataFrame.</p>

<ul>
  <li>The JSON is in string format.</li>
  <li>It is a nested JSON.</li>
  <li>It is a large string.</li>
  <li>I do not know the schema and want to avoid defining it manually.</li>
  <li>All the JSONs follow the same schema definition.</li>
</ul>

<p>I need to format it as a JSON object (<a href="https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/types/StructType.html"><code class="language-plaintext highlighter-rouge">struct</code></a>) to extract anything out of it. How do I convert it into a <code class="language-plaintext highlighter-rouge">struct</code>?</p>

<h2 id="solution">Solution</h2>

<p>Here is the solution if you are short on time. In the next section, I discuss it in more detail.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="code"><pre><span class="c1"># Spark 3.2.1 | Scala 2.12
</span><span class="kn">import</span> <span class="nn">pyspark.sql.functions</span> <span class="k">as</span> <span class="n">F</span>

<span class="c1"># Sample json we will work with.
</span><span class="n">sample_json</span> <span class="o">=</span> <span class="s">"""
{
  "lvl1":  {
    "lvl2a": {
      "lvl3a":   {
        "lvl4a": "random_data",
        "lvl4b": "random_data"
      }
    },
    "lvl2b":   {
      "lvl3a":   {
        "lvl4a": "random_data"
      },
      "lvl3b":  [
        {"lvl4a": "random_data"},
        {"lvl4b": "random_data"}
      ]
    }
  }
}
"""</span>

<span class="c1"># Spark dataframe with json column
</span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="n">sample_json</span><span class="p">,)]</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="p">[</span><span class="s">"json_data"</span><span class="p">])</span>

<span class="c1"># determine the schema
</span><span class="n">json_schema</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">schema_of_json</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">"json_data"</span><span class="p">)).</span><span class="n">first</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span>

<span class="c1"># converting json to struct
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s">"json_data_struct"</span><span class="p">,</span> <span class="n">F</span><span class="p">.</span><span class="n">from_json</span><span class="p">(</span><span class="s">"json_data"</span><span class="p">,</span> <span class="n">json_schema</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></figure>

<h2 id="details">Details</h2>

<p>We will use <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.schema_of_json.html"><code class="language-plaintext highlighter-rouge">pyspark.sql.functions.schema_of_json</code></a> to do our dirty work of determining the schema.</p>

<p>Just like any other column-based function, I expected this function to work on a column. So I tried this as below:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s">"sch"</span><span class="p">,</span> <span class="n">F</span><span class="p">.</span><span class="n">schema_of_json</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">"json_data"</span><span class="p">)))</span></code></pre></figure>

<p>It threw the below error:</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell">AnalysisException: cannot resolve <span class="s1">'schema_of_json(json_data)'</span> due to data <span class="nb">type </span>mismatch: The input json should be a foldable string expression and not null<span class="p">;</span> however, got json_data.<span class="p">;</span>
...</code></pre></figure>

<p>I did not know what a <em>foldable string</em> was. The data type of the <code class="language-plaintext highlighter-rouge">json_data</code> column was already a string. ChatGPT also suggested using the function this way. :)</p>

<p>The <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.schema_of_json.html">documentation</a> and multiple Stack Overflow answers [<a href="https://stackoverflow.com/a/64032076/2650427">1</a>, <a href="https://stackoverflow.com/a/64077996/2650427">2</a>, <a href="https://stackoverflow.com/a/59143129/2650427">3</a>] helped me reach an explanation.</p>

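<p>The <code class="language-plaintext highlighter-rouge">schema_of_json</code> function needs a single literal string instead of a column. A <em>foldable</em> expression is one that Spark can evaluate to a constant while planning the query, which a column reference is not. So I extracted one JSON string from the column and passed it to the function. This is how I did it:</p>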

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre><span class="n">json_string</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">"json_data"</span><span class="p">)).</span><span class="n">first</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">json_schema</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">schema_of_json</span><span class="p">(</span><span class="n">json_string</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
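<p>To make the idea concrete, here is a rough, Spark-free sketch of what inferring a schema from one sample string involves. It is not how Spark implements <code class="language-plaintext highlighter-rouge">schema_of_json</code>; the helper <code class="language-plaintext highlighter-rouge">infer_schema</code> and its type names are made up for illustration.</p>

```python
import json


def infer_schema(value):
    """Return a nested description of the shape of a parsed JSON value."""
    if isinstance(value, dict):
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Shallow-merge element shapes so keys seen in any element are kept.
        merged, scalar = {}, None
        for item in value:
            shape = infer_schema(item)
            if isinstance(shape, dict):
                merged.update(shape)
            else:
                scalar = shape
        return [merged or scalar]
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"


sample = '{"lvl1": {"a": "x", "b": [{"p": "x"}, {"q": 1}], "c": true}}'
schema = infer_schema(json.loads(sample))
```

<p>Because the schema comes from a single sample, this approach only works when every row follows the same schema, which is the assumption stated in the problem statement.</p>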

<p>The end.</p>

<p><br /></p>]]></content><author><name>Shivam Rana</name></author><summary type="html"><![CDATA[Problem Statement]]></summary></entry><entry><title type="html">Fitness Dashboard with Google Fit</title><link href="https://trigonaminima.github.io/2023/06/google-fit-data/" rel="alternate" type="text/html" title="Fitness Dashboard with Google Fit" /><published>2023-06-09T00:00:00+00:00</published><updated>2023-06-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/06/google-fit-data</id><content type="html" xml:base="https://trigonaminima.github.io/2023/06/google-fit-data/"><![CDATA[<p>In this post, I will describe my Google Sheets dashboard, where I track all fitness-related aspects.</p>

<p>I like self-tracking. I want to track my productivity and find potential improvements. What gets tracked also gets measured. My interest in the <a href="https://en.wikipedia.org/wiki/Quantified_self">Quantified self</a> has evolved.</p>

<p>I started with <a href="/2014/11/gamification-of-life/">Gamification of Life</a>, where I assigned points to everything I did. It got too overwhelming after a year.</p>

<p>I am not great at maintaining relationships with friends and family. For insights, I analysed my chats’ metadata - <a href="/2016/06/chatting-up/">Chatting Up</a> and <a href="/2018/04/chatting-up-2/">Chatting Up - Part II</a>. It was interesting to see how my interactions with friends changed over time. I also wanted to play with the chat content, but NLP capabilities at that time weren’t enough to deal with <a href="/2018/06/hinglish-and-transliteration/">Hinglish</a> text.</p>

<p>I track my call logs and someday would like to analyse them.</p>

<p>Two years back, I even started making an app to track anything. You can read more about the app here:</p>

<ul>
  <li><a href="/2021/07/flutter_app_1/">Minutes - A Quantified Self App for Myself</a></li>
  <li><a href="/2021/08/flutter_app_2/">Minutes - Building the Settings Page</a></li>
  <li><a href="/2021/08/flutter_app_3/">Minutes - Building the History Page</a></li>
</ul>

<p>Eventually, other commitments and curiosities caught up, and I couldn’t finish it. 😅</p>

<p>I also track my spending and habits. I am sure I have skipped a few more.</p>

<h2 id="tracking-fitness">Tracking Fitness</h2>

<p>Fitness was another frontier where I tried many things. Tracking was always enabled using Google Fit and Maps, but I did not do anything with the data. Google Fit analytics on the app was helpful, but I wanted more. Time was difficult to find between work, travel, and other interests. I wanted something quick and easy to develop/maintain.</p>

<p>Introducing: Do More With Less (DMWL). It is a trend at work where you identify and prioritize the tasks that are quick to execute with good ROI.</p>

<p>I searched for previous work in this direction. I found <a href="https://towardsdatascience.com/how-i-built-a-google-spreadsheet-to-keep-track-of-google-fit-fitness-data-a0887a59f730">this Medium article</a>, which pointed to a more helpful article that did what I wanted - <a href="https://ithoughthecamewithyou.com/post/export-google-fit-daily-steps-to-a-google-sheet">Export Google Fit Daily Steps, Weight and Distance to a Google Sheet</a> [code from the blog is available here: <a href="https://github.com/abfo/google-fit-to-sheets/blob/master/Code.gs">google-fit-to-sheets/Code.gs</a>]. It made almost everything straightforward - setting up the app, auth, and pulling and formatting the data. ChatGPT made up for my lack of JavaScript knowledge when writing the code for Google Sheets.</p>

<p><strong>Update: 19th July</strong></p>

<p>I have had to set up the credentials multiple times now, so I am adding the steps here for quick reference. In-depth instructions are in <a href="https://github.com/googleworkspace/apps-script-oauth2">apps-script-oauth2/README.md</a>.</p>

<ol>
  <li>Open the script editor by going to Extensions &gt; Apps Script. It will open a new Apps Script project.</li>
  <li>Name the project. Click the + in the Libraries section. In the Add a Library dialogue, add <code class="language-plaintext highlighter-rouge">1B7FSrk5Zi6L1rSxxTDgDEUsPzlukDsi4KGuTMorsTQHhGBzBkMun4iDF</code> as the Script ID. This will find the <a href="https://github.com/googleworkspace/apps-script-oauth2">Google OAuth2 Lib</a>. Select the latest version and save.</li>
  <li>Go to Project Properties from the file menu and make a note of the Script ID. This is the ID for our new project. We will need it later.</li>
  <li>Open the <a href="https://accounts.google.com/ServiceLogin?service=cloudconsole&amp;passive=1209600&amp;osid=1&amp;continue=https://console.cloud.google.com/apis/dashboard&amp;followup=https://console.cloud.google.com/apis/dashboard">Google API Console</a>.</li>
  <li>Create a new project and name it.</li>
  <li>Go to Enable APIs and Services and find the <strong>Fitness API</strong>.</li>
  <li>Go to Keys and create an OAuth Client ID. While creating the consent screen, only add the product name. Select “Web Application” in the application type. In the redirect URL add <code class="language-plaintext highlighter-rouge">https://script.google.com/macros/d/{SCRIPTID}/usercallback</code> and replace the <code class="language-plaintext highlighter-rouge">{SCRIPTID}</code> with the Script ID copied in step 3. Note down the client id and client secret created at the end. We will use these in our script. [You can read more on <a href="https://support.google.com/cloud/answer/6158849?hl=en">Setting up OAuth 2.0</a>.]</li>
</ol>

<p><strong>Update: 28th July</strong></p>

<p>The script stops working after a few days. This has happened to me twice. I get the following error:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Access not granted or expired.
Service_.getAccessToken
@ Service.gs:518
</code></pre></div></div>

<p>I still haven’t found the solution, but I have a few leads:</p>

<ol>
  <li>I hypothesise that it is related to oauth2 details being stored in the <a href="https://developers.google.com/apps-script/guides/properties">Properties Service</a>. The token is empty when I print the Properties.</li>
  <li>This Properties Store has an expiry (likely 1 hour). I couldn’t find a way to update the oauth2 details before the expiration. I tried multiple ways, including <a href="https://benronkin.com/blog/how-to-use-script-properties-in-google-apps-script.html">deleting and setting</a> the properties. This <a href="https://stackoverflow.com/a/71747558/2650427">SO answer</a> didn’t help either.</li>
</ol>

<p><strong>End of the Update</strong></p>

<figure class="single-image-popup">
  <a href="/assets/2023-06/1_fit_dash_logs.png" style="text-align: center; margin: auto" title="Raw data in yellow-shaded cells. The 0.00 values are unlogged days." aria-label="Raw data in yellow-shaded cells. The 0.00 values are unlogged days.">
    <img src="/assets/2023-06/1_fit_dash_logs.png" loading="lazy" decoding="async" />
  </a>
  
  <figcaption style="text-align: center; margin: auto">Raw data in yellow-shaded cells. The 0.00 values are unlogged days.</figcaption>
  
</figure>

<figure class="single-image-popup">
  <a href="/assets/2023-06/2_fit_dash_logs2.png" style="text-align: center; margin: auto" title="Rolling averages are calculated within the sheet." aria-label="Rolling averages are calculated within the sheet.">
    <img src="/assets/2023-06/2_fit_dash_logs2.png" loading="lazy" decoding="async" />
  </a>
  
  <figcaption style="text-align: center; margin: auto">Rolling averages are calculated within the sheet.</figcaption>
  
</figure>
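<p>For readers who prefer code to sheet formulas, a trailing rolling average can be sketched as below. The seven-day window and the helper name <code class="language-plaintext highlighter-rouge">rolling_mean</code> are assumptions for illustration; the formula used in the sheet itself is not shown here.</p>

```python
def rolling_mean(values, window=7):
    """Trailing mean over the last `window` values (fewer at the start)."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


smoothed = rolling_mean([1, 2, 3, 4], window=2)
```

<p>Note that unlogged days stored as zero pull such an average down, so they may be worth excluding before averaging.</p>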

<h2 id="code">Code</h2>

<p>In this section, I will discuss the coding involved. You can skip to the <a href="#fitness-dashboard">next section</a> for the dashboard.</p>

<p>Here is the code flow:</p>

<ol>
  <li>Get today’s date.</li>
  <li>Get all the specified metrics for the date. I care about the following specific events: step count, weight, heart points, and all logged activities.
    <ul>
      <li>Step counts: <code class="language-plaintext highlighter-rouge">com.google.step_count.delta</code></li>
      <li>Weight: <code class="language-plaintext highlighter-rouge">com.google.weight.summary</code></li>
      <li>Heart Points: <code class="language-plaintext highlighter-rouge">com.google.heart_minutes</code></li>
      <li>All logged activities: <code class="language-plaintext highlighter-rouge">com.google.activity.segment</code></li>
    </ul>
  </li>
  <li>Set the precision of all the numbers and impute null values with zero.</li>
  <li>Get the spreadsheet object using <a href="https://developers.google.com/apps-script/reference/spreadsheet/spreadsheet-app"><code class="language-plaintext highlighter-rouge">getActiveSpreadsheet()</code></a> and append the data on the last empty row using <a href="https://developers.google.com/apps-script/reference/spreadsheet/sheet#getlastrow"><code class="language-plaintext highlighter-rouge">getLastRow()</code></a>.</li>
  <li>Copy the cell formatting of all the cells from the row before using the <a href="https://developers.google.com/apps-script/reference/spreadsheet/range#copytodestination,-options"><code class="language-plaintext highlighter-rouge">copyTo(destination, options)</code></a> function.</li>
  <li>Copy the rolling avg. formulae from the row before, again using the <code class="language-plaintext highlighter-rouge">copyTo()</code> function.</li>
</ol>
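<p>The first three steps can be sketched as below. This is an illustrative Python version, not the Apps Script code that runs in the sheet. The request fields follow my understanding of the Fit REST <code class="language-plaintext highlighter-rouge">dataset:aggregate</code> body and should be checked against the docs; the helper names and the use of UTC day boundaries are assumptions.</p>

```python
from datetime import date, datetime, timedelta, timezone

# Data type names taken from the list above.
DATA_TYPES = [
    "com.google.step_count.delta",
    "com.google.weight.summary",
    "com.google.heart_minutes",
    "com.google.activity.segment",
]


def day_bounds_millis(day):
    """Start and end of the given day in epoch milliseconds (UTC assumed)."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def build_aggregate_request(day):
    """Request body asking for one daily bucket per data type."""
    start_ms, end_ms = day_bounds_millis(day)
    return {
        "aggregateBy": [{"dataTypeName": name} for name in DATA_TYPES],
        "bucketByTime": {"durationMillis": end_ms - start_ms},
        "startTimeMillis": start_ms,
        "endTimeMillis": end_ms,
    }


def clean_row(values, precision=2):
    """Round every number and replace missing values with zero."""
    return [0 if v is None else round(v, precision) for v in values]


body = build_aggregate_request(date.today())
row = clean_row([8123, None, 42.4567])
```

<p>Imputing missing values with zero keeps the sheet numeric, at the cost of making unlogged days look like days with no activity.</p>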

<h3 id="google-fit-api">Google Fit API</h3>

<p>It is dense! Extending it to my signals required scouring the docs and multiple SO answers.</p>

<p>The first helpful link was: <a href="https://developers.google.com/apis-explorer/#search/fitness.users.datasources.list/m/fitness/v1/fitness.users.dataSources.list?userId=me&amp;_h=1">Users.dataSources: list</a>. It gave me all the data points I could ask for from the Fit API. The next challenge was discovering the schema and what the different fields meant. After several rounds of trial and error, I found the <a href="https://developers.google.com/fit/rest/v1/reference/activity-types">Activity Types</a> page. It gave me the required ID for each activity.</p>

<p><strong>Update: 28th July</strong>: I stumbled upon the guide to the <a href="https://developers.google.com/fit/rest/v1/get-started">REST API of Fit</a>.</p>

<p>I hope you will find these links useful.</p>

<h3 id="next-features">Next Features</h3>

<p>There are two immediate hurdles I need to cross.</p>

<p>I have to authorize the app every day to call the API. A quick search told me that the access token expires after an hour; the <code class="language-plaintext highlighter-rouge">expires_in</code> field of the token response gives its lifetime, and a refresh token is needed to get a new one. Unfortunately, I could not figure out how to use it.</p>

<p>Similarly, I have to call the function daily (by pressing a button from the menu). I can automate it through the time-driven (clock) trigger [ref: <a href="https://developers.google.com/apps-script/guides/triggers">triggers</a>]. The problem is the token expiration. The trigger will fail the next day because the token is stale.</p>

<h2 id="fitness-dashboard">Fitness Dashboard</h2>

<p>Time for the final results.</p>

<p>Google Fit gives you a <a href="https://support.google.com/fit/answer/7619539?hl=en&amp;co=GENIE.Platform%3DAndroid#zippy=%2Chow-to-earn-heart-points">Heart Point</a> (HP) for each minute of activity you do. Here is how mine looks. My heart points mainly include walking, working out, and a little bit of swimming.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-06/3_fit_dash_hp.png" style="text-align: center; margin: auto" title="Rolling averages are calculated within the sheet." aria-label="Rolling averages are calculated within the sheet.">
    <img src="/assets/2023-06/3_fit_dash_hp.png" loading="lazy" decoding="async" />
  </a>
  
  <figcaption style="text-align: center; margin: auto">Rolling averages are calculated within the sheet.</figcaption>
  
</figure>

<p>The red line is an aggregated line to make it easy to see the trend.</p>
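
<p>The rolling averages live in the sheet as formulae, which I have not reproduced here. As a rough equivalent, here is a minimal Python sketch of a trailing average. The 7-day window, the function name, and the sample numbers are assumptions for illustration, not the values from my dashboard.</p>

```python
from collections import deque


def trailing_average(values, window=7):
    """Return the trailing mean of `values` over the last `window` entries."""
    recent = deque(maxlen=window)  # keeps only the latest `window` values
    averages = []
    for value in values:
        recent.append(value)
        # Early entries average over fewer points until the window fills up.
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical daily heart points: five active days, two rest days, repeat.
daily_hp = [40, 35, 50, 45, 30, 0, 0, 42, 38]
smoothed = trailing_average(daily_hp, window=7)
```

<p>With a window as long as the weekly cycle, the two rest days get averaged with the five workout days, so the smoothed line stays away from zero even when the raw series touches it.</p>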

<p>The yellow-shaded regions highlight the days I was workationing. During these days, my heart points rarely reached zero. Whereas in the non-shaded periods, I frequently hit zero. Those zeroes are my two rest days after five days of working out.</p>

<p>My heart points are also cyclic in nature. Whenever I am home, my only regular activity is exercising with a two-day break every week. Walking becomes an occasional affair.</p>

<p>Let’s look at these heart points more closely.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-06/fit_dash_steps.png" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-06/fit_dash_steps.png" loading="lazy" decoding="async" />
  </a>
  
</figure>

<figure class="single-image-popup">
  <a href="/assets/2023-06/fit_dash_workout.png" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-06/fit_dash_workout.png" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>Can you notice the complementary nature of the two graphs? First, look at the red line and then focus on blue.</p>

<p>During the travel period, my HP fluctuated because of the crazy number of daily steps (a proxy for walking) and irregular/short workouts. And when I am at home, I go crazy with my exercise. During May, a few things at home led to more walking and small and irregular workouts.</p>

<p>That’s the end of my DMWL version of the Fitness dashboard.</p>

<h2 id="next-steps">Next Steps</h2>

<p>The first stage of my dashboard is complete. I will iteratively update it to get more out of it. I discussed the tech improvements in the coding section. As a part of my fitness tracking journey, here is what I want to do.</p>

<p>I want to come up with some standards or thresholds for myself. It could mean saying something like reaching 30-40 HP daily or working out for fifty minutes five days a week. I do not know what these standards will look like.</p>

<p>I want to use this dashboard to motivate myself. It can be a tracker for fitness-related habits. It should all be automated. Consequently, I want this dashboard to help me inculcate new fitness-related habits. Towards this vision, the next step is to add more activities to the mix, namely meditation, cycling, and swimming.</p>

<p>I live a healthy lifestyle. Another goal of this dashboard is to observe how different decisions in my life impact my health and productivity. That will help me course correct. Some directions are:</p>

<ul>
  <li>What is the impact of my diet on my weight? For example, when does a high-calorie diet typically reflect in my physique? The current heuristic-based answer is two weeks but needs validation from data.</li>
  <li>Sleeping hour vs the workout efficiency the next day. Does routine matter? If yes, how and where?</li>
  <li>How does my fitness change during my travels?</li>
  <li>Does fitness impact my productivity?</li>
</ul>

<p>There are more, but these are the important ones in my mind.</p>

<p><strong>Update: 19th July</strong>: I added meditation and cycling under the activities section. In the nutrition section, I added calories burnt and water intake. Calories burnt is Fit’s approximation. The water intake is tracked manually in the app like the activities.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><summary type="html"><![CDATA[In this post, I will describe my Google Sheets dashboard, where I track all fitness-related aspects.]]></summary></entry><entry><title type="html">Mechanisms for Data Science Projects</title><link href="https://trigonaminima.github.io/2023/05/mechanisms-ds-projects/" rel="alternate" type="text/html" title="Mechanisms for Data Science Projects" /><published>2023-05-12T00:00:00+00:00</published><updated>2023-05-12T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/05/mechanisms-ds-projects</id><content type="html" xml:base="https://trigonaminima.github.io/2023/05/mechanisms-ds-projects/"><![CDATA[<blockquote>
  <p>Good intentions never work, you need good mechanisms to make anything happen - Jeff Bezos.</p>
</blockquote>

<p>A mechanism is a process where</p>

<ol>
  <li>You create a <strong>tool</strong>;</li>
  <li>Drive <strong>adoption</strong> of the tool;</li>
  <li><strong>Inspect</strong> to correct course.</li>
</ol>

<p>Read more - <a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/building-mechanisms.html">Building mechanisms</a>.</p>

<p>I have observed that mechanisms work more efficiently than good intentions. Thus, I always try to convert good intentions into mechanisms. Looking at how Data Science projects are not as streamlined as Software Engineering projects, I searched for guidance on managing them better.</p>

<p>The following are my notes from <a href="https://eugeneyan.com/writing/mechanisms-for-projects/">Mechanisms for Effective Machine Learning Projects</a> and related articles. These mechanisms apply to any data science project (work or hobby). Note that these mechanisms are mental tools. You have to make them a habit for mental tools to work.</p>

<h2 id="meta-checklist-for-projects">Meta-Checklist for Projects</h2>

<ul>
  <li>1-pager describing a map to the destination (1-7 days)
    <ul>
      <li>Intent or why. Quantify the problem.</li>
      <li>Desired outcome. Business metric.</li>
      <li>Deliverable. No need for it to be detailed.</li>
      <li>Constraints. How not to solve the problem.</li>
    </ul>
  </li>
  <li>Timebox the project. Based on the timeline, design a solution that fits. Ref: <a href="#timeboxing-projects">timebox section</a>.</li>
  <li>Literature review
    <ul>
      <li>It does not have to be exhaustive.</li>
      <li>Quickly identify approaches that have worked and build on them.</li>
      <li>Refer to the <a href="#literature-review">lit. review section</a>.</li>
    </ul>
  </li>
  <li>Reviews
    <ul>
      <li>Schedule once you have the results from the initial experiments.</li>
      <li>It helps with catching blindspots or critical errors.</li>
      <li>Focus points
        <ul>
          <li>Input data and features</li>
          <li>Offline evaluation</li>
          <li>Room for improvements.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Set up the work environment. Read: <a href="https://eugeneyan.com/writing/setting-up-python-project-for-automation-and-collaboration/#automate-checks-with-each-git-push-and-pull-request">How to Set Up a Python Project For Automation and Collaboration</a></li>
  <li>Consistent documentation during the project.
    <ul>
      <li>Document whatever is not in the code.</li>
      <li>Create documentation like an applied research paper: motivation, lit review, data, methodology, results, and next steps.</li>
      <li>It helps with replication.</li>
      <li>Read more here: <a href="https://eugeneyan.com/writing/why-you-need-to-follow-up-after-your-data-science-project/#make-your-work-reproducible-each-run-every-run">Why You Need to Follow Up After Your Data Science Project</a>.</li>
    </ul>
  </li>
  <li>Have informal stand-ups with the team:
    <blockquote>
      <p>Share unusual findings; Discuss ideas; Get help on bugs; Ask for reviews; etc.</p>
    </blockquote>
  </li>
  <li>Regular stakeholder communication
    <ul>
      <li>Check in regularly with them.</li>
      <li>It ensures that the deliverable aligns with the overall goals.</li>
      <li>It is also a source of feedback and clever suggestions.</li>
    </ul>
  </li>
  <li>Read more here:
    <ul>
      <li><a href="https://eugeneyan.com/writing/what-i-do-before-a-data-science-project-to-ensure-success/#first-draw-the-map-to-the-destination-one-pager">What I Do Before a Data Science Project to Ensure Success</a></li>
      <li><a href="https://eugeneyan.com/writing/what-i-do-during-a-data-science-project-to-ensure-success/">What I Do During A Data Science Project To Deliver Success</a></li>
    </ul>
  </li>
</ul>

<h2 id="timeboxing-projects">Timeboxing Projects</h2>

<ul>
  <li>It makes you focus on the most crucial tasks.</li>
  <li>Timebox: stretch goals wrt the project.</li>
  <li>Estimate: Upper bound of effort needed.</li>
  <li>Rule of thumb to go from timebox to estimate: multiply by 1.5 - 3.0.</li>
  <li>Most aggressive timebox: halve the time spent on a similar project. Create an MVP. Quick iteration cycles. Intense.</li>
  <li>Comfortable-yet-challenging timebox: reduce the time by 10-20%. Good default.</li>
  <li>Standard timebox: for open-ended projects. 2 weeks lit. review, 4-8 weeks for prototype building, and 3-6 months for production.</li>
</ul>
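
<p>The arithmetic above is simple enough to put in a helper. This is only a sketch: the function names are hypothetical, and the 10-week reference project is a made-up number.</p>

```python
def timebox_options(similar_project_weeks):
    """Timeboxes derived from how long a similar past project took."""
    return {
        # Most aggressive: halve the time of the similar project.
        "aggressive": similar_project_weeks * 0.5,
        # Comfortable-yet-challenging: cut the time by 20% down to 10%.
        "comfortable": (similar_project_weeks * 0.8, similar_project_weeks * 0.9),
    }


def estimate_range(timebox_weeks, low=1.5, high=3.0):
    """Upper-bound effort implied by a timebox, using the 1.5-3.0 multiplier."""
    return timebox_weeks * low, timebox_weeks * high


options = timebox_options(10)        # a similar project took 10 weeks
estimate = estimate_range(4)         # a 4-week timebox
print(options)
print(estimate)
```

<p>For a 4-week timebox, the estimate lands between 6 and 12 weeks. That gap is the point: the timebox is the stretch goal, the estimate is the upper bound.</p>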

<h2 id="executing-projects">Executing Projects</h2>

<p>Mechanism to execute projects with high confidence.</p>

<ul>
  <li><strong>Pilot</strong> and <strong>copilot</strong> for each project.</li>
  <li>Pilot: main project owner.
    <ul>
      <li>Responsible for success/failure</li>
      <li>Own and delegate as required.</li>
    </ul>
  </li>
  <li>Copilot: helps the pilot stay on track, identify critical flaws, and call out blindspots.
    <ul>
      <li>Periodic check-ins</li>
      <li>Reviews document drafts and prototypes</li>
      <li>Mandatory code reviewer</li>
    </ul>
  </li>
  <li>Copilot has (more) experience in the problem space.</li>
  <li>Copilot spends 10% of the pilot’s effort.</li>
</ul>

<h2 id="literature-review">Literature Review</h2>

<ul>
  <li>Always start the project with a literature review.</li>
  <li>Read papers relevant to the problem.</li>
  <li>Start with applied research: <a href="https://github.com/eugeneyan/applied-ml">applied-ml</a>.</li>
  <li>Reviewing papers for problem understanding
    <ul>
      <li><strong>Formulation</strong>
        <blockquote>
          <p>Classification, regression, or something else?</p>
        </blockquote>
      </li>
      <li><strong>Data processing</strong>
        <blockquote>
          <p>How was data excluded, preprocessed, and rebalanced? How were labels defined? Was a third neutral class added? How were labels augmented, perhaps via hard mining?</p>
        </blockquote>
      </li>
      <li><strong>Evaluation process</strong>
        <blockquote>
          <p>How was the training and validation set created? What offline evaluation metrics did they use? How did they improve the correlation between offline and online evaluation metrics?</p>
        </blockquote>
      </li>
    </ul>
  </li>
  <li>How to go through each paper is discussed in the next section.</li>
</ul>

<h2 id="3-pass-approach-for-reading-papers">3-Pass Approach for Reading Papers</h2>

<p>For single paper</p>

<ol>
  <li>Scan the abstract and conclusion to understand if the paper is useful. If it is, then skim through the headings to identify the problem statement, methods, and results.</li>
  <li>In the 2nd pass, highlight the relevant sections. Helps in quickly spotting the important bits later. Take notes. For most of the papers, 2nd pass is enough.</li>
  <li>Do a 3rd pass to cement the knowledge.</li>
</ol>

<p>For multiple papers from the same domain</p>

<ol>
  <li>Do 1st and 2nd passes on each paper.</li>
  <li>In the 3rd pass, consolidate common concepts across papers into a single note and compare the pros and cons. Doing this helps identify gaps in my knowledge. If there are gaps, then revisit the paper.</li>
</ol>

<p>Find more here: <a href="https://eugeneyan.com/writing/why-read-papers/#how-to-read-papers">How Reading Papers Helps You Be a More Effective Data Scientist</a>.</p>

<h2 id="collaboration-and-standard-practices">Collaboration and Standard Practices</h2>

<ul>
  <li>Create shared libraries for oft-used data operations.
    <ul>
      <li>It encourages the team to contribute and thus leads to collaboration and code reviews.</li>
      <li>It nudges people towards a team mindset.</li>
    </ul>
  </li>
  <li>Have a single repo with training, evaluation, and inference code in one place.
    <ul>
      <li>Everybody works and reviews the same code.</li>
      <li>It helps in knowledge sharing.</li>
      <li>It also slows development down, but the pros outweigh the cons.</li>
    </ul>
  </li>
  <li>Read more: <a href="https://www.ethanrosenthal.com/2023/01/10/data-scientists-alone/">Data scientists work alone and that’s bad</a>.</li>
</ul>

<p><br /></p>

<p>I will keep updating/adding to this list as I read/experiment more.</p>]]></content><author><name>Shivam Rana</name></author><summary type="html"><![CDATA[Good intentions never work, you need good mechanisms to make anything happen - Jeff Bezos.]]></summary></entry><entry><title type="html">Belagavi or Belgaum 📍</title><link href="https://trigonaminima.github.io/2023/01/belagavi/" rel="alternate" type="text/html" title="Belagavi or Belgaum 📍" /><published>2023-01-16T00:00:00+00:00</published><updated>2023-01-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2023/01/belagavi</id><content type="html" xml:base="https://trigonaminima.github.io/2023/01/belagavi/"><![CDATA[<figure class="single-image-popup">
  <a href="/assets/2023-01/itinerary.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/itinerary.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<h2 id="entering-belagavi">Entering Belagavi</h2>

<figure class="single-image-popup">
  <a href="/assets/2023-01/belagavi.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/belagavi.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>My first interaction in the city: ₹250 for an auto for 2 KMs. That’s just robbery! He came down to ₹100 after I told him off. Then found another auto guy who said ₹80. (This was also high, but I was tired.)</p>

<p>Belagavi is the (proposed) 2nd capital of Karnataka. It is a controversial city: both Maharashtra (Maha) and Karnataka want it within their borders. I don’t understand this border dispute. 😞</p>

<p>Update: A journalist friend, Sarayu, gave me a few pointers here. Language is the reason for this fight. Most of North Karnataka up to Goa knows Hindi, thanks to Nizam rule and the districts that border MP and Maha. And Hindi becomes more important because Bombay is closer than Bangalore. On top of that, this region becomes strategically important because it is geographically closer to upcoming districts like Karwar. There is also a spurt of educational institutions like IIT Dharwad/Raichur coming up.</p>

<p><strong>Language</strong>. Almost everyone understood Hindi. Everyone knew Kannada and Marathi.</p>

<p>Update: Belgaum is not a prosperous region and is politically weak. Sarayu mentioned that since Mumbai is nearby, Belagavis go there for work. So, they learn Hindi and Marathi. This also means that the youth leave for work in other cities and mostly the older generations remain in Belgaum.</p>

<p><strong>Payments</strong>. After using cash everywhere in Maha, I was surprised that, barring a few auto drivers, UPI was widely accepted in Belgaum. It looks like Belgaum is on its way to becoming a <a href="https://bscl.in/">smart city</a>.</p>

<p><strong>Stay</strong>. No hostels. Multiple hotels are available, but all average or below average. I chose a hotel in the market area.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/biryani.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/biryani.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p><strong>Food</strong>. The first thing was Belagavi Biryani. I tried it at a famous chain called <a href="https://goo.gl/maps/WozzrUhCcVkjrxRA8">Niyaaz Restaurant (Main Branch) 📍</a>. I’d rate it 3.5/5. A friend (met her later during the week) mentioned that Niyaaz is over-rated.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/kunda.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/kunda.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>The second thing was a dessert called <a href="https://karnatakatourism.org/destinations/belagavi-kunda/">Belagavi Kunda</a>. I loved it! All the sweet shops sell Kunda. You can buy 100 grams of it for ₹20 or ₹25.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/thalis.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/thalis.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>Last thing was to try the Maharashtrian thali available at indie restaurants.</p>

<p><strong>Tourist Points</strong>. There are two points in the Belagavi Fort area: <a href="https://goo.gl/maps/hikGGLXTwbAvpgfJ9">Kamala Basadi 📍</a> and <a href="https://goo.gl/maps/rEHFX7zfV1igxvVJ7">Safa Masjid 📍</a>.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/basadi.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/basadi.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p><strong>Story time</strong>. After wrapping up from work, I left for the Fort area. I reached my destination after walking 3 KMs and found that it was an army area. No one stopped me from going inside. I went to the mosque. Unfortunately for me, it was inside the Army protected area. My phone said 9 PM. The guards at the entry started questioning me. They let me go only when they were satisfied that I wasn’t a bad element. Of course I couldn’t see the masjid. Apparently, it is open to the general public only during <a href="https://en.wikipedia.org/wiki/Eid_al-Adha">Bakrid</a>. Afterwards, while coming out of the main area where no one stopped me while entering, another army personnel stopped me. Same barrage of questions came my way. On top of that, he checked my bag and photographed my ID. It turns out that that area becomes a no-movement zone after 9 PM. Next day I visited the Kamala Basadi and ended my tourist mode. 😅</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/vibe.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/vibe.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p><strong>Vibe</strong>. Most of my interaction was during transactions: shop owners, hotel folks, or auto drivers. I almost never got a polite response from anyone. They were not rude either. There was no warmth in the interaction. No one smiled. Just plain dry transaction. Belagavi is the first city where I felt like that.</p>

<p><strong>Story time</strong>. When I went to Kamala Basadi after the night incident, I went by an auto. I was already apprehensive of the auto guys from my 1st night. This guy turned out to be the opposite. Abdul Rashid Shaikh and I talked about my reasons for coming to the town. The ₹250 story came up. He cursed all these auto guys. And the discussion (mostly me listening) went to good and bad deeds. He then took me to the Basadi and the Masjid (only to be denied entry by the army person). He then dropped me back at the hotel. All within ₹200. I would have paid more if I had done these trips separately. This guy also suggested skipping a few cities from my itinerary. If you are in Belagavi, call for him on +91-8880866313.</p>

<p>I usually stay at hostels. Staying at a hotel was an experiment. Sadly, it was a failure. As a solo traveler, staying at hostels is more enjoyable than at a hotel. And usually hostels are in the cities with multiple points to explore or with good vibes. Belagavi had none. Thus, I decided to go to the cities with hostels.</p>

<figure class="single-image-popup">
  <a href="/assets/2023-01/itinerary2.jpg" style="text-align: center; margin: auto" aria-label="Open image">
    <img src="/assets/2023-01/itinerary2.jpg" loading="lazy" decoding="async" />
  </a>
  
</figure>

<p>Next stop: <strong>Hampi</strong>.</p>]]></content><author><name>Shivam Rana</name></author><category term="Travel" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">[Summary] Monolith: Real-Time RecSys With Collisionless Embeddings</title><link href="https://trigonaminima.github.io/2022/10/tiktok-monolith-review/" rel="alternate" type="text/html" title="[Summary] Monolith: Real-Time RecSys With Collisionless Embeddings" /><published>2022-10-31T00:00:00+00:00</published><updated>2022-10-31T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/10/tiktok-monolith-review</id><content type="html" xml:base="https://trigonaminima.github.io/2022/10/tiktok-monolith-review/"><![CDATA[<p>Paper link: <a href="http://arxiv.org/abs/2209.07663">Monolith: Real Time Recommendation System With Collisionless Embedding Table</a></p>

<h2 id="abstract">Abstract</h2>

<ul>
  <li>Real-time RecSys are important when customer feedback is time sensitive (eg: TikTok short-video ranking).</li>
  <li>The production-scale DL frameworks (PyTorch, TensorFlow) are designed with separate batch-training and model serving stages. This makes online training difficult.</li>
  <li>Presenting Monolith for online training:
    <ul>
      <li>Collisionless embeddings with expiry parameter and frequency filtering to reduce memory footprint</li>
      <li>Online training architecture with fault-tolerance in parameter server</li>
    </ul>
  </li>
  <li>Part of <a href="https://www.byteplus.com/en/product/recommend">BytePlus Recommend</a>.</li>
</ul>

<h2 id="data-in-recsys">Data in RecSys</h2>

<ul>
  <li>For many businesses driven by RecSys, better CX = real-time RecSys.</li>
  <li>Information from a user’s latest interaction becomes the primary input as it’s the best signal of a user’s future interest and behavior.</li>
  <li>DL in RecSys
    <ul>
      <li><a href="https://arxiv.org/abs/1606.07792">Wide &amp; Deep Learning for Recommender Systems, 2016</a></li>
      <li><a href="https://dl.acm.org/doi/10.1145/2959100.2959190">Deep Neural Networks for YouTube Recommendation, 2016</a></li>
      <li><a href="https://arxiv.org/abs/1906.03109">The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation, 2020</a></li>
      <li><a href="https://dl.acm.org/doi/10.1145/3326937.3341255">XDL: an industrial deep learning framework for high-dimensional sparse data, 2019</a></li>
      <li><a href="https://dl.acm.org/doi/abs/10.5555/3433701.3433728">Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations, 2020</a></li>
      <li><a href="https://dl.acm.org/doi/abs/10.1145/3357384.3358045">AIBox: CTR Prediction Model Training on a Single Node, 2019</a></li>
    </ul>
  </li>
  <li>DL in industry RecSys faces problems because of the real-world data.</li>
  <li>Data is different from CV/NLP tasks:
    <ul>
      <li>Features are mostly <strong>sparse, categorical, and dynamically changing</strong>.</li>
      <li><strong>Concept Drift</strong>: Training data distribution is non-stationary. (ref: <a href="https://dl.acm.org/doi/10.1145/2523813">A survey on concept drift adaptation, 2014</a>)</li>
    </ul>
  </li>
</ul>

<h3 id="sparsity-and-dynamism">Sparsity and Dynamism</h3>

<ul>
  <li>RecSys data has a lot of categoricals (eg: customer id, item id, item type, etc)</li>
  <li>Categorical features are sparse (eg: a user only buys limited items).</li>
  <li>Feature engineering of categorical features: map them to a <strong>high-dimensional embedding</strong> space.</li>
  <li>Issues with embeddings for categoricals:
    <ul>
      <li>Users and items are orders of magnitude larger than word-piece tokens in LMs. This <strong>enormous embedding table would hardly fit in memory</strong>.</li>
      <li>As more users and items are added, the size would increase further.</li>
    </ul>
  </li>
  <li>Current solution: <strong>Low-collision hashing</strong> to reduce the memory footprint and to allow the growing of IDs (user or item)
    <ul>
      <li>Assumptions:
        <ul>
          <li>Embedding table is distributed evenly in frequency. It is <strong>rarely true</strong> because only a small group of users or items have high frequency.</li>
          <li>Collisions are harmless to model output. But it is <strong>detrimental</strong> because organic growth in embedding table size leads to more collisions.</li>
        </ul>
      </li>
      <li>Ref: <a href="https://instagram-engineering.com/core-modeling-at-instagram-a51e0158aa48">Core Modeling at Instagram</a></li>
      <li>Ref: <a href="https://dl.acm.org/doi/10.1145/2959100.2959190">Deep Neural Networks for YouTube Recommendation, 2016</a></li>
    </ul>
  </li>
  <li>Thus, natural and constant demand to elastically adjust the users and items a RecSys tries to book-keep.</li>
</ul>
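
<p>To see why a fixed-size hashed table degrades as IDs grow, here is a small sketch. It is not from the paper: the MD5-modulo hash, the function names, and the table size of 1000 rows are all assumptions picked for illustration.</p>

```python
import hashlib


def row_for(raw_id, table_rows):
    """Map an ID to a row of a fixed-size embedding table via hashing."""
    digest = hashlib.md5(str(raw_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % table_rows


def colliding_ids(num_ids, table_rows):
    """Count IDs that share their embedding row with at least one other ID."""
    occupancy = {}
    for raw_id in range(num_ids):
        row = row_for(raw_id, table_rows)
        occupancy[row] = occupancy.get(row, 0) + 1
    return sum(count for count in occupancy.values() if count > 1)


TABLE_ROWS = 1000  # hypothetical fixed table size
for num_ids in (100, 1000, 5000):
    # The table never grows, so more IDs means more shared rows.
    print(num_ids, colliding_ids(num_ids, TABLE_ROWS))
```

<p>Once there are more IDs than rows, most IDs are forced to share an embedding with some other ID. With a skewed frequency distribution, a popular item can end up sharing its vector with unrelated long-tail IDs.</p>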

<h3 id="concept-drift">Concept Drift</h3>

<ul>
  <li>Underlying user distribution is non-stationary: user interests change with time (even during sessions).</li>
  <li>More recent data is more likely to predict change in user’s behavior.</li>
  <li>Mitigating concept drift: serving model should be updated as close to real-time as possible to reflect the latest user interests.</li>
</ul>

<h2 id="parameter-server-ps">Parameter Server (PS)</h2>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/5_tiktok_ps.png" alt="" style="display:block;text-align:center" width="300" />
</figure>

<ul>
  <li>Worker machines compute the gradients.</li>
  <li>PS machines store parameters and update them according to gradients.</li>
  <li>Two kinds: 1) training PS; and 2) serving PS. Training PS holds training parameters. Once training is complete, it is synced to Serving PS.</li>
  <li>Two types of parameters:
    <ol>
      <li>Dense: weights/variables in DNN; and</li>
      <li>Sparse: embedding tables corresponding to sparse (categorical) features.</li>
    </ol>
  </li>
  <li>Since both dense and sparse parameters are part of the TensorFlow Graph, Monolith stores them on the PS.</li>
  <li>The <a href="https://www.tensorflow.org/guide/variable"><code class="language-plaintext highlighter-rouge">tf.Variable</code></a> is for dense variables. For sparse variables, authors created HashTable operations.</li>
</ul>
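
<p>The worker/PS split can be sketched in plain Python. This is a toy under heavy assumptions, not Monolith's implementation: the class and function names are hypothetical, each embedding is a single float instead of a vector, and the "model" is just a bias plus the embedding of one item ID.</p>

```python
class ParameterServer:
    """Toy PS: stores dense weights and sparse embedding rows, applies gradients."""

    def __init__(self, dense, learning_rate=0.1):
        self.dense = dict(dense)   # name -> weight (dense parameters)
        self.sparse = {}           # feature ID -> embedding (sparse parameters)
        self.learning_rate = learning_rate

    def pull(self, ids):
        """Return dense weights plus only the embedding rows a worker asks for."""
        rows = {i: self.sparse.setdefault(i, 0.0) for i in ids}
        return dict(self.dense), rows

    def push(self, dense_grads, sparse_grads):
        """Apply one SGD step using gradients sent by a worker."""
        for name, grad in dense_grads.items():
            self.dense[name] -= self.learning_rate * grad
        for i, grad in sparse_grads.items():
            self.sparse[i] -= self.learning_rate * grad


def worker_step(ps, item_id, label):
    """Worker: pull parameters, compute squared-error gradients, push them back."""
    dense, rows = ps.pull([item_id])
    prediction = dense["bias"] + rows[item_id]
    error = prediction - label
    # The gradient of 0.5 * error**2 is `error` for both the bias and the row.
    ps.push({"bias": error}, {item_id: error})
    return 0.5 * error ** 2


ps = ParameterServer({"bias": 0.0})
losses = [worker_step(ps, item_id=42, label=1.0) for _ in range(20)]
```

<p>The worker never owns any state. It only touches the embedding rows for the IDs in its batch, which is what makes the sparse side cheap to pull and push even when the full table is huge.</p>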

<h3 id="hash-table">Hash Table</h3>

<ul>
  <li>Representation of embeddings in TensorFlow and its limitation
    <ul>
      <li>The <a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding"><code class="language-plaintext highlighter-rouge">tf.keras.layers.Embedding</code></a> layer uses variables to represent the dense embedding vectors. (the embedding matrix is of type <code class="language-plaintext highlighter-rouge">Variable</code>.)</li>
      <li>The <code class="language-plaintext highlighter-rouge">Variable</code> construct freezes the shape of the matrix throughout the training/serving process. Thus, it is a <strong>fixed-size</strong> embedding.</li>
      <li>As IDs increase with time, since the table size is fixed, ID collisions (while updating/using the dense embedding vector) would increase.</li>
    </ul>
  </li>
  <li>Authors implemented a new key-value HashTable.
    <ul>
      <li>Hashing algorithm: <a href="https://en.wikipedia.org/wiki/Cuckoo_hashing">Cuckoo hashing</a>[<a href="https://youtu.be/OBuGqu2d4v4">Visualization + Explanation - Youtube</a>]</li>
      <li>Lookup: <code class="language-plaintext highlighter-rouge">O(1)</code>; Insertion: amortized <code class="language-plaintext highlighter-rouge">O(1)</code></li>
      <li>Implemented as a TensorFlow resource operation (it likely means a <a href="https://www.tensorflow.org/tutorials/customization/custom_layers">TensorFlow custom layer</a>).</li>
      <li>Lookups and updates are implemented as native TF operations.</li>
    </ul>
  </li>
  <li>Naive insertion: insert every new ID in the HashTable. Will <strong>deplete memory</strong> quickly.</li>
  <li>Insertion by <strong>frequency</strong>
    <ul>
      <li>IDs (user, item, etc) have long-tail distribution.</li>
      <li>Infrequent IDs will have underfit embeddings because of less training data.</li>
      <li>Model quality will not suffer from removal of these IDs.</li>
      <li>Filter by a threshold of occurrences before insertion.</li>
      <li>The threshold is a <strong>tunable hyperparameter</strong> for each model.</li>
      <li>Also use a <strong>probabilistic filter</strong> (didn’t expand on it)</li>
    </ul>
  </li>
  <li>Insertion by <strong>staleness</strong>
    <ul>
      <li>Many IDs are never visited (user inactive, out-of-date item)</li>
      <li>Set an <strong>expiry time</strong> for each ID.</li>
      <li>The <strong>expiry time</strong> is tunable for each embedding table: different tables will have different sensitivity to historical information.</li>
    </ul>
  </li>
</ul>
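
<p>Here is a minimal sketch of how the frequency filter and the expiry could fit together. It is my own simplification, not the paper's code: a Python dict stands in for the Cuckoo HashTable, an exact counter stands in for the probabilistic filter, and the class name, <code class="language-plaintext highlighter-rouge">min_count</code>, and <code class="language-plaintext highlighter-rouge">ttl</code> are hypothetical names.</p>

```python
class FilteredEmbeddingTable:
    """Dict-backed stand-in for a collisionless table with admission and expiry."""

    def __init__(self, dim, min_count, ttl):
        self.dim = dim
        self.min_count = min_count  # occurrences required before an ID is admitted
        self.ttl = ttl              # time units an ID may stay unvisited
        self.counts = {}            # ID -> occurrences seen before admission
        self.rows = {}              # ID -> embedding vector (one key per ID)
        self.last_seen = {}         # ID -> last time the ID was looked up

    def lookup(self, raw_id, now):
        """Return the embedding for `raw_id`, or None while it is filtered out."""
        if raw_id not in self.rows:
            self.counts[raw_id] = self.counts.get(raw_id, 0) + 1
            if self.counts[raw_id] >= self.min_count:
                # Frequent enough: admit the ID with a fresh embedding.
                self.rows[raw_id] = [0.0] * self.dim
                del self.counts[raw_id]
            else:
                return None
        self.last_seen[raw_id] = now
        return self.rows[raw_id]

    def expire(self, now):
        """Drop IDs that have not been visited for more than `ttl` time units."""
        stale = [i for i, seen in self.last_seen.items() if now - seen > self.ttl]
        for raw_id in stale:
            del self.rows[raw_id]
            del self.last_seen[raw_id]
        return stale


table = FilteredEmbeddingTable(dim=4, min_count=3, ttl=10)
first = table.lookup("user_1", now=0)   # still below the threshold
table.lookup("user_1", now=1)
third = table.lookup("user_1", now=2)   # third occurrence, so admitted
removed = table.expire(now=20)          # unvisited for 18 time units
```

<p>Every admitted ID gets its own key, so there are no collisions, and the two knobs bound how much memory the long tail and the inactive IDs can take. The exact counter here would itself grow without bound, which is presumably why the authors reach for a probabilistic filter.</p>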

<h2 id="model-training">Model Training</h2>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/6_tiktok_training.png" alt="" style="display:block;text-align:center" width="100%" />
<figcaption style="text-align: center">Training Engine in Monolith</figcaption>
</figure>

<ul>
  <li>Engineering steps
    <ol>
      <li>User logs (click, like, buy) go to <a href="https://kafka.apache.org/">Kafka</a>.</li>
      <li>Model features are present in another Kafka queue (didn’t discuss what features)</li>
      <li>Create the training example by joining the features with user logs using a <a href="https://flink.apache.org/">Flink job</a>.
        <ul>
          <li>First, check for the data in in-memory cache;</li>
          <li>If not found, then go to on-disk key-value storage (happens in cases when user feedback arrives after days and in-memory cache is cleared to free-up the memory)</li>
        </ul>
      </li>
      <li>Push the created training example to a 3rd Kafka queue.</li>
      <li>Push data from the 3rd queue to HDFS for offline training mode.</li>
      <li>Trigger online or offline training</li>
      <li>Push the updated parameters to the Training PS</li>
      <li>Sync the Serving PS with the Training PS</li>
    </ol>
  </li>
  <li>Batch training stage
    <ul>
      <li>Ordinary TF training loop
        <ol>
          <li>Training worker reads a mini-batch from storage.</li>
          <li><strong>Request parameters from PS.</strong></li>
          <li>Compute a forward and backward pass.</li>
          <li><strong>Push the updated parameters to training PS.</strong></li>
        </ol>
      </li>
      <li><strong>Only train for a single pass over the data.</strong> (to mimic the online training phase?)</li>
      <li>Useful when: the model architecture is modified and requires retraining.</li>
    </ul>
  </li>
  <li>Online training stage
    <ul>
      <li>Triggered when the model is online.</li>
      <li>Steps:
        <ol>
          <li>Training worker consumes real-time data from a Kafka queue.</li>
          <li>Update the parameters in the training PS.</li>
          <li><strong>Periodically sync the updated parameters from the training PS to the serving PS.</strong></li>
        </ol>
      </li>
    </ul>
  </li>
  <li>Negative sampling
    <ul>
      <li>To handle the highly skewed negative to positive sample ratio.</li>
      <li>It changes the underlying distribution of the trained model: higher probability of making positive predictions.</li>
      <li>Apply <strong>log-odds correction</strong> during serving to ensure the online model is an unbiased estimator of the original distribution. (ref: <a href="https://arxiv.org/abs/2110.13048">Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data, 2021</a>)</li>
    </ul>
  </li>
  <li>Parameter sync. between training and serving PS
    <ul>
      <li>Production models are <strong>TB in size</strong>.</li>
      <li>Replacing all the parameters will take time.</li>
      <li>It will also consume network bandwidth and extra storage (need to store the new parameters before replacing the old ones).</li>
      <li>Solution: incremental periodic parameter sync.
        <ol>
          <li>Sparse features (aka embedding tables): Sync the keys whose vectors updated during the last <strong>1 minute</strong>.</li>
          <li>Dense variables (aka model weights): model weights move much slower because the momentum-based optimisers take more time to build momentum over the big data. Thus the sync frequency is 1-day. The authors found the stale weights tolerable.</li>
        </ol>
      </li>
    </ul>
  </li>
  <li>Fault tolerance: periodic model snapshots
    <ul>
      <li>Trade-off between: model quality (because of the loss of recent updates) and computation overhead (copy-pasting TB of data)</li>
      <li>Snapshot frequency: 1-day. Experiments revealed that performance degradation was tolerable.</li>
    </ul>
  </li>
</ul>
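<p>The incremental sync idea above can be sketched in a few lines. This is a toy model under my own assumptions (the class and method names are hypothetical, and plain dicts stand in for real parameter servers): the training PS remembers which sparse keys were touched, ships only those rows on the frequent sync, and the dense weights are copied wholesale on the slow one.</p>

```python
class TrainingPS:
    """Toy training parameter server that tracks which sparse keys changed."""

    def __init__(self):
        self.sparse = {}      # embedding tables: key -> vector
        self.dense = {}       # dense model weights: name -> value
        self.touched = set()  # sparse keys updated since the last sync

    def update_sparse(self, key, vector):
        self.sparse[key] = vector
        self.touched.add(key)

    def sparse_delta(self):
        """Return only the rows changed since the previous call, then reset."""
        delta = {key: self.sparse[key] for key in self.touched}
        self.touched.clear()
        return delta


class ServingPS:
    """Toy serving parameter server that accepts patches."""

    def __init__(self):
        self.sparse = {}
        self.dense = {}

    def apply_sparse_delta(self, delta):
        self.sparse.update(delta)  # frequent (minute-level) small patch

    def replace_dense(self, dense):
        self.dense = dict(dense)   # infrequent (daily) full copy
```

<p>A scheduler would call <code class="language-plaintext highlighter-rouge">serving.apply_sparse_delta(training.sparse_delta())</code> every minute and <code class="language-plaintext highlighter-rouge">serving.replace_dense(training.dense)</code> once a day; the serving side never has to hold a second full copy of the embedding tables.</p>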

<h2 id="evaluation">Evaluation</h2>

<p>(I will skip the experiment setup and jump over to the results.)</p>

<h3 id="the-effect-of-embedding-collision">The Effect of Embedding Collision</h3>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/7_tiktok_collision.png" alt="" style="display:block;text-align:center" width="100%" />
</figure>

<ul>
  <li>The model with collisionless embedding vectors consistently outperforms the one with collisions.</li>
  <li>Independent of <strong>training epochs</strong> and <strong>concept drift</strong> (non-stationary training data)</li>
</ul>

<h3 id="online-training-vs-batch-training">Online Training vs Batch Training</h3>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/8_tiktok_online_training.png" alt="" style="display:block;text-align:center" width="100%" />
<figcaption style="text-align: center">Online training vs Batch training on Criteo dataset.</figcaption>
</figure>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/8_tiktok_online_trainingb.png" alt="" style="text-align: center; margin: auto" width="500" />
<figcaption style="text-align: center">Different sync intervals for online training</figcaption>
</figure>

<ul>
  <li>Online training has better performance than the batch training.
    <ul>
      <li>AUC of online training models: evaluated by the following shard of data.</li>
      <li>AUC of batch training models: evaluated by each shard of data (?)</li>
      <li>General AUC delta ranged from 0.20 (5 hr interval) to 0.40 (30 min interval).</li>
    </ul>
  </li>
  <li>Smaller parameter sync interval (or higher parameter sync freq.) performs better than the larger intervals.</li>
  <li>Based on these results, best sync frequency for sparse features that the systems could endure was <strong>1 minute</strong>.
    <ul>
      <li>Assuming 100,000 IDs with 1024-dimensional float32 vectors are updated each minute: <strong>~400 MB</strong> (4 KB * 100,000) of network transfer per minute.</li>
    </ul>
  </li>
  <li>Sync frequency for dense features is 1-day (every midnight) as they update slowly.</li>
</ul>
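<p>The back-of-the-envelope number above is easy to check. The sketch below assumes 4 bytes per value (float32), which is what the 4 KB per ID figure implies; the function name is mine, not the paper’s.</p>

```python
def sync_payload_bytes(num_ids, dim, bytes_per_value=4):
    """Bytes needed to ship `num_ids` embedding rows of width `dim`."""
    return num_ids * dim * bytes_per_value


payload = sync_payload_bytes(num_ids=100_000, dim=1024)
print(payload / 1e6)            # megabytes per one-minute sync
print(payload * 8 / 60 / 1e6)   # average megabits per second over that minute
```

<p>The exact product is 409,600,000 bytes, so “~400 MB” is the rounded figure. Spread over a minute, that is a modest sustained rate for a datacentre network, which fits the claim that the systems could endure a 1-minute interval.</p>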

<blockquote>
  <p>Is the 2nd figure correct? The 5hr sync interval model should degrade till the sync happens. After sync, it should have similar AUC as other models. It should then degrade again from that point until the next sync. That is not happening here. What am I missing?</p>
</blockquote>

<h3 id="ps-reliability">PS Reliability</h3>

<ul>
  <li>Hypothesis: minute-level parameter syncing should mean frequent snapshots.</li>
  <li><strong>Wrong</strong>. Observed no loss in the model quality even with 1-day snapshot interval.</li>
  <li>
    <p>Below excerpt explains the reason:</p>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2022-10/9_tiktok_PS_failure.png" alt="" width="300" />
  </figure>
  </li>
  <li><strong>Lesson: don’t take frequent snapshots and save resources.</strong></li>
</ul>

<h2 id="summary-conclusion">Summary Conclusion</h2>

<ul>
  <li>The paper proposed the following:
    <ol>
      <li>Cuckoo HashMap based collisionless embedding tables</li>
      <li>Online training and parameter sync architecture</li>
    </ol>
  </li>
  <li>With extensive experiments (both offline and online) they showed that:
    <ul>
      <li>Collisionless embedding tables have a positive impact on the model quality (AUC gains ranged from 0.20% to 0.40%)</li>
      <li>Online training performs better than batch training in RecSys setting.</li>
      <li>Higher parameter sync freq is better (1 minute for prod systems); and</li>
      <li>It is okay to have a smaller parameter snapshotting frequency (1-day for prod systems)</li>
    </ul>
  </li>
  <li>This paper was a write-up of engineering tricks that ByteDance employed to build their RecSys.</li>
  <li>The few nuggets of ML that I noticed:
    <ul>
      <li>Apply log-odds correction to the data in online serving to make up for negative sampling.</li>
      <li>The online real-time model at ByteDance is a multi-tower architecture where each tower is responsible for learning a special kind of user behavior. (Is that an allusion to multi-objective ranking through different towers?)</li>
    </ul>
  </li>
</ul>
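<p>The log-odds correction mentioned above can be sketched for the simplest case. I am assuming <em>uniform</em> negative downsampling with a known keep rate \(r\); the referenced paper handles the more general nonuniform case, and the notes don’t say which rate Monolith uses. If only a fraction \(r\) of negatives is kept, the odds seen during training are inflated by \(1/r\), so adding \(\ln r\) to the logit at serving time undoes the inflation. The function names are hypothetical.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrected_probability(sampled_logit, keep_rate):
    """Undo uniform negative downsampling on a logit learned from sampled data.

    keep_rate is the fraction of negatives retained in training (0 to 1].
    """
    return sigmoid(sampled_logit + math.log(keep_rate))


# A model trained with 10% of the negatives kept outputs logit 0 (p = 0.5).
p = corrected_probability(sampled_logit=0.0, keep_rate=0.1)
print(round(p, 4))  # 0.0909
```

<p>With <code class="language-plaintext highlighter-rouge">keep_rate=1.0</code> the correction vanishes, which is a handy sanity check.</p>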

<p>Overall, this paper adds to my belief that a successful system requires clever engineering.</p>]]></content><author><name>Shivam Rana</name></author><category term="Publication" /><category term="RecSys" /><summary type="html"><![CDATA[Paper link: Monolith: Real Time Recommendation System With Collisionless Embedding Table]]></summary></entry><entry><title type="html">Learning How to be a Mentor</title><link href="https://trigonaminima.github.io/2022/10/how-to-be-a-mentor/" rel="alternate" type="text/html" title="Learning How to be a Mentor" /><published>2022-10-13T00:00:00+00:00</published><updated>2022-10-13T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/10/how-to-be-a-mentor</id><content type="html" xml:base="https://trigonaminima.github.io/2022/10/how-to-be-a-mentor/"><![CDATA[<p>Some people are born coaches. I am not one of them.</p>

<p>One does not get mentoring lessons at school. There is not enough time to read books on effective coaching. My only guides have been two specific rules.</p>

<p>First is thinking about how I would have wanted my mentor to coach me. And then coach my mentee the same way.</p>

<p>The second is to observe my current mentors. Notice the techniques that enable my growth. Inculcate these methods in my mental model. Similarly, spot where they are ineffective and learn to avoid them.</p>

<h2 id="the-flaw">The Flaw</h2>

<p>Recently, my boss inadvertently showed me a flaw in my first rule.</p>

<blockquote>
  <p>Good design is thorough down to the last detail. Nothing must be arbitrary or left to chance. Care and accuracy in the design process show respect towards the consumer.</p>

  <p>- <a href="https://en.wikipedia.org/wiki/Dieter_Rams">Dieter Rams’</a> 8th principle of Good Design.</p>
</blockquote>

<p>That embodies my personality. I believe that my work should be proper (read, perfect). And, at times, I become rigid about maintaining that standard. Thanks to this, my output is usually of good quality (not bragging 😅). That satisfies me. The gratification keeps me intrinsically motivated. Thus, I continue to work like this.</p>

<p>My rule assumes how <em>I</em> would have wanted my mentor to coach <em>me</em>. That means that I presume my coach to have equally high standards. It is a flaw. Since everyone is different, my way of operating does not work for everyone.</p>

<h2 id="patience-is-a-virtue">Patience is a Virtue</h2>

<p>I also have a patience problem. When I think I can do something faster and the other person takes more time, then that annoys me. I have worked on this quite a lot in the last few years. But there is room for improvement.</p>

<h2 id="what-is-next">What is Next?</h2>

<p>My overarching goal is to be a good leader. I believe a good leader is also an effective mentor and coach. So, I am actively going to make myself a good mentor.</p>

<p>The following are the next steps for me:</p>

<ol>
  <li>Stop judging by the yardstick of “is this how I would have done it?”</li>
  <li>It is okay if things are not how I thought they would be. If it is 80% there, it is good enough.</li>
  <li>If there are areas of refinement, then definitely point them out.</li>
  <li>Stop thinking that I could have done it quickly. Get comfortable with others being slow/fast.</li>
</ol>

<p>Let’s see how it goes.</p>]]></content><author><name>Shivam Rana</name></author><category term="Leadership" /><summary type="html"><![CDATA[Some people are born coaches. I am not one of them.]]></summary></entry><entry><title type="html">[Summary] Deep Recurrent Neural Networks for OYO Hotels Recommendation</title><link href="https://trigonaminima.github.io/2022/10/oyo-hotels-recommendations/" rel="alternate" type="text/html" title="[Summary] Deep Recurrent Neural Networks for OYO Hotels Recommendation" /><published>2022-10-09T00:00:00+00:00</published><updated>2022-10-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/10/oyo-hotels-recommendations</id><content type="html" xml:base="https://trigonaminima.github.io/2022/10/oyo-hotels-recommendations/"><![CDATA[<p>Paper link: <a href="https://link.springer.com/chapter/10.1007/978-3-031-08333-4_20">
Deep Recurrent Neural Networks for OYO Hotels Recommendation</a></p>

<h2 id="abstract">Abstract</h2>

<ul>
  <li>A hybrid model with two parts:
    <ol>
      <li>Embedding generation: generate implicit embeddings of properties.</li>
      <li>Deep prediction and ranking model.</li>
    </ol>
  </li>
  <li>The model outperformed the existing collab-filtering model.</li>
</ul>

<h2 id="situationcontext">Situation/Context</h2>

<ul>
  <li>OYO’s current recommendation system
    <ul>
      <li>Graph-based Collaborative filtering model</li>
      <li>Optimised on browsing data as user feedback</li>
      <li>Objective: CTR</li>
    </ul>
  </li>
  <li>DL provides an opportunity to improve the system.</li>
</ul>

<h2 id="lit-review">Lit Review</h2>

<ul>
  <li>Conventional RecSys algos:
    <ol>
      <li>Collab-filtering,</li>
      <li>Content-based, and</li>
      <li>Hybrid</li>
    </ol>
  </li>
  <li><a href="https://dl.acm.org/doi/10.1145/2959100.2959190">YouTube’s 2016 paper</a> has demonstrated that DL-based RecSys can give SOTA results on high-volume data.</li>
  <li>MF only considers the linear combination of user and item latent vectors. Whereas, DL can capture non-linear user-item relationships.</li>
  <li>DL reduces the feature engineering efforts.</li>
  <li>RNNs model the temporal behaviour of user-item interactions: <strong>useful for session-based sequential recommendations</strong>. Conventional algos don’t capture this.</li>
  <li>Research on modelling user behaviour sequences using LSTM or GRUs
    <ul>
      <li><a href="https://dl.acm.org/doi/10.1145/3109859.3109877">Sequential User-based Recurrent Neural Network Recommendations, 2017</a></li>
      <li><a href="https://arxiv.org/abs/1706.03847">Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, 2018</a></li>
      <li><a href="https://arxiv.org/abs/1511.06939">Session-based Recommendations with Recurrent Neural Networks, 2016</a></li>
      <li><a href="https://arxiv.org/abs/1711.04725">Neural Attentive Session-based Recommendation, 2017</a></li>
      <li><a href="https://arxiv.org/abs/1706.04148">Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks, 2017</a></li>
      <li><a href="https://dl.acm.org/doi/10.1145/3018661.3018689">Recurrent Recommender Networks, 2017</a></li>
      <li><a href="https://dl.acm.org/doi/10.1145/2911451.2914683">A Dynamic Recurrent Model for Next Basket Recommendation, 2016</a></li>
    </ul>
  </li>
  <li>This <a href="https://dl.acm.org/doi/pdf/10.1145/3219819.3219885">Airbnb paper</a> (<a href="https://www.linkedin.com/pulse/embeddings-paper-review-real-time-personalization-using-malhotra/">summary</a>) takes the sequence of listing ids clicked by the users and trains a skip-gram word2vec model on it, then ranks using these embeddings.</li>
  <li>The authors of this paper mention that they improve it by adding entity features along with click data.</li>
</ul>

<h2 id="methodology">Methodology</h2>

<ul>
  <li>Embedding generation: generates embeddings of the hotels (intermediate output of the next step).</li>
  <li>Prediction and ranking model: gets top-n recommendations based on the following inputs:
    <ul>
      <li>The sequence of browsed hotels</li>
      <li>Embeddings of the browsed hotels</li>
      <li>Rating tokens of the browsed hotels</li>
      <li>Realisation tokens of the browsed hotels</li>
    </ul>
  </li>
</ul>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2022-10/1_oyo_recsys_schema.png" alt="OYO RecSys Schema" style="display:block;text-align:center" width="100%" />
</figure>

<ul>
  <li><strong>What was the candidate list of hotels?</strong> High-rated hotels?</li>
</ul>

<h3 id="embedding-gen">Embedding Gen</h3>

<ul>
  <li><strong>Explicit feedback</strong> requires effort from the customers; hence, <strong>ratings are sparse</strong>.</li>
  <li>Browsing data as user’s <strong>implicit feedback</strong>; thus, <strong>no sparsity</strong>.</li>
  <li>In this work, implicit features were derived using an RNN.
    <ul>
      <li>Embeddings were the intermediate output of the model training process.</li>
    </ul>
  </li>
</ul>

<h3 id="prediction-and-ranking-model">Prediction and Ranking Model</h3>

<ul>
  <li>Objective: realised bookings (conversion along with the realization of bookings)</li>
  <li>Implemented the following four methods: RNN, GRU, LSTM, and BiLSTM.</li>
  <li>Training data:
    <ul>
      <li>1 million users</li>
      <li>Sequences of their clicked hotels within a session</li>
    </ul>
  </li>
  <li>Pre-processing: padded and limited to 15 hotels.</li>
  <li>Model objective: the probability of the user making a realised booking at a high-rated hotel.</li>
  <li>Proposed architecture (disclaimer: I couldn’t grok it from the paper)
    <ol>
      <li>
        <p>Embedding layer: 100 dim</p>

        <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">num_embeddings</span><span class="o">=</span><span class="n">all_hotels</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div>        </div>
      </li>
      <li>Embedding concat layer (<code class="language-plaintext highlighter-rouge">torch.cat()</code>)
        <ul>
          <li><strong>Not sure why they concatenated the embeddings..</strong> The embedding tensor should have been input to the RNN layer. Otherwise, no <em>recurrence</em> will happen.</li>
        </ul>
      </li>
      <li>
        <p>2 BiLSTM layers</p>

        <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span>
     <span class="n">input_size</span><span class="o">=</span><span class="mi">1530</span><span class="p">,</span>
     <span class="n">hidden_size</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span>
     <span class="n">num_layers</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
     <span class="n">bidirectional</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
 <span class="p">)</span>
</code></pre></div>        </div>
      </li>
      <li>Flatten layer (<code class="language-plaintext highlighter-rouge">torch.nn.Flatten()</code>)</li>
      <li>
        <p>4 ReLU dense layers</p>

        <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Linear followed by ReLU; nn.ReLU itself does not take a layer as an argument</span>
 <span class="n">l1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">512</span><span class="o">*</span><span class="mi">2</span><span class="p">,</span> <span class="mi">512</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">())</span>
 <span class="n">l2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">())</span>
 <span class="n">l3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">())</span>
 <span class="n">l4</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">())</span>
</code></pre></div>        </div>
      </li>
      <li>Softmax layer</li>
      <li>Output layer</li>
    </ol>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2022-10/2_oyo_ranking.png" alt="" style="display:block;text-align:center" width="100%" />
  </figure>
  </li>
</ul>
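<p>The pre-processing step above (“padded and limited to 15 hotels”) can be sketched as follows. The paper doesn’t say which end is padded or which clicks are dropped, so keeping the most recent clicks, left-padding, and reserving id <code class="language-plaintext highlighter-rouge">0</code> for padding are all my assumptions.</p>

```python
PAD_ID = 0    # assumed padding token; real hotel ids would start at 1
MAX_LEN = 15  # sequence length mentioned in the notes above


def pad_or_truncate(session, max_len=MAX_LEN, pad_id=PAD_ID):
    """Keep the most recent `max_len` clicks and left-pad shorter sessions."""
    recent = session[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent


print(pad_or_truncate([3, 7]))
print(pad_or_truncate(list(range(1, 21))))
```

<p>As a guess at where <code class="language-plaintext highlighter-rouge">input_size=1530</code> could come from: if each of the 15 steps carried a 100-dim embedding plus one rating token and one realisation token, the concatenation would be 15 * 102 = 1530 values. That is only my reading, not something the paper states.</p>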

<h2 id="embedding-evaluation">Embedding Evaluation</h2>

<ul>
  <li>Embedding dimension: 100</li>
  <li>Get the top 10 similar hotels for all the hotels in the training dataset using cosine similarity.</li>
  <li>Four accuracy metrics:
    <ol>
      <li>Location</li>
      <li>Distance</li>
      <li>Price</li>
      <li>Ratings</li>
    </ol>
  </li>
  <li>
    <p>Metric formulation</p>

\[\text{Sim Index @ x} = \frac{\sum_{i \in H} \text{sim@x}(\text{top-10 hotels}, i)}{|H|}\]

    <ul>
      <li>\(x\) can be any of the following: Location, Distance, Price, Ratings</li>
      <li>\(H\) is a set of all query hotels;</li>
      <li>\(\text{sim@x}(\text{top-10 hotels}, i)\) is the similarity score for metric \(x\).</li>
      <li>Ranges between 0 and 10.</li>
    </ul>
  </li>
  <li>\(\text{sim@Location}\): fraction of top-10 hotels lying in the same city as the query hotel \(i\).</li>
  <li>\(\text{sim@Distance}\): fraction of top-10 hotels that are within a 20km radius of the query hotel \(i\).</li>
  <li>\(\text{sim@Price}\): fraction of top-10 hotels that are within +/-15% of the price of the query hotel \(i\).</li>
  <li>\(\text{sim@Ratings}\): fraction of top-10 hotels that are +/-1 rating from the query hotel \(i\).</li>
  <li>Following are the evaluation results with the winner highlighted:
    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2022-10/3_oyo_embed_eval.png" alt="" style="display:block;text-align:center" width="80%" />
  </figure>
  </li>
  <li>Qualitative eval also yielded positive results.</li>
</ul>
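<p>Here is a small sketch of how the metric above could be computed. The hotel records, field names (<code class="language-plaintext highlighter-rouge">city</code>, <code class="language-plaintext highlighter-rouge">price</code>), and function names are made up for illustration. I compute sim@x as a fraction, so the index in this sketch lies between 0 and 1; counting hotels instead of taking the fraction (i.e. multiplying by 10) would give the 0 to 10 scale.</p>

```python
def sim_at_location(query, neighbours):
    """Fraction of neighbours in the same city as the query hotel."""
    same = sum(1 for n in neighbours if n["city"] == query["city"])
    return same / len(neighbours)


def sim_at_price(query, neighbours, tolerance=0.15):
    """Fraction of neighbours priced within +/-15% of the query hotel."""
    allowed = tolerance * query["price"]
    close = sum(1 for n in neighbours if allowed >= abs(n["price"] - query["price"]))
    return close / len(neighbours)


def sim_index(queries, top_neighbours, sim_fn):
    """Average sim@x over all query hotels (the Sim Index @ x formula)."""
    scores = [sim_fn(q, top_neighbours[q["id"]]) for q in queries]
    return sum(scores) / len(scores)
```

<p>The neighbour lists would come from the cosine-similarity top-10 lookup on the embeddings; the Distance and Ratings variants follow the same pattern with a different predicate.</p>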

<h2 id="ranking-model-evaluation">Ranking Model Evaluation</h2>

<ul>
  <li>Offline evaluation metrics: Hit Ratio, <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">MRR</a></li>
  <li>
    <p>\(\text{Precision@k}\) or Hit Ratio: fraction of users for which the booked hotel was among the top-k recommendations.</p>

\[\text{Precision@k} = \frac{U_{hit}^k}{U_{all}}\]
  </li>
  <li>15 total model variants:
    <ul>
      <li>3 variants with basic RNN</li>
      <li>4 variants each with LSTM, GRU, and BiLSTM</li>
    </ul>
  </li>
  <li>Selected one variant from each model type based on validation results.</li>
  <li>Created a dataset aligned with the real-time environment. (Session logs?)</li>
  <li>Out-of-time validation on this dataset.</li>
  <li>The BiLSTM variant was the best-performing model.
    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2022-10/4_oyo_rank_eval.png" alt="" style="display:block;text-align:center" width="80%" />
  </figure>
  </li>
  <li>Online evaluation metrics:
    <ul>
      <li>Realized bookings at high-rated hotels.</li>
      <li>C*R (multiplication of booking conversion and realization of bookings) at high-rated hotels.</li>
    </ul>
  </li>
  <li>Observed lifts of 3% to 6% in realized hotel bookings across different geographies.</li>
</ul>
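<p>The two offline metrics above are simple enough to write out. This is a minimal sketch with hypothetical inputs: <code class="language-plaintext highlighter-rouge">recommendations</code> maps each user to a ranked list of hotel ids and <code class="language-plaintext highlighter-rouge">booked</code> maps each user to the single hotel they booked. Users whose booked hotel is missing from the list contribute 0 to MRR, which is a common convention and my assumption here.</p>

```python
def hit_ratio_at_k(recommendations, booked, k):
    """Fraction of users whose booked hotel appears in their top-k list."""
    hits = sum(1 for user, hotel in booked.items() if hotel in recommendations[user][:k])
    return hits / len(booked)


def mean_reciprocal_rank(recommendations, booked):
    """Average of 1/rank of the booked hotel; a miss contributes 0."""
    total = 0.0
    for user, hotel in booked.items():
        ranked = recommendations[user]
        if hotel in ranked:
            total += 1.0 / (ranked.index(hotel) + 1)
    return total / len(booked)
```

<p>Hit Ratio only asks whether the booked hotel made the cut, while MRR also rewards placing it higher, which is why the two are usually reported together.</p>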

<h2 id="review-conclusion">Review Conclusion</h2>

<ul>
  <li>The paper proposed building a DL model with two parts: embedding gen and ranking model.</li>
  <li>The embeddings are the intermediate output of the ranking model. Not sure why it is called a separate model in the paper.</li>
  <li>The model is an important part of this paper, yet
    <ul>
      <li>It does not discuss the training data construction in detail.</li>
      <li>A few missing details about the architecture made it difficult to comprehend.</li>
      <li>There was no discussion about inference and the candidate set of hotels to rank.</li>
    </ul>
  </li>
  <li>The embedding evaluation framework was comprehensive and quantified the effectiveness of the embeddings.</li>
  <li>Model evaluation methodology followed the standard process of train-time validation and out-of-time validation steps.</li>
  <li>One thing lacking was comparison with tree-based models like gradient boosted trees which have shown good performance in recommendation tasks in both industry and research.</li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="Publication" /><category term="RecSys" /><summary type="html"><![CDATA[Paper link: Deep Recurrent Neural Networks for OYO Hotels Recommendation]]></summary></entry><entry><title type="html">Passing Multiple Parameters in PySpark MapPartitions</title><link href="https://trigonaminima.github.io/2022/09/pyspark-mappartitions-multiple-parameters/" rel="alternate" type="text/html" title="Passing Multiple Parameters in PySpark MapPartitions" /><published>2022-09-25T00:00:00+00:00</published><updated>2022-09-25T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/09/pyspark-mappartitions-multiple-parameters</id><content type="html" xml:base="https://trigonaminima.github.io/2022/09/pyspark-mappartitions-multiple-parameters/"><![CDATA[<p><strong>Alternate title</strong>: k-Nearest Neighbours (kNN) in PySpark</p>

<p>You can follow the story of what I wanted to do and how I did it. Or <a href="#final-solution">jump</a> to the solution.</p>

<h2 id="situation">Situation</h2>

<ul>
  <li>The PySpark MLlib (<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html">DataFrame-based</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.mllib.html">RDD-based</a>) does not support <a href="https://en.wikipedia.org/wiki/K-nearest_neighbours_algorithm">kNN algorithm</a>.</li>
  <li>The multiple SO questions (<a href="https://stackoverflow.com/q/39509095/2650427">1</a>, <a href="https://stackoverflow.com/q/37767790/2650427">2</a>, <a href="https://stackoverflow.com/q/62896411/2650427">3</a>) did not help.</li>
  <li>I did not get a chance to try the available open-source code samples (<a href="https://github.com/saurfang/spark-knn">1</a>, <a href="https://github.com/jakac/spark-python-knn">2</a>).</li>
  <li>There were two stable-ish solutions available:
    <ol>
      <li>Use Spotify’s library called <a href="https://github.com/spotify/annoy">Annoy</a>.</li>
      <li>Use Scikit-learn’s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html">implementation of kNN</a>.</li>
    </ol>
  </li>
  <li>Both methods only use a single node. So, the benefit of Spark’s distributed processing goes out of the window.</li>
  <li>That’s a deal breaker because I have a large data set (20+ million records) with long vectors.</li>
  <li>I picked Annoy because I found it first. I discuss at the end why Scikit could be more performant.</li>
</ul>

<p>The task is to parallelise the Annoy code across multiple nodes of the Spark cluster.</p>

<h2 id="solution">Solution</h2>

<ul>
  <li>A hint about the solution is present in this <a href="https://stackoverflow.com/a/38626686/2650427">SO Answer</a>.</li>
  <li>For both Annoy and Scikit, the approach is as follows:
    <ol>
      <li>Build the index or fit the model on a single node. Nothing is distributed here.</li>
      <li>Broadcast the index (or model) across the cluster to find the nearest neighbours of a given vector.</li>
    </ol>
  </li>
</ul>

<h3 id="building-the-index">Building the index</h3>

<ul>
  <li>I first tried to use <a href="https://github.com/mskimm/spark-annoy">spark-annoy</a>. It is in Scala. The benefit of this library was that we could build the index in a distributed manner. Unfortunately, I could not figure it out.</li>
  <li>The default was to use the iterative approach of building the index on a single node.</li>
  <li>The following is the method to build the index:</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">annoy</span> <span class="kn">import</span> <span class="n">AnnoyIndex</span>

<span class="k">def</span> <span class="nf">build_annoy_index</span><span class="p">(</span><span class="n">vectors</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_trees</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span><span class="p">):</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">AnnoyIndex</span><span class="p">(</span><span class="n">dim</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">'angular'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">vector</span> <span class="ow">in</span> <span class="n">vectors</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">t</span><span class="p">.</span><span class="n">add_item</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="n">vector</span><span class="p">)</span>
    <span class="n">t</span><span class="p">.</span><span class="n">build</span><span class="p">(</span><span class="n">num_trees</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">t</span>
</pre></td></tr></tbody></table></code></pre></figure>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">for</code> loop makes it a long-running process if the data is large. Sadly, it is unavoidable.</li>
  <li>Note that the type of <code class="language-plaintext highlighter-rouge">vectors</code> is <code class="language-plaintext highlighter-rouge">pd.Series</code>. I used pandas to get the goodness of indexes. Any other local collection of id-vector pairs (say, a <code class="language-plaintext highlighter-rouge">list</code> walked with <code class="language-plaintext highlighter-rouge">enumerate()</code>) would also work with a small change to the loop. Either way, the data has to be a local iterable on the driver, not a distributed DataFrame. That means either of the following:
    <ol>
      <li>Run <code class="language-plaintext highlighter-rouge">.collect()</code> on the Spark DataFrame;</li>
      <li>Turn the spark DataFrame into a pandas DataFrame.</li>
    </ol>
  </li>
  <li>That will bring all the data to a single node. So, it can potentially lead to an OOM error.</li>
</ul>

<h3 id="finding-nearest-neighbours">Finding Nearest Neighbours</h3>

<ul>
  <li>We can parallelise this step.</li>
  <li>We have to broadcast the Annoy index across all the nodes of the Spark cluster.</li>
  <li>
    <p>The Annoy indexes are memory mapped.</p>

    <blockquote>
      <p>It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.</p>
    </blockquote>
  </li>
  <li>It will fail if we broadcast it using <code class="language-plaintext highlighter-rouge">sc.broadcast(t)</code>. This <a href="https://stackoverflow.com/a/35190477/2650427">SO answer</a> discusses this issue.</li>
  <li>The solution: write the index to a file and send the file to all the workers to load.</li>
  <li>Use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addFile.html"><code class="language-plaintext highlighter-rouge">sc.addFile()</code></a> to send the file to the workers.</li>
  <li>Use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkFiles.get.html#pyspark.SparkFiles.get"><code class="language-plaintext highlighter-rouge">SparkFiles.get()</code></a> to get the file path and load it in the worker node.</li>
  <li>Here is the method to load the Annoy index in the worker nodes:</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">load_annoy_index</span><span class="p">(</span><span class="n">index_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">annoy</span> <span class="kn">import</span> <span class="n">AnnoyIndex</span>
    <span class="kn">from</span> <span class="nn">pyspark</span> <span class="kn">import</span> <span class="n">SparkFiles</span>

    <span class="c1"># index_file is the file name that was registered with sc.addFile()</span>
    <span class="n">index</span> <span class="o">=</span> <span class="n">AnnoyIndex</span><span class="p">(</span><span class="n">dim</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">'angular'</span><span class="p">)</span>
    <span class="n">index</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">SparkFiles</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">index_file</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">index</span>
</pre></td></tr></tbody></table></code></pre></figure>

<ul>
  <li>I call the below method to get the nearest neighbours of a set of index ids:</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">find_neighbours</span><span class="p">(</span><span class="n">index_file</span><span class="p">,</span> <span class="n">top_n</span><span class="p">,</span> <span class="n">dim</span><span class="p">,</span> <span class="n">item_batch</span><span class="p">):</span>
    <span class="c1"># load the index once per batch, not once per item
</span>    <span class="n">index</span> <span class="o">=</span> <span class="n">load_annoy_index</span><span class="p">(</span><span class="n">index_file</span><span class="p">,</span> <span class="n">dim</span><span class="p">)</span>

    <span class="c1"># get similar items
</span>    <span class="n">sim_items</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">item_batch</span><span class="p">:</span>
        <span class="n">top_n_items</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="n">get_nns_by_item</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="n">item</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">top_n</span><span class="p">)</span>
        <span class="c1"># pair every neighbour with its rank, starting at 0
</span>        <span class="n">sim_items</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">item</span><span class="p">,</span> <span class="nb">list</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">top_n_items</span><span class="p">))))</span>
    <span class="k">return</span> <span class="n">sim_items</span>
</pre></td></tr></tbody></table></code></pre></figure>

<ul>
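  <li>The <code class="language-plaintext highlighter-rouge">item_batch</code> is the batch of Annoy index ids to query. When the function runs under <code class="language-plaintext highlighter-rouge">mapPartitions</code>, it is an iterator over the ids of one partition.</li>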
  <li>The function returns the list of <code class="language-plaintext highlighter-rouge">(annoy_item_index, [(rank_1, sim_item_1), ..., (rank_n, sim_item_n)])</code>.</li>
  <li>For example, here is one such item from the list:</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(248, [[0, 248], [9, 284764], [3, 86148], [6, 265812], [7, 508155], [2, 48388], [10, 58786], [1, 154653], [5, 364419], [4, 4444], [8, 89955]])
</code></pre></div></div>
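
<p>A result in this shape can be sanity-checked in code. The following is a minimal sketch in plain Python, not part of the original pipeline: <code class="language-plaintext highlighter-rouge">query_ranks_first</code> is a hypothetical helper name, and the sample tuple is a shortened copy of the example above.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def query_ranks_first(result):
    # result is (query_index, [(rank, neighbour_index), ...]).
    # Returns True when the query item itself sits at rank 0.
    query_index, ranked = result
    return any(rank == 0 and neighbour == query_index
               for rank, neighbour in ranked)


# Shortened copy of the example result shown above.
example = (248, [[0, 248], [9, 284764], [3, 86148], [1, 154653]])
print(query_ranks_first(example))
</code></pre></figure>

<p>The check does not rely on the pairs being sorted by rank, which matters because the example above lists them out of order.</p>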

<ul>
  <li>As a sanity check, the item most similar to the query item should be the query item itself. In the above example, the query index <code class="language-plaintext highlighter-rouge">248</code> ranks <code class="language-plaintext highlighter-rouge">0</code> in the top similar items.</li>
  <li>To get the nearest neighbours by vectors, you pass the vectors in the <code class="language-plaintext highlighter-rouge">item_batch</code> and use the <code class="language-plaintext highlighter-rouge">get_nns_by_vector</code> method.</li>
  <li>I keep my <code class="language-plaintext highlighter-rouge">find_neighbours</code> method generic for the following parameters:
    <ul>
      <li>Annoy index file name: I can have any name based on my use case.</li>
      <li>Top n items: number of top items I want to retrieve.</li>
      <li>Dim: The dimension of the vectors varies with the ML technique that produced them (LDA, DL, etc.).</li>
    </ul>
  </li>
  <li>We have only written the method to find the nearest neighbours. How do we call it in a distributed manner? This <a href="https://stackoverflow.com/a/35190477/2650427">SO answer</a> answers that too.</li>
  <li>The answer is <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.mapPartitions.html"><code class="language-plaintext highlighter-rouge">mapPartitions</code></a>. This method will apply the passed function to each RDD partition. Go through the answers of this <a href="https://stackoverflow.com/q/21185092/2650427">SO question</a> to know more in detail.</li>
  <li>I will pass <code class="language-plaintext highlighter-rouge">find_neighbours</code> to the <code class="language-plaintext highlighter-rouge">mapPartitions</code>, and it will return an RDD with the nearest neighbours list.</li>
  <li>
    <div id="final-solution"></div>
    <p>But my <code class="language-plaintext highlighter-rouge">find_neighbours</code> implementation takes four parameters, while <code class="language-plaintext highlighter-rouge">mapPartitions</code> calls the function it is given with a single argument: the partition iterator. There is no way to pass extra arguments through it.</p>
  </li>
  <li>
    <p>I use the <a href="https://docs.python.org/3/library/functools.html#functools.partial"><code class="language-plaintext highlighter-rouge">partial()</code> function</a> from Python’s standard-library <code class="language-plaintext highlighter-rouge">functools</code> module.</p>

    <blockquote>
      <p>The <code class="language-plaintext highlighter-rouge">partial()</code> is used for partial function application which “freezes” some portion of a function’s arguments and/or keywords resulting in a new object with a simplified signature.</p>
    </blockquote>
  </li>
  <li>Here is how my final function looks:</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>


<span class="k">def</span> <span class="nf">build_index_get_similar_items</span><span class="p">(</span><span class="n">vector_df</span><span class="p">,</span> <span class="n">index_file</span><span class="p">,</span> <span class="n">top_n</span><span class="p">):</span>
    <span class="c1"># sc, dim, and BASE_DIR are expected to be defined at module level
</span>    <span class="n">vectors</span> <span class="o">=</span> <span class="n">vector_df</span><span class="p">.</span><span class="n">vector</span>
    <span class="n">sparkvector_ids</span> <span class="o">=</span> <span class="n">sc</span><span class="p">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">vector_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>

    <span class="c1"># build and save index
</span>    <span class="n">index_file</span> <span class="o">=</span> <span class="n">build_save_annoy_index</span><span class="p">(</span><span class="n">vectors</span><span class="p">,</span> <span class="n">dim</span><span class="p">,</span> <span class="n">index_file</span><span class="p">,</span> <span class="n">BASE_DIR</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">index_file</span><span class="p">)</span>

    <span class="c1"># ship the index file to every node of the cluster
</span>    <span class="n">sc</span><span class="p">.</span><span class="n">addFile</span><span class="p">(</span><span class="n">index_file</span><span class="p">)</span>

    <span class="c1"># freeze every argument except the partition iterator, then get similar items
</span>    <span class="n">find_neighbours_</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">find_neighbours</span><span class="p">,</span> <span class="n">index_file</span><span class="p">,</span> <span class="n">top_n</span><span class="p">,</span> <span class="n">dim</span><span class="p">)</span>
    <span class="n">similar_items</span> <span class="o">=</span> <span class="n">sparkvector_ids</span><span class="p">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="n">find_neighbours_</span><span class="p">).</span><span class="n">collect</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">similar_items</span>
</pre></td></tr></tbody></table></code></pre></figure>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">build_save_annoy_index()</code> method builds the index, saves it to a file, and returns the file path.</li>
  <li>Finally, we see the use of <code class="language-plaintext highlighter-rouge">sc.addFile(index_file)</code>.</li>
  <li>The <code class="language-plaintext highlighter-rouge">find_neighbours_()</code> is the partial function. We froze the <code class="language-plaintext highlighter-rouge">index_file</code>, <code class="language-plaintext highlighter-rouge">top_n</code>, and <code class="language-plaintext highlighter-rouge">dim</code>. This function now expects a single argument: the iterator over one RDD partition. And this is what we wanted for the <code class="language-plaintext highlighter-rouge">mapPartitions()</code> method.</li>
</ul>
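
<p>The <code class="language-plaintext highlighter-rouge">partial()</code> trick can be tried without a cluster. Here is a minimal sketch in plain Python under stated assumptions: <code class="language-plaintext highlighter-rouge">fake_map_partitions</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">RDD.mapPartitions</code>, <code class="language-plaintext highlighter-rouge">find_neighbours_stub</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">find_neighbours</code> that loads no Annoy index, and the file name and numbers are made up.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from functools import partial


def find_neighbours_stub(index_file, top_n, dim, item_batch):
    # Hypothetical stand-in for find_neighbours: no Annoy index is loaded.
    # It pretends the neighbours of item i are i, i + 1, ..., i + top_n - 1.
    results = []
    for item in item_batch:
        neighbours = [item + offset for offset in range(top_n)]
        results.append((item, list(enumerate(neighbours))))
    return results


def fake_map_partitions(func, partitions):
    # Hypothetical stand-in for RDD.mapPartitions: call func once per
    # partition with an iterator over that partition, then flatten.
    output = []
    for partition in partitions:
        output.extend(func(iter(partition)))
    return output


# Freeze the first three positional arguments; only item_batch is left open.
find_neighbours_ = partial(find_neighbours_stub, "items.ann", 2, 64)

# Two made-up partitions holding the ids [0, 1] and [2].
similar_items = fake_map_partitions(find_neighbours_, [[0, 1], [2]])
print(similar_items)
</code></pre></figure>

<p>Because the frozen arguments are positional, the one left open has to be the last parameter. That is why <code class="language-plaintext highlighter-rouge">item_batch</code> comes last in the signature.</p>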

<h3 id="saving-results">Saving Results</h3>

<ul>
  <li>I take the <code class="language-plaintext highlighter-rouge">similar_items</code> list and convert it into a pandas DataFrame.</li>
  <li>Map ALL the Annoy index ids with the actual item ids. That includes all the index ids of the top-n similar items list.</li>
  <li>Convert the pandas DataFrame to a PySpark DataFrame.</li>
  <li>Save the PySpark DataFrame into a delta table.</li>
</ul>
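
<p>The id-mapping step can be sketched as follows. This is a plain-Python illustration rather than the pandas code used in the post: <code class="language-plaintext highlighter-rouge">to_rows</code>, the column names, and the sample ids are all hypothetical.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def to_rows(similar_items, index_to_item_id):
    # Flatten to one row per (query item, rank, similar item) and replace
    # every Annoy index id, query and neighbour alike, with the real item id.
    rows = []
    for query_index, ranked in similar_items:
        for rank, neighbour_index in ranked:
            rows.append({
                "item_id": index_to_item_id[query_index],
                "rank": rank,
                "similar_item_id": index_to_item_id[neighbour_index],
            })
    return rows


# Made-up neighbour lists and a made-up Annoy-id to item-id mapping.
similar_items = [(0, [(0, 0), (1, 2)]), (1, [(0, 1), (1, 0)])]
index_to_item_id = {0: "sku-a", 1: "sku-b", 2: "sku-c"}

rows = to_rows(similar_items, index_to_item_id)
for row in rows:
    print(row)
</code></pre></figure>

<p>A list of dicts like <code class="language-plaintext highlighter-rouge">rows</code> can be passed straight to <code class="language-plaintext highlighter-rouge">pd.DataFrame(rows)</code> to continue with the pandas steps above.</p>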

<h2 id="result">Result</h2>

<ul>
  <li>I was able to parallelise the kNN search based on Annoy using <code class="language-plaintext highlighter-rouge">mapPartitions</code>.</li>
  <li>On ~500k records, the run time was down from 8 minutes to 2 minutes.</li>
  <li>On ~10 million records (with an index built from ~500k records), the run time was ~1 hour.</li>
  <li>On ~10 million records (with an index built from ~10 million records), I got an OOM error. 🥲</li>
</ul>

<h2 id="whats-next">What’s Next</h2>

<ul>
  <li>Find the reason it is going OOM.</li>
  <li>Converting the pandas DataFrame to a PySpark DataFrame is expensive. I want to explore whether I can go directly from the pandas DataFrame to a Delta table. Ref: <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_delta.html">1</a>, <a href="https://stackoverflow.com/a/72759021/2650427">2</a>.</li>
  <li>Since scikit-learn has vectorised training and inference, its kNN would likely be faster. This <a href="https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-Scikit-learn-visualizing-eigenvectors-and-fun/">post</a> shows how to do it. I would probably replace the <code class="language-plaintext highlighter-rouge">map</code> with <code class="language-plaintext highlighter-rouge">mapPartitions</code>.</li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="Python" /><summary type="html"><![CDATA[Alternate title: k-Nearest Neighbours (kNN) in PySpark]]></summary></entry><entry><title type="html">Personalising the Swiggy Homepage Layout - Part I</title><link href="https://trigonaminima.github.io/2022/06/swiggy-homepage-1/" rel="alternate" type="text/html" title="Personalising the Swiggy Homepage Layout - Part I" /><published>2022-06-16T00:00:00+00:00</published><updated>2022-06-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/06/swiggy-homepage-1</id><content type="html" xml:base="https://trigonaminima.github.io/2022/06/swiggy-homepage-1/"><![CDATA[]]></content><author><name>Shivam Rana</name></author><category term="work" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Personalising the Swiggy Homepage Layout - Part II</title><link href="https://trigonaminima.github.io/2022/06/swiggy-homepage-2/" rel="alternate" type="text/html" title="Personalising the Swiggy Homepage Layout - Part II" /><published>2022-06-16T00:00:00+00:00</published><updated>2022-06-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/06/swiggy-homepage-2</id><content type="html" xml:base="https://trigonaminima.github.io/2022/06/swiggy-homepage-2/"><![CDATA[]]></content><author><name>Shivam Rana</name></author><category term="work" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Workation in Pondicherry 📍</title><link href="https://trigonaminima.github.io/2022/04/pondi/" rel="alternate" type="text/html" title="Workation in Pondicherry 📍" /><published>2022-04-28T00:00:00+00:00</published><updated>2022-04-28T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/04/pondi</id><content type="html" xml:base="https://trigonaminima.github.io/2022/04/pondi/"><![CDATA[<h3 id="-28th-april">📆 28th April</h3>

<p><strong>⌛ 10:00 PM</strong></p>
<ul>
  <li>Started for the bus boarding point.</li>
  <li>Bus was 15 minutes early for the first time ever. Asked the cab driver to speed it up.</li>
</ul>

<p><strong>⌛ 10:30 PM</strong></p>
<ul>
  <li>Boarded the bus. It started within a minute.</li>
  <li>After getting on the bus, gave a status update to mum 🙋‍♀️. A friend also called later.</li>
  <li>Slept at ⌚ 12:20 AM.</li>
</ul>

<p><br /></p>

<h3 id="-29th-april">📆 29th April</h3>

<p><strong>⌛ 06:00 AM</strong></p>
<ul>
  <li>Woke up.</li>
  <li>Watched the sunrise from the bus. The original plan was to go to the <a href="https://goo.gl/maps/vXFeqa9JpENj5bGU9">Rock Beach 📍</a> from Pondicherry 📍 bus stand and watch the sun rise.</li>
  <li>Pinged a few friends about my arrival.</li>
  <li>I got worried when the driver took a different route. He later converged on the correct route.</li>
</ul>

<p><strong>⌛ 07:30 AM</strong></p>
<ul>
  <li>State bus for <a href="https://goo.gl/maps/REUTbvu8zrWP4w8P6">Auroville beach 📍</a>. Rs. 12 ticket. Empty seat.</li>
  <li>Feeling the heat a bit.</li>
</ul>

<p><strong>⌛ 08:00 AM</strong></p>
<ul>
  <li>Walked to <a href="https://goo.gl/maps/69tmUS8LN1iTwvwJA">Vagabond 📍</a> from bus stand.</li>
  <li>Sriram 🙋‍♂️, the property manager, showed me to the top bunk bed on the ground floor.</li>
  <li>Met Saiprasanth 🙋‍♂️ and Santosh 🙋‍♂️.</li>
  <li>Brushed my teeth and went to the terrace.</li>
  <li>Met two new people from Bengal/Odisha. They both have a recruiting company in Bangalore. They came 30 minutes before me. They had booked the Vagabond Airbnb.</li>
  <li>Felt hungry. Sriram 🙋‍♂️ suggested <a href="https://goo.gl/maps/W9ueUPiDEG7mhiXW8">Auroville Bakery 📍</a>.</li>
</ul>

<p><strong>⌛ 08:20 AM</strong></p>
<ul>
  <li>Took one of the property bicycles to the bakery. Cycled ~2 KMs.</li>
  <li>Did not want to experiment. So ordered Idli and Red Rice Cheese Dosa. The dosa was not good. :/</li>
  <li>Breakfast was filling.</li>
  <li>Came back feeling super hot.</li>
</ul>

<p><strong>⌛ 09:20 AM</strong></p>
<ul>
  <li>Met Tanusha 🙋‍♀️ and Anushka 🙋‍♀️.</li>
  <li>Explored the property’s garden and terrace.</li>
  <li>So many <a href="https://en.wikipedia.org/wiki/Jackfruit">Jackfruit</a> trees. A few mango trees as well.</li>
  <li>Worked on the Swiggy blog for a few minutes.</li>
</ul>

<p><br /></p>

<h3 id="-3rd-may">📆 3rd May</h3>

<p><strong>⌛ 10:30 PM</strong></p>
<ul>
  <li>Came to the terrace.</li>
  <li>Migrated the blog google analytics tag.</li>
  <li>Started Pondi travelogue. Published it.</li>
</ul>

<p><br /></p>

<h3 id="-4th-may">📆 4th May</h3>

<p><strong>⌛ 12:30 AM</strong></p>
<ul>
  <li>Went down to sleep.</li>
  <li>Read <a href="https://amzn.to/3MGnki6">Name, Place, Animal, Thing 📖</a> for 30 minutes.</li>
</ul>

<p><strong>⌛ 09:00 AM</strong></p>
<ul>
  <li>Finished Anki.</li>
  <li>Breakfast: mango shake and 1 chila. Naveen 🙋‍♂️ prepared it for everyone at the hostel. Amazing guy. He loves to do this. This is a very different feeling for me. A good different.</li>
</ul>

<p><strong>⌛ 10:00 AM</strong></p>
<ul>
  <li>Started work. Working on a terrace is a mind blowing experience.</li>
</ul>

<p><strong>⌛ 01:30 PM</strong></p>
<ul>
  <li>Went to Sustenance Cafe with Naveen 🙋‍♂️ on Santosh 🙋‍♂️’s scooty.</li>
  <li>I had a South Indian meal (it was good) for 150 bucks. Naveen 🙋‍♂️ had 3 chapati and sabji for 90 bucks (not so good).</li>
  <li>I met a French guy named Benoit 🙋‍♂️. He is on a two-week long tour in India. He’s going to go to Mysore tomorrow. From India, he is going to London. He is a technology writer. I told him about Swiggy.</li>
  <li>We shared Instagram deets.</li>
  <li>Inquired about a scooty for us. Rs 3000 for a Pep for a month. Rs 3500 for an Access for a month. Naveen 🙋‍♂️ and I will share it till he is here. Post that it will be with me.</li>
  <li>We will get it in the evening.</li>
</ul>

<p><strong>⌛ 02:30 PM</strong></p>
<ul>
  <li>Back to working on the terrace.</li>
  <li>In a meeting, my boss asked me how my workation was going. Told him all the fun things I did.</li>
</ul>

<p><strong>⌛ 06:00 PM</strong></p>
<ul>
  <li>Fellow hostelers left for a swim on the <a href="https://goo.gl/maps/z5KNS3Uwcagjub766">Serenity Beach 📍</a>. I was in a meeting at that time. But they were thoughtful enough to leave the key to a scooty for me.</li>
  <li>I drove it to the beach without any incident. It was the first time I had ridden a scooty on the main road. Traffic was light so it was easy. I am feeling more confident now. I forgot the road leading to the beach, but a few shopkeepers helped.</li>
  <li>Joined these folks. It was low tide, but the waves were big enough for light surfing. Sriram 🙋‍♂️ was on his surfboard. Santosh 🙋‍♂️, Naveen 🙋‍♂️, and I swam in the water.</li>
  <li>I had a goal of going for a swim three times this week. I achieved that.</li>
</ul>

<p><strong>⌛ 08:36 PM</strong></p>
<ul>
  <li>Went to <a href="https://goo.gl/maps/Z69pcAf4sKALtvhM9">Jeffi Restaurant 📍</a> for dinner. With Naveen 🙋‍♂️, Santosh 🙋‍♂️, and Ved 🙋‍♂️. This place was suggested by Sriram 🙋‍♂️.</li>
  <li>Finished the objective of visiting three new restaurants.</li>
  <li>Today is Ved 🙋‍♂️’s last night in Pondicherry 📍.</li>
  <li>We had tandoori chicken, Chettinad chicken, phulka, and butter roti. Spent Rs 915.</li>
  <li>Went to <a href="https://g.page/richy-rich-ice-creams?share">Richy Rich Ice Creams 📍</a>. Ate a 120 bucks scoop of Belgian Almond Chocolate ice cream.</li>
</ul>

<p><strong>⌛ 10:00 PM</strong></p>
<ul>
  <li>Reached the <a href="https://goo.gl/maps/69tmUS8LN1iTwvwJA">Vagabond 📍</a>.</li>
  <li>Came to terrace and relaxed till ⌚ 2 AM. Slept by ⌚ 2:20 AM.</li>
</ul>

<p><br /></p>

<h3 id="-5th-may">📆 5th May</h3>

<p><strong>⌛ 09:15 AM</strong></p>
<ul>
  <li>Woke up. Finished Anki.</li>
  <li>Naveen 🙋‍♂️ had texted saying that two chocolate croissants 🥐 were kept on the dining table. Again, so thoughtful of him.</li>
</ul>

<p><strong>⌛ 10:25 AM</strong></p>
<ul>
  <li>Finished my croissants 🥐. Not warm, but still tasted good.</li>
  <li>MacBook started updating. Useless laptop. Finally, finished updating right before I had to take an interview. Phew.</li>
</ul>

<p><strong>⌛ 01:10 PM</strong></p>
<ul>
  <li>I wanted to try out <a href="https://goo.gl/maps/xBfD8ogHXNd1Badr7">Sree Andhra Tiffins 📍</a>. Others didn’t want to go out. So, went alone.</li>
  <li>Went by scooty. I was able to maneuver through the traffic, red lights, and local Pondicherry market. Felt accomplished.</li>
  <li>Plan of gaining confidence in riding a 2-wheeler is going well.</li>
</ul>

<p><strong>⌛ 01:35 PM</strong></p>
<ul>
  <li>Ate a classic meal. It was a large and sumptuous meal 😋.</li>
  <li>It was served on a banana leaf. It contained: three dishes, rice + podi + ghee, pickled garlic, rice + sambhar, rice + rasam, rice + curd curry, papadam, and mango slices.</li>
  <li>All this for 110 bucks.</li>
  <li>Got a meal packed for others.</li>
  <li>The person sitting opposite me has been eating at this place since 2008. Whenever he visits Pondicherry 📍, he always eats his meals here.</li>
</ul>

<p><strong>⌛ 02:02 PM</strong></p>
<ul>
  <li>I had to come back to the hostel by ⌚ 2:00 PM for a meeting. Unfortunately, I was still at the restaurant.</li>
  <li>Joined the call from my phone, listening to everything on my earphones. Hard to manage because I had to stop at places to participate.</li>
  <li>Reached back at ⌚ 2:30 PM.</li>
</ul>

<p><strong>⌛ 05:30 PM</strong></p>
<ul>
  <li>Met Dinesh 🙋‍♂️. He checked in today. He drove from Bangalore 📍 to Pondicherry 📍. He is an iOS developer.</li>
</ul>

<p><strong>⌛ 06:20 PM</strong></p>
<ul>
  <li>Reached the <a href="https://goo.gl/maps/z5KNS3Uwcagjub766">Serenity Beach 📍</a> with Naveen 🙋‍♂️, Santosh 🙋‍♂️, Sriram 🙋‍♂️, and Dinesh 🙋‍♂️.</li>
  <li>I drove the scooty with Naveen 🙋‍♂️ as a pillion rider. 💪</li>
  <li>I swam three long laps. I caught two waves 🏄‍♂️ with Sriram 🙋‍♂️’s help.</li>
</ul>

<p><strong>⌛ 08:20 PM</strong></p>
<ul>
  <li>Naveen 🙋‍♂️ had started cooking 🧑‍🍳.</li>
  <li>He was preparing an <a href="https://en.wikipedia.org/wiki/Okra">okra</a> dish.</li>
  <li>I assisted him. I cut onions. I prepared <a href="https://en.wikipedia.org/wiki/Chapati">chapati</a> <a href="https://en.wikipedia.org/wiki/Dough">dough</a>. I also prepared a banana shake. Both of us prepared 9 chapatis.</li>
  <li>It took us ⌚ 1.5 hours to prepare the whole meal 🧑‍🍳. We talked about multiple things while cooking.</li>
  <li>Five of us ate together. It was nice.</li>
</ul>

<p><strong>⌛ 10:00 PM</strong></p>
<ul>
  <li>Told my mum 🙋‍♀️ about today.</li>
  <li>Talked to Naveen 🙋‍♂️ about his CFA certification.</li>
  <li>Wrote today’s travelogue.</li>
</ul>

<p><br /></p>

<h3 id="-6th-may">📆 6th May</h3>

<p><strong>⌛ 12:00 AM</strong></p>
<ul>
  <li>Listening to Santosh 🙋‍♂️ and Dinesh 🙋‍♂️ talking. They are drunk.</li>
  <li>Dinesh 🙋‍♂️ is going to resign on Monday and move to Canada in August.</li>
  <li>Dinesh 🙋‍♂️ went to get more beers. Santosh 🙋‍♂️ started talking about <a href="https://goo.gl/maps/69tmUS8LN1iTwvwJA">Vagabond 📍</a>. He says, “Every week I think about moving to other city or place, but then I don’t.” He goes on: “It is never the place. It is always the people. People make it worthwhile. I always meet someone new.”</li>
  <li>He came to Pondicherry for only 2 days. It is his third month.</li>
  <li>One of the friends I knew before I came to Bangalore told me that if you don’t have an agenda of why you want to come here then this place is going to grow on you. I wonder how it will be.</li>
</ul>

<p><strong>⌛ 02:10 AM</strong></p>
<ul>
  <li>Time to crash, but went down and read <a href="https://amzn.to/3MGnki6">Name, Place, Animal, Thing 📖</a>.</li>
</ul>

<p><strong>⌛ 09:00 AM</strong></p>
<ul>
  <li>Woke up, and did the usual. Work started early due to a few P0 issues.</li>
  <li>Market low. “Buy on dip.”</li>
</ul>

<p><strong>⌛ 01:40 PM</strong></p>
<ul>
  <li>Lunch @ <a href="https://goo.gl/maps/ArLRMjDa7SvXbqVF9">Baghiraa Cafe &amp; Co Work 📍</a> with Naveen 🙋‍♂️.</li>
  <li>Bought a few personal care items (shower gel, and moisturizer) from <a href="https://goo.gl/maps/SZX1YUMHdsbxfFSHA">Farm Fresh 📍</a>.</li>
  <li>Came back to the hostel. Of course, I was using scooty for all the transportation.</li>
</ul>

<p><strong>⌛ 06:00 PM</strong></p>
<ul>
  <li>Biked 🚲 to <a href="https://goo.gl/maps/z5KNS3Uwcagjub766">Serenity Beach 📍</a>. Finished four laps today. 💪</li>
  <li>Came back on scooty with Santosh 🙋‍♂️ while Naveen 🙋‍♂️ cycled back.</li>
</ul>

<p><strong>⌛ 08:45 PM</strong></p>
<ul>
  <li>Ragavan 🙋‍♂️ came to pick me up. (Context: Ragavan is a dear friend from school. He was in Pondi today, so we planned to catch up.)</li>
  <li>We went to <a href="https://goo.gl/maps/JXV8DxeBv7bQkP729">Cafe 73 📍</a> on his cousin’s Bullet. Coincidentally, Cafe 73 is a bike-themed cafe. There is a bike on display when you enter. All the walls have some bike art. Sriram 🙋‍♂️ had suggested trying the non-veg burgers 🍔 here. Ragavan ordered The Ferrari F40 and I ordered The Turbo Charger. The burgers were succulent.</li>
  <li>We started with work and life. We discussed Pondi and nearby places. We discussed my city hopping plan. We ended with him inviting me to his place in Chennai on a weekend.</li>
  <li>We went to the <a href="https://goo.gl/maps/z5KNS3Uwcagjub766">Serenity Beach 📍</a> to relax. Sat on the rocks. Listened to the waves crashing against other rocks. Talked about random things.</li>
  <li>It took me back to 2018. Ragavan and I were in <a href="https://goo.gl/maps/y3nA3wvqi3LKqiCD8">Mysore 📍</a>. We were meeting after 6-7 years. We went to Mysore to attend the <a href="https://en.wikipedia.org/wiki/Mysore_Dasara">Mysore Dasara</a> celebrations. It ends with the whole palace lit up with <a href="https://en.wikipedia.org/wiki/Mysore_Dasara#Lightings_in_Mysore_Palace">100,000</a> light bulbs. After it ended, the workers started packing up all the chairs and tents. We were sitting on a pair of chairs in the last row. And we started going through 7 years’ worth of events from each other’s lives. Both of us vicariously experienced each other’s lives. It was an amazing end to our days.</li>
  <li>He dropped me back to <a href="https://goo.gl/maps/69tmUS8LN1iTwvwJA">Vagabond 📍</a>.</li>
</ul>

<p><strong>⌛ 11:00 PM</strong></p>
<ul>
  <li>Came to the terrace to write this log. Santosh 🙋‍♂️, Dinesh 🙋‍♂️, and Tanusha 🙋🏻‍♀️ were already here.</li>
  <li>Talked to a friend about product management and design.</li>
  <li>Six college students joined us. They were staying at the Airbnb.</li>
  <li>Suddenly, gales started. It rained a bit. And then there was a power cut. The climate became cool and refreshing. No humidity in the air.</li>
  <li>Asked Santosh 🙋‍♂️ what else defines him besides being a fitness coach. He took time before answering. His answer was: inventions, startups, and product design. He had taken a course on product design. He wants to build a product in the fitness domain.</li>
  <li>He has been traveling since November 2021. He had to come back from Dubai due to Covid-related incidents in his family. After settling everything, he couldn’t go back to Dubai. He started traveling: <a href="https://goo.gl/maps/EsN2hsXwJ37KUyY97">Rishikesh 📍</a>, <a href="https://goo.gl/maps/y3nA3wvqi3LKqiCD8">Mysore 📍</a> and now, here.</li>
</ul>

<p><br /></p>

<h3 id="-7th-may">📆 7th May</h3>

<p><strong>⌛ 02:45 AM</strong></p>
<ul>
  <li>Started working on this log.</li>
  <li>Off to sleep.</li>
</ul>

<p><strong>⌛ 12:30 PM</strong></p>
<ul>
  <li>Just woke up. Those six college students left.</li>
  <li>Asked Naveen what he is going to do after his CFA exam on the 24th. He answered that he is going to stay home and earn money till the results come. Mode of earning money: trading. He was confident about it. If he had the freedom, he would have been traveling all around to play all kinds of sports.</li>
  <li>He volunteered at <a href="https://g.page/SurfingIndia?share">Mantra Surf Club 📍</a> earlier this year. Coincidentally, my beginner surfing course was also at Mantra. We talked about the instructors and some surfing tricks. We may go surfing next week. ‘tis the surfing season in Pondi.</li>
  <li>He has also done the beginner and intermediate skiing courses from <a href="https://g.page/JIMWS?share">JIM &amp; WS 📍</a>. I have been trying to plan for the same course since last year. Hopefully, next year. 🤞🏼</li>
</ul>

<p><strong>⌛ 01:45 PM</strong></p>
<ul>
  <li>Lunch @ <a href="https://goo.gl/maps/A7Df1WL922ikqsu58">Neem Tree Cafe 📍</a> with Naveen. I had the Red Rice and Dosa meal and he had the chapati sabzi meal. It was disappointing. Both the meals had the same dishes. That meant I had to eat my dosa with lauki (gourd) instead of sambhar or some chutney. This restaurant was inside <a href="https://goo.gl/maps/QkdZL3P4HXQiGq859">Auroville 📍</a>. Not sure what to expect from the other Auroville restaurants on my list.</li>
  <li>Came back and slept again. Today was a slow day.</li>
</ul>

<p><strong>⌛ 06:00 PM</strong></p>
<ul>
  <li>Biked 🚲 to <a href="https://goo.gl/maps/z5KNS3Uwcagjub766">Serenity Beach 📍</a>. Finished four laps again. 💪</li>
</ul>

<p><strong>⌛ 08:15 PM</strong></p>
<ul>
  <li>Dinner @ <a href="https://goo.gl/maps/ArLRMjDa7SvXbqVF9">Baghiraa Cafe &amp; Co Work 📍</a> with Naveen and Santosh.</li>
  <li>We discussed workouts and nutrition with Santosh. I wanted to know how to tighten the skin around my abs. Answer: Planks and <a href="https://www.youtube.com/watch?v=zcCFmBkJxLM">Wall Press Dead Bug</a>. We talked about eating sugars. He also mentioned the importance of fat in the body. Apparently, even the leanest body requires 6-7% fat. Normal is ~20%. Looking at my physique, he said I would be under 20%. Anything above 15% and under 25% is good.</li>
  <li>I gotta take a Nutrition 101 course.</li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="city-hopping" /><summary type="html"><![CDATA[📆 28th April]]></summary></entry><entry><title type="html">All the Small Things</title><link href="https://trigonaminima.github.io/2022/04/tm-speech-3/" rel="alternate" type="text/html" title="All the Small Things" /><published>2022-04-19T00:00:00+00:00</published><updated>2022-04-19T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/04/tm-speech-3</id><content type="html" xml:base="https://trigonaminima.github.io/2022/04/tm-speech-3/"><![CDATA[<p>I delivered this speech on Sunday (17th Apr) at a Toastmasters Meeting. The objective was to introduce body language and vocal variety in my delivery.</p>

<!-- --- -->

<p><br /></p>

<p>A church! (Cross gesture) Jesus, pls help me find a good flat.
Ooh, ooh. A temple. (Praying hands) God, pls help me find a good flat within my budget.
Wow! A mosque too! (Praying hands) ya Allah, pls help na.</p>

<p>That was my first week in Bangalore.</p>

<p>Fellow TMs and guests. Have you ever longed for the past? Have you ever wished to experience a memory again? Have you ever felt nostalgic?</p>

<p>Nostalgia is a curious feeling. Most of the memories are trivial. But, they always bring a deluge of thoughts with them. Today, I am going to talk about three specific triggers from my life. First, music and how it makes me feel proud of myself. Second, the summer season and its relaxing effect on me. Lastly, my first week in Bangalore and how it motivated me to start a new thing.</p>

<p>So, come, let’s take a trip down my memory lane. (smile)</p>

<p>Music.</p>

<p>(Singing like a metal head) “Whiskey in the Jar-o” is a song by Metallica. I first heard that song at 2 am on the radio. I was “studying” for my board exams. Metallica reminds me of how I became a night owl. I have pulled off numerous all-nighters by now. For me, it is the best time of the day.</p>

<p>A friend introduced me to Coldplay. This band reminds me of my crush and how I was too chicken to pursue her. I am more daring now. And, I wish I could go back.</p>

<p>Finally, the Explosions in the Sky. This instrumental band accompanied me on all my late-night walks. Walking did not help me shed my chubbiness. But, it did have a therapeutic effect on me. To this day, whenever I want to clear my mind, I go on a late-night walk while listening to the Explosions in the Sky.</p>

<p>Songs of these bands always remind me of who I was. And where I am now. I am sure that the ambitious younger me would be elated to see the current “me,” equally ambitious. And that always makes me feel proud of myself.</p>

<p>Summer.</p>

<p>Since the first of April, the heat has picked up in Delhi, and my mum has been relentlessly cribbing about it. She says, irritated, “It is too hot! Summer is early this year. Why does it even exist?” I, on the other hand, enjoy summer. So we get into this banter about which season is best. Of course, I don’t need nostalgia to remind me of this banter. My mum does that every April.</p>

<p>But, summer reminds me of very distinct periods of my life.</p>

<p>Imagine this: The Sun is hot. You feel needles pricking you all over. You take your clothes off and jump in the cold, clean water of Yamuna. After a long dip, you go to a nearby field, pluck a big watermelon and gobble it up. Summer reminds me of that 7-year-old me.</p>

<p>School days. How many people here remember the games period? Huh? How about school buses? I used to walk home from school. Hopping from the shadow of one tree to another, only stepping into the sun when there was no shadow remaining. Upon reaching home, I used to build cool stuff from the broken things at home. Summer reminds me of those carefree school days.</p>

<p>Summer also reminds me of the 1st and 2nd waves of Covid. I stayed in Bangalore. And so did my flatmates. We cooked together. My chapati making skill is top-notch now. We played inside our home. We worked out together. One flatmate and I shaved our heads together. And the cherry on top was how we ended the summer with a carefully planned road trip to the Western Ghats.</p>

<p>Summers are truly wonderful.</p>

<p>My first week in Bangalore.</p>

<p>It happened on the 3rd or 4th day. I was flat hunting and decided to explore Bangalore on foot. Seeing those three holy places in the vicinity of each other made me realize how incredible India is. And, as it turned out, it was an equally incredible experience for me.</p>

<p>I am about to start an experiment in the coming weeks. It is going to require a titanic change in my lifestyle. Frankly, I am scared to even think about it. Thankfully, this fear helped me. It reminded me of my first week in Bangalore. I remembered how much I enjoyed being on my own. For the first time, I was not under my parents’ care. I was making all the decisions. Instead of being afraid, I was super excited! Like the cold seeps through a cracked door, excitement seeped into me. That unexpected nostalgic moment motivated me to go ahead and take that step.</p>

<p>It is the smallest thing, eating a watermelon at a riverbank. But it is shocking how I want to do it again and again today. Likewise, I could have never imagined that my chaotic first week in Bangalore would motivate me in future.</p>

<p>We seldom get to experience nostalgia in our busy lives now. Hence, here is something I want you to do after this meeting. Go reminisce about the past. A past that fills you with warm and fuzzy feelings. And if you feel you do not have such memories, I implore you to make them. It does not matter how significant or insignificant they are.</p>

<p>Because you never know which tiny event would make you smile, help you discover yourself, or later motivate you to take the next step.</p>]]></content><author><name>Shivam Rana</name></author><category term="communication" /><summary type="html"><![CDATA[I delivered this speech on Sunday (17th Apr) at a Toastmasters Meeting. The objective was to introduce body language and vocal variety in my delivery.]]></summary></entry><entry><title type="html">Toastmasters Speech with a Purpose</title><link href="https://trigonaminima.github.io/2022/03/tm-speech-2/" rel="alternate" type="text/html" title="Toastmasters Speech with a Purpose" /><published>2022-03-10T00:00:00+00:00</published><updated>2022-03-10T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/03/tm-speech-2</id><content type="html" xml:base="https://trigonaminima.github.io/2022/03/tm-speech-2/"><![CDATA[<p>Today, I am going to discuss my second Toastmasters speech. The first speech is here.</p>

<p>The second project was to deliver a speech having a purpose. The Icebreaker project (1st speech) gets you started on the path of public speaking. The second project makes you understand the importance of purpose in any communication.</p>

<p>The upcoming section contains my final speech. I expand on the purpose and its importance after that. Then I discuss my takeaways from this project, followed by the conclusion.</p>

<p><br /></p>

<h2 id="explore-or-exploit">Explore or Exploit</h2>

<p>Can I see a show of hands if anyone here orders food from Swiggy or Zomato? (wait) Thank you.
Now let me see a show of hands from those who ALWAYS order from the same restaurant. (wait) [No one. That is obvious. Right? You will also try out new restaurants.]
This is the last one. Let me see a show of hands from those who ALWAYS try out a new restaurant. (wait) (smile) [That was another crazy question.]</p>

<p>Most of us are the same: we either want to order from our safe zone or try something from uncharted territory. Listening to it in this context may sound obvious. After all, you should have both choices available to you. How else are you going to find a new place?</p>

<p>Unfortunately, we often miss this in the context of our life. Maintaining a balance between exploration and exploitation is the message I want to convey today.</p>

<p>I started exploring cycling when I was 9 or 10 years old. I did not plan on being the third wheel between my cousin and his bike. So, I learnt it in a few days and switched into the exploit mode of riding it daily.</p>

<p>After gaining confidence, I decided to ride my elder brother’s bike. The next day, in the early morning, I pondered about taking the risk. Why would it be risky? Because when on the bike, my toes never reached the floor.</p>

<p>You know, the urge to do something that people older than you are doing can make a child do anything. I decided to take the risk.</p>

<p>I could maintain the balance. As I could only pedal in half circles, momentum was building up slowly. Now, of course, something bad happened. I suddenly hit one of those nasty speed breakers. The likes of which, along with your speed, also break you. After the impact, I found myself off the saddle and hanging on the frame. My feet were nowhere near the pedals and scrambling for support from the road. Since my toes could barely reach the road, I knew if I hit the brakes now, I would fall and hurt myself. At some distance, I saw a bunch of hay on the footpath. I manoeuvred the bike to break my fall on that cushion. There were a few laughs from the bystanders. I also laughed with them.</p>

<p>In retrospect, it was a tiny incident, but for a 10-year-old kid, it was a confidence booster. I remember feeling proud of myself for coming out of it without any injury. And I was glad I decided to explore riding on my brother’s bike.</p>

<p>The explore-exploit equation gets warped at school.</p>

<p>You are required to study a fixed set of subjects. Under the influence of that mandate, I never thought to explore any of my subjects. That thinking changed when a tutor got me into the habit of reading my textbooks. After that, subjects were not something that I had to study. They became different avenues for me to explore.</p>

<p>My favourite subject was physics. I was obsessed with Einstein and Feynman. I loved reading about the theory of relativity. Schrödinger’s cat made atoms and waves exciting. Concepts of thermodynamics made life practical. I even envisioned the future Shivam building a successful nuclear fusion reactor to satisfy all our clean electricity needs.</p>

<p>I was also caught by the outer space bug. Here is a piece of trivia that blew my mind: you witness the past when you see the stars at night. That is because they are so far away that their light took years to reach you.</p>

<p>Then I got to explore, not study, but explore computer science in 11th grade. You could code anything up if you had the creativity and a logical mind. I was immediately hooked. I romanticize those years as a time when I had a love triangle with Physics and Computer Science.</p>

<p>Of course, every good thing comes to an end. When it was time for college, I defaulted to exploitation. The reality, seen through the eyes of peers, teachers, and parents, persuaded me to opt for Computer Engineering.</p>

<p>The thing is, I have never regretted this decision. There were numerous subfields to explore within computer science. My career shaped into what it is today because of those explorations.</p>

<p>I started my Computer Science engineering in the year 2012. Data Science and Artificial Intelligence started booming. Data Science suggests that we can solve problems in any domain when simple maths is applied to data. That was fascinating to me. It also had many concepts of Physics, my first love, mixed in with it.</p>

<p>I explored many interesting subfields in computer science before settling for Data Science. When the question came to decide on a career, I had already covered a lot of ground in Data Science. And its interdisciplinary nature compelled me to exploit it further.</p>

<p>I give full credit to my explorations for Data Science as a career choice. It has given me exposure to different industries and how the world works. And deciding which industry to try next is also always driven by my drive to explore.</p>

<p>I am not talking about Covid here, but I feel we are undergoing a pandemic where people lack the approach to find what they can exploit. And that is because they do not explore enough.</p>

<p>I know many people who have taken this framework to the extremes and are incredibly successful and satisfied.</p>

<p>So my message for you today is to explore as much as you can. The moment something enthrals you, go into exploitation mode. But never stop exploring. Exploration is what makes you, you.</p>

<p><br /></p>

<h2 id="purpose-and-its-importance">Purpose and its Importance</h2>

<p>Books have a purpose; this blog post has a purpose; like every communication, a speech also has a purpose. We are instinctively aware of this whenever we say something.</p>

<p>Nevertheless, I was perplexed when I started writing this speech. I had a vague idea of what I wanted to talk about, but I could not articulate it. The response from my mentor did not help either. The project resources provided by Toastmasters succoured me.</p>

<p>According to the guide, every speech has a <strong>general purpose</strong>: to <em>inform</em>, <em>persuade</em>, <em>entertain</em>, or <em>inspire</em>. Every speech also has a <strong>specific purpose</strong>: one sentence that summarizes its objective. The general purpose of my speech was <strong>to inspire</strong>. The specific purpose was: <strong>the more you explore, the more you build your character</strong>.</p>

<p>This framework of defining the purpose enabled me to write clearly. I knew how I wanted the audience to perceive my speech. I wrote all the paragraphs and transitions between them with my purpose in mind. I enjoyed writing this speech.</p>

<h2 id="takeaways">Takeaways</h2>

<h3 id="posture">Posture</h3>

<p>My posture was good throughout the speech. I had a <strong>natural presence</strong> and <strong>felt confident</strong>.</p>

<p>Unlike during the first speech, my neck was straight. Practising in front of the mirror probably helped there. This time around, I observed that my shoulders were not level. It is a tiny thing and may not be noticeable to the audience. I will work on this during my next practice.</p>

<p>I was also <strong>tiptoeing</strong> during many parts of the speech. My head bobbed up and down a few times. It was unnecessary.</p>

<p>One piece of feedback I got from the evaluator was that I looked away from the camera multiple times. It was not distracting, but it was noticeable. It coincided with the parts where I could not remember the content. It also happened during the last speech. The only solution is to practice more.</p>

<p>The second piece of feedback was a lack of hand gestures. I was so focused on remembering the content that I did not think about gestures. My hands were also not in the camera view, so the audience missed the few natural hand gestures that I had towards the end. I need to work more on hand gestures.</p>

<h3 id="setup">Setup</h3>

<p>I wore formals. I stood up to deliver the speech. The evaluator also mentioned that both the light and the camera angle were good. Notwithstanding, I may have to rethink the setup. My natural hand gestures were not visible because of the camera’s field of view. I will need to increase the distance between the camera and me.</p>

<h3 id="facial-expression">Facial Expression</h3>

<p>I did not have a dry throat this time. Thus, there were no weird expressions. I was blank at some places. Otherwise, I was mostly smiling.</p>

<h3 id="pace">Pace</h3>

<p>I had the impression that I speak fast. After two speeches, I can say that I have an average pace. The pauses between the sentences and paragraphs are also apt. I neither speak at a snail’s pace nor rush.</p>

<p>Occasionally, I quickly go through a sentence and mess up. Following are the instances I observed:</p>

<ul>
  <li><em>Bystanders</em> sounded like <em>bystandards</em> (or, I may have said bystandards)</li>
  <li><em>Fixed set of subjects</em> sounded like <em>fix ed of subjects</em></li>
  <li><em>Every good thing comes to an end</em> - did not fumble, but it could have been slower.</li>
  <li>I fumbled at <em>regretted</em> while speaking <em>the thing is, I never regretted that decision</em></li>
  <li>I fumbled at <em>Artificial Intelligence</em> while speaking <em>Data Science and Artificial Intelligence started booming</em>.</li>
  <li>I did not emphasize while ending this sentence: <em>proud of myself after handling this situation without any injuries</em>.</li>
</ul>

<h3 id="time">Time</h3>

<p>The goal was to take five to seven minutes. Unfortunately, I was way over the time limit. Despite skipping a few sentences, I spoke for a whopping nine minutes. It happened because of a lack of insight into my speaking style. If I had timed myself while practising, I would have noticed it. I decided against it because I wanted to practice the delivery and retention of the content.</p>

<p>I wrote the speech assuming a high pace of speaking. It had more than 900 words. I expected to cover all the material in under seven minutes. I could not. Along with the content length, I also missed accounting for pauses between sentences. I will shorten the write up in future speeches. My mentor concurred.</p>

<p>The evaluator suggested that I start wrapping up when the yellow timer card comes up. That will push me to cover the relevant parts and finish on time.</p>

<h3 id="filler-words">Filler Words</h3>

<p>This time around, I used fewer crutch words (28 this time vs 34 last time). The use of uh remained the same, though. :/ I also noticed a new filler: taking a nose breath (as if I had to clear the sinuses).</p>

<div class="rendered_html">
<table>
<thead>
  <tr>
    <th>Uh</th>
    <th>Like</th>
    <th>Um</th>
    <th>Nose breath</th>
    <th>And</th>
    <th>But</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>12 times</td>
    <td>5 times</td>
    <td>4 times</td>
    <td>4 times</td>
    <td>2 times</td>
    <td>1 time</td>
  </tr>
</tbody>
</table>
</div>

<h3 id="a-mixture-of-the-tenses">A Mixture of the Tenses</h3>

<p>I mixed the tenses only once. Yay.</p>

<p>I mixed up <em>could</em>, <em>would</em>, and <em>will</em>. Hyperlinked are refreshers on <a href="https://www.butte.edu/departments/cas/tipsheets/grammar/would.html">could, would, and should</a> and <a href="https://www.inenglishwithlove.com/blog/difference-between-will-and-would">would vs will</a>.</p>

<p>It looks like I make grammatical mistakes when I am recounting past stories. Practice, practice, practice!!</p>

<h3 id="opening">Opening</h3>

<p>The opening of any speech should be grand. It can be a question, a joke, or an eyebrow-raising statement. It will make the audience listen to you.</p>

<p>I created that effect.</p>

<p>The evaluator mentioned that my questions were very engaging. They were engaging because of the following two reasons:</p>

<ol>
  <li>Ordering food online is ubiquitous today. Everyone could respond to the questions.</li>
  <li>The relationship between the questions and the theme was not apparent. That compelled the audience to listen to the next part of my speech.</li>
</ol>

<p>Here is a little truth about that opening: it was not my first choice.</p>

<p>Initially, I had written a technical opening connecting the Recommendation Systems with the topic. I thought it would grab their attention. Unfortunately, my mentor disagreed and suggested changing it.</p>

<p>Then I wrote a dialogue between me and my friend about how exploration and exploitation emerge while ordering food online. According to my mentor, it would work if I were an experienced speaker. As I am a beginner, I would likely butcher the dialogue delivery. Consequently, I settled on asking the audience questions.</p>

<p>My mentor hinted that I should think of the opening as a grand event. I will write and present it with more effort if I consider it paramount.</p>

<h3 id="delivery">Delivery</h3>

<p>Here is the paraphrased version of feedback from my mentor: Initially, I was not sure why I was doing it. Then I gained the confidence and delivered it confidently.</p>

<p>It was my second speech. The environment was very different from the first speech. So I was shaky in the beginning. The audience also surprised me. The replies to my questions were unexpected. I had to make a few changes on the fly.</p>

<p>My mentor says that surprises are inevitable. I should make impromptu changes based on the reactions of the audience.</p>

<p>The evaluator commended me on delivering it naturally. It looked like I was comfortable on the virtual stage. My smiling face helped. The more crucial factor was how my speech came across. To my evaluator, we were talking during my speech. Rather than being a performance, it came across as a conversation with the audience. It points in the direction that I connected with my audience. That too on a virtual stage.</p>

<p>My mentor advised maintaining my conversational delivery. At the same time, I should make it impactful. The impact will come if I put more power in my voice. And consistently maintain it. A guiding principle is to make my voice reach the people in the (imaginary) last row.</p>

<h3 id="content">Content</h3>

<p>The evaluator liked my stories. They particularly enjoyed the flow of topics with the theme: food, cycling, school subjects, science, computer science, and data science. The audience found the analogies pleasing. They found the sprinkled humour diverting (special mention for the <em>love triangle</em> bit). The evaluator also mentioned how technical things mixed with the stories made the speech interesting for <em>non-science</em> people. It was surprising to see myself excelling at telling stories because I consider myself mediocre at it.</p>

<p>Lastly, one audience member remarked that she was motivated to explore more. It made me feel that my speech successfully inspired at least one person.</p>

<h3 id="topic">Topic</h3>

<p>The speech was about exploring things, exploiting the right ones, and then exploring those in depth. So the speech title should have been <strong>Explore and Exploit</strong> instead of <strong>Explore or Exploit</strong>.</p>

<h3 id="speaking-vs-writing">Speaking vs Writing</h3>

<p>My mentor gave me an interesting outlook when I asked him about improving the language of my content and delivering it better.</p>

<p>There is a difference between our writing and speaking patterns. Exceptional speakers are consistent in the way they speak and write. Novice speakers like me write in a manner that is miles away from how we speak.</p>

<p>Written content has finesse because we spend more time on it. My blog posts go through multiple rounds of reviews before I hit publish. So, the first step towards having a quality language is to bring my speaking level up to my writing level. Right now, I go off script quite a lot. I have to do that sensibly and not because I forgot what I wrote.</p>

<p>The next frontier is to improve <em>how</em> I write. My mentor encouraged me to discover my ideal writing. It will give me a sense of the effect I want to achieve with each sentence.</p>

<p>The pertinent question here is, how do we find this impactful writing ideal? I have to expose myself to a variety of authors and written content. It is not limited to books. It can be book reviews or other literary articles. For example, I have subscribed to <a href="https://lithub.com/">Literary Hub</a> to get ideas about quality sentence formations. Gradually, I will build a repository of these ideas. Employing them in my writing will take me towards my ideal.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I delivered my second speech. This time my speech was supposed to have a clear purpose. My speech was about the theme of exploration and exploitation in life.</p>

<p>I reproduced all the positive things from last time: camera angle, attire, and expressions. I came across as natural. My pace was right. Pauses between the sentences were natural. I used fewer fillers than the last time. There were a few grammatical mistakes, but nothing major.</p>

<p>My opening was engaging and relatable to the audience. The flow of the speech kept the audience interested. Stories and humour were also well received. I found out that I am good at storytelling. My mentor helped me understand how I can improve the quality of my writing and speeches.</p>

<p>Key improvement areas for me are:</p>

<ul>
  <li>Time management. I need to finish my content within the stipulated time.</li>
  <li>Practice a lot more.</li>
  <li>Bring power in my voice and maintain it throughout the speech.</li>
  <li>Bring speaking level closer to writing level.</li>
</ul>

<p>My third speech is about vocal variety and body language in my delivery. I am looking forward to more discoveries.</p>]]></content><author><name>Shivam Rana</name></author><category term="communication" /><summary type="html"><![CDATA[Today, I am going to discuss my second Toastmasters speech. The first speech is here.]]></summary></entry><entry><title type="html">Toastmasters Icebreaker Speech</title><link href="https://trigonaminima.github.io/2022/02/tm-speech-1/" rel="alternate" type="text/html" title="Toastmasters Icebreaker Speech" /><published>2022-02-19T00:00:00+00:00</published><updated>2022-02-19T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/02/tm-speech-1</id><content type="html" xml:base="https://trigonaminima.github.io/2022/02/tm-speech-1/"><![CDATA[<p>Last year in November, I joined <a href="https://www.toastmasters.org/">Toastmasters</a> (TM) to build my communication and leadership skills. One part of the TM is the prepared speeches. You have to prepare and deliver a short speech in front of an audience. You also receive feedback from an evaluator.</p>

<p>I delivered my first speech on 13th Feb 2022. (Yeah, I procrastinate a lot.)</p>

<p>The first speech is supposed to be an icebreaker. You are just starting, and thus, you first tell your club members about yourself. It is supposed to be 4 to 6 minutes long. Assuming a pace of 120 to 140 words per minute, that comes to around 500 to 800 words.</p>

<p>My first draft had a little under 1000 words. My mentor asked me to remove two paragraphs. Thus the final speech had somewhere over 800 words.</p>

<p>This post is about the speech and my takeaways.</p>

<p><br /></p>

<h2 id="should-i-stay-or-should-i-go">Should I Stay, or Should I Go?</h2>

<p>I always loved the very first exercise of my school textbooks. They were effortless. I thought my first speech would be the same. Little did I know, Icebreakers are supposed to be about you, and I am always at a loss for words when asked to describe myself. The only time I wrote a bio was on my Tinder profile. I have a suspicion as to why I never got any matches. :)</p>

<p>A few days before Christmas last year, I was oscillating between technical writing and going for a 6-day cycling trip from Dandeli to South Goa. Both had utterly compelling arguments favouring them. After some procrastination and over-thinking, I finally came to my classic quandary: “Should I stay, or should I go?”</p>

<p>And this has been the theme of my life. Today, I wish to show you how my procrastinating, curious, impatient, autodidactic, confident (likely over-confident), risk-taking, and adventurous personality always brings me back to this question.</p>

<p>I was born in MP and grew up in UP and Delhi. A place called Baghpat in UP is my hometown. That is where I learned cycling. I was 9 or 10 years old. Initially, I was just a third wheel between my cousin and his bicycle. I wanted to ride a bike as well. So, I rented the smallest bicycle for a rupee an hour and learned it on my own in 3 days. After that, I had daily trysts with that rental bike.</p>

<p>With time, I got more confident. One day, I thought of riding my elder brother’s bike. The next day, I got up early in the morning and thought for a few minutes: should I wait for a few months, or should I go and take the risk? Why would it be risky? Because when on the bike, my toes would not reach the floor. I decided to take the risk.</p>

<p>I was maintaining the balance. As I could only pedal in half circles, momentum was slowly building up. Now of course something bad happened. I suddenly hit a speed breaker. And at that moment, I found myself sitting on the frame instead of the saddle, my toes barely touching the road. I knew if I hit the brakes now, I would fall and hurt myself. At some distance, I saw a bunch of hay on the footpath. I manoeuvred the bike to break my fall on that cushion. There were a few laughs from the bystanders. I also laughed with them.</p>

<p>That is how my childhood was. After 5th grade, my family moved to Delhi. That is where I spent all my teens and early twenties. People call that time the most impressionistic time. It was certainly true for me.</p>

<p>[I liked Biology, and because of that, I wanted to become a Doctor. My dad is a PhD in Stats. He and a few of his friends convinced me to take Computer Science (CS) elective in 11th grade. I started liking CS, but I was sad leaving biology behind.] – <em>dropped to reduce the number of words</em></p>

<p>My favourite subject, physics, led to me being obsessed with Einstein. I learned German to read some German texts authored by him. Unfortunately, reading was not easy even after learning spoken German. But, hey, I can at least say good day and introduce myself in German. :D</p>

<p>During those days, I developed my taste in instrumental music. Newfound interest and inspiration from Einstein made me curious about the violin. Impulsively, I bought the instrument and started learning it myself. Sadly, my impatience made me give it up after a few months. I still have that violin, and I gently weep whenever I see it.</p>

<p>[My love triangle with Physics and Computer Science made it fun when it was time for college. How do you decide between aspiring to work on nuclear physics or astrophysics and solving complex problems through a computer? Reality prevailed, and I opted for CS.] – <em>dropped to reduce the number of words</em></p>

<p>I started my Computer Science engineering in 2012. Data Science and Artificial Intelligence were booming. The idea that with simple maths applied to data, you can solve problems in any domain was just fascinating to me. In addition to Data Science, I also found a few CS subfields interesting. Ultimately, the question came to decide between software engineering and data science. Its interdisciplinary nature made me go for a career in Data Science.</p>

<p>Being a Data Scientist has given me exposure to many domains. In my 5+ years of work experience, I have applied my skills in the supply chain, fraud consulting, banking, and e-commerce industries. Deciding what to try next has not been easy. Staying at the current place has many perks, but going to the next one also has many exciting opportunities.</p>

<p>I have confronted this question of whether to stay or to go umpteen times. My observation is that I have rarely regretted my choice. And whenever it has not been favourable, I could always find my way out. But mostly, it has been diverting.</p>

<p>Just two days before the event, I decided to go on that Dandeli to Goa cycling trip, and it was the most exciting new year I have ever spent. I met a lot of strangers, many of whom have become my comrades now.</p>

<p><br /></p>

<h2 id="takeaways">Takeaways</h2>

<h3 id="time">Time</h3>

<p>It took me 6 minutes and 34 seconds to deliver my speech. That is 34 seconds over the allotted time. This time includes extra time because of the following elements:</p>

<ul>
  <li>Internet outage for a few seconds</li>
  <li>I fumbled a few times because I could not remember the words.</li>
  <li>A few long pauses because I was dehydrated</li>
  <li>A lot of filler words</li>
</ul>

<h3 id="setup">Setup</h3>

<p>Following is what I did to give the speech a feel of the stage performance:</p>

<ul>
  <li>I was standing while delivering the speech. It emulated the environment of speaking on a physical stage.</li>
  <li>I wore formals - a shirt and a pant. The audience only saw the shirt. It felt very unnatural delivering the speech in this attire.</li>
  <li>I had put the camera at eye level. If my camera had been lower, it would have looked like I was looking down upon the audience. They would also have been looking up my nose. Having the camera at eye level makes the audience feel like I am looking at them and connecting with them.</li>
</ul>

<p>The feedback that I was confident and comfortable on the virtual stage validated this setup.</p>

<h3 id="posture">Posture</h3>

<p>I had a good stage presence because of my prep on the setup. I also came across as feeling natural and confident on the screen. Following are a few observations that I made about my posture:</p>

<ul>
  <li>My head had a tilt to the side during the first 2-3 minutes. Not sure why I was doing that. The takeaway for me was to practice keeping the posture straight above the shoulders.</li>
  <li>I was not comfortable delivering the speech while standing. I need to practice this further.</li>
  <li>Whenever I could not remember stuff, I looked at the ceiling, avoiding the screen. Of course, I need to remember my content better and practice it many times.</li>
</ul>

<h3 id="facial-expression">Facial Expression</h3>

<p>I was smiling throughout the speech. I feel that it was probably because the content was about me. Nevertheless, smiling gave my delivery a feeling of confidence and candidness. The takeaway is to have a personal element in future speeches so that my smile is natural.</p>

<p>Since the speech was about me, I did not feel a lack of confidence about the content. That probably contributed to how confident I was while speaking.</p>

<p>My facial expression changed at many places:</p>

<ul>
  <li>Weird appearances at places because I was thirsty and could not speak properly. Of course, hydration is a part of it. But I wonder if it could also be my nervousness. I will get to know more about this in future speeches.</li>
  <li>I was trying to gather saliva to make it easy to speak. I produced unnecessary sounds a few times.</li>
  <li>My lips were not level while speaking. It did not happen much, but it coincided with the head tilt. I will need to practice more.</li>
</ul>

<h3 id="pace">Pace</h3>

<p>The usual pace of speaking is 120 to 140 words per minute. Mine was within the range. I think I can increase it a bit, but that’s not a deal-breaker. There were a few areas of improvement I noticed:</p>

<ul>
  <li>I unknowingly paused at random places. Probably because of a lack of fluency in my thoughts.</li>
  <li>I mispronounced a few words, mostly because I rushed through them.
    <ul>
      <li><em>Suspicion</em> sounded like <em>suspension</em></li>
      <li><em>Technical</em> sounded like <em>pechnical</em> (there is no such word)</li>
      <li><em>Quandary</em> sounded like <em>quantry</em> (there is no such word)</li>
      <li><em>grew</em> sounded like <em>greeu</em></li>
      <li><em>trysts</em> sounded like <em>treests</em> (similar-sounding, but I should have emphasized the word more)</li>
      <li><em>impressionistic</em> - fumbled on this</li>
      <li><em>diverting</em> sounded like <em>divertee</em></li>
    </ul>
  </li>
</ul>

<h3 id="humour">Humour</h3>

<p>Most of the humorous parts got the response I had expected.</p>

<p>The delivery of the Tinder bit was proper. Most of the audience laughed at it.</p>

<p>The third wheel pun was subtle, but a few members got it. Although the delivery was correct, I messed up at the start.</p>

<p>The <em>gently weep at violin</em> bit made very few members smile. I think my expressions or voice could have improved the delivery. I am not sure how to do that yet.</p>

<p>The part where I discussed the risk of riding my brother’s bike could also have been delivered better. Then again, it may simply not have been funny.</p>

<h3 id="filler-words">Filler Words</h3>

<p>I believed that I sparingly used filler words. I was mistaken. I used MANY filler words. Following are the stats:</p>

<div class="rendered_html">
<table>
<thead>
  <tr>
    <th>Uh</th>
    <th>Like</th>
    <th>Yeah</th>
    <th>Ah</th>
    <th>So</th>
    <th>Um</th>
    <th>And</th>
    <th>But</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>12 times</td>
    <td>6 times</td>
    <td>4 times</td>
    <td>3 times</td>
    <td>3 times</td>
    <td>3 times</td>
    <td>2 times</td>
    <td>1 time</td>
  </tr>
</tbody>
</table>
</div>

<p>There were many instances where words like <em>so</em>, <em>like</em>, <em>and</em>, and <em>but</em> were not fillers. The counts above exclude those usages. My first goal is to eliminate the filler <em>uh</em>.</p>

<h3 id="incomplete-sentences">Incomplete Sentences</h3>

<p>I did this 2-3 times. I started with a sentence that was not how I had written it. Midway, after remembering the exact words, I abruptly began with the original sentence.</p>

<p>It did not harm the message, but it hampered the fluency of the speech.</p>

<p>The solution is to either remember exact words from the speech or be good enough to continue with the paraphrased sentence. I am going to focus on the former for now.</p>

<h3 id="a-mixture-of-the-tenses">A Mixture of the Tenses</h3>

<p>Whenever I recount an episode from the past, I mix the present tense with the past. I miss this when writing blog posts as well. The lack of consistency can confuse the listeners. It also weakens the message. A few places where I noticed this happening were:</p>

<ul>
  <li><em>Which <strong>was</strong> my hometown</em> should have been <em>which <strong>is</strong> my hometown</em>.</li>
  <li><em>A speed breaker <strong>comes</strong></em> should have been <em>a speed breaker <strong>came</strong></em>.</li>
  <li><em>I <strong>am</strong> hanging on the frame</em> should have been <em>I <strong>was</strong> hanging on the frame</em>.</li>
  <li><em>I <strong>am raking</strong> my brain</em> → <em>I <strong>raked</strong> my brains</em>.</li>
</ul>

<h3 id="awkward-english-usage">Awkward English Usage</h3>

<p>There were many instances where I made sentences awkward while delivering. There were also grammar mistakes.</p>

<ul>
  <li><em>Utterly compelling <strong>things</strong></em> could have been <em>utterly compelling <strong>arguments</strong></em>.</li>
  <li><em>I was on saddle on that bike</em> could have been <em>I got on the saddle of that bike</em>, or <em>I got on the saddle</em>.</li>
  <li>I pronounced <em>there’s</em> like <em>theres</em> instead of <em>there is</em>. I feel the latter would have sounded better.</li>
  <li>Same with <em>I’ll</em>. Say, <em>I will</em>.</li>
  <li><em>I was taking <strong>slowly slowly</strong> moving my bike</em>. It could have been: <em>my bike was moving forward <strong>slowly</strong>.</em> Redundant words do not add any new information to the message.</li>
  <li><em>I slipped from my saddle my feet slipped from my pedal</em>. There are two corrections. First, it should have been <em>pedal<strong>s</strong></em>. Second, a better phrasing would have been: <em>I slipped from the saddle. My feet also lost their grip on the pedals</em>.</li>
  <li><em>Data science and AI <strong>was</strong> booming</em> should have been: <em>Data Science and AI <strong>were</strong> booming</em>.</li>
  <li><em>CS fields, which again <strong>has</strong> a few good applications</em> → <em>CS fields, which again <strong>have</strong> a few good applications</em>.</li>
</ul>

<h3 id="content">Content</h3>

<p>The audience found the topic interesting. The chronology was simple to follow: childhood, teenage years, and adulthood. Members also enjoyed the stories woven into the theme. In the anonymous feedback, someone gave me a rating of 3/5.</p>

<blockquote>
  <p>Engaging and interactive for the first attempt, could reduce the number of fillers in upcoming speeches.</p>
</blockquote>

<blockquote>
  <p>This was a great speech and got to know a lot about you and your interest.</p>
</blockquote>

<blockquote>
  <p>Absolutely loved your speech.. a wonderful story woven around an interesting theme that gave us a sneak peak into your journey so far. Looking forward to your future speeches. All the best!</p>
</blockquote>

<p>These comments told me that the content was suitable. An experienced member of the club gave me actionable feedback.</p>

<blockquote>
  <p>I could connect to your cycle incident and to the violin. Do you have more of those defining moments in your life? You should try and string them together. Rather than treat it as an interview where you have to state facts about you. People want to know you and not your resume. I guess the two stories stood out to me. And they way you kind of started recounting them.</p>

  <p>You could potentially work on thinking how you can make the audience not just hear it but experience it. That is, by transporting them to the incident. Create the environment by hinting the audience senses. Talk about what they would have seen, what they would have smelt, heard, felt. Then they are not just hearing it but being a part of the experience. Then they will know you a lot more. And it will be extremely powerful. I think you are very capable of doing this. Just try it out.</p>
</blockquote>

<p>Essentially, he meant that I should improve my storytelling skills and immerse the audience in the story. Make them experience it vicariously.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The first speech to be delivered in Toastmasters is an icebreaker speech. I had prepared the stage. My speech content was somewhat engaging. According to the feedback, the delivery felt natural and confident.</p>

<p>I found many positives and areas of improvement in my delivery. I have identified the following actionable points:</p>

<ol>
  <li>Remember the content well. That will eliminate many language issues.</li>
  <li>Make the stories more immersive for the audience.</li>
  <li>Practice delivering my speech by standing in front of a mirror or a camera.</li>
</ol>

<p>I have also identified the points which I will keep doing more of:</p>

<ol>
  <li>Keep smiling. A natural smile comes when the content is personal and relatable to the audience.</li>
  <li>Maintain the current pace of speaking.</li>
  <li>Have humour and deliver it well.</li>
</ol>

<p>My second speech should be a speech with a purpose. In my icebreaker speech, the theme was only loosely tied to the stories. So, I have to take parts of it and rewrite them to reflect a purpose.</p>]]></content><author><name>Shivam Rana</name></author><category term="communication" /><summary type="html"><![CDATA[Last year in November, I joined Toastmasters (TM) to build my communication and leadership skills. One part of the TM is the prepared speeches. You have to prepare and deliver a short speech in front of an audience. You also receive feedback from an evaluator.]]></summary></entry><entry><title type="html">Improving data science productivity with dslib</title><link href="https://trigonaminima.github.io/2022/02/dslib/" rel="alternate" type="text/html" title="Improving data science productivity with dslib" /><published>2022-02-09T00:00:00+00:00</published><updated>2022-02-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2022/02/dslib</id><content type="html" xml:base="https://trigonaminima.github.io/2022/02/dslib/"><![CDATA[]]></content><author><name>Shivam Rana</name></author><category term="work" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Git Basics</title><link href="https://trigonaminima.github.io/2021/10/git-training/" rel="alternate" type="text/html" title="Git Basics" /><published>2021-10-16T00:00:00+00:00</published><updated>2021-10-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2021/10/git-training</id><content type="html" xml:base="https://trigonaminima.github.io/2021/10/git-training/"><![CDATA[<p>I created this introduction to Git for the Data Science team of Swiggy. I believe, the better we know Git, the more efficient our engineering processes will be. Data Scientists come from a variety of educational backgrounds. Many of them will not know anything about git. Many others will just have a working knowledge; they will use git commands, but they won’t really know what really happens. 
Observing this gap in the team, I decided to work on bridging it.</p>

<p>This is the first in a series of two presentations. It was a hands-on session: for each slide, I showed the changes in the terminal. I also used this amazing <a href="https://git-school.github.io/visualizing-git/#free-remote">Git tree visualization tool</a>.</p>

<object data="/assets/2021-10/Git Basics.pdf?#zoom=scale&amp;toolbar=0&amp;navpanes=0&amp;scrollbar=0&amp;view=,left" width="800" height="470" align="center" type="application/pdf"></object>

<hr />]]></content><author><name>Shivam Rana</name></author><summary type="html"><![CDATA[I created this introduction to Git for the Data Science team of Swiggy. I believe, the better we know Git, the more efficient our engineering processes will be. Data Scientists come from a variety of educational backgrounds. Many of them will not know anything about git. Many others will just have a working knowledge; they will use git commands, but they won’t really know what really happens. Observing this gap in the team, I decided to work on bridging it.]]></summary></entry><entry><title type="html">Minutes - Building the History Page</title><link href="https://trigonaminima.github.io/2021/08/flutter_app_3/" rel="alternate" type="text/html" title="Minutes - Building the History Page" /><published>2021-08-15T00:00:00+00:00</published><updated>2021-08-15T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2021/08/flutter_app_3</id><content type="html" xml:base="https://trigonaminima.github.io/2021/08/flutter_app_3/"><![CDATA[<p>In the last two posts [<a href="/2021/07/flutter_app_1/">1</a>, <a href="/2021/08/flutter_app_2/">2</a>], I had described the app I want for personal tracking and my progress on the app. I have to create five tabs: History, Dashboard, Tracker, People, and Settings. I have finished designing the Settings page. The Tracker page is in the making. Today, I am going to talk about the design of the History page.</p>

<h2 id="minutes---history-page">Minutes - History Page</h2>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-07/minutes-history.png" alt="" width="300" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Blank History screen</figcaption>
</figure>

<p>I had to build up the above blank screen to something similar to <a href="https://nomie.app/">Nomie’s</a> History page.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/nomie-history-tab.jpeg" alt="" width="300" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">A dated version of the Nomie History Page, copied from <a href="https://nomie.app/">Nomie website</a>.</figcaption>
</figure>

<p>I have only just started using Flutter and Dart for Minutes, so I did not know how to build the History page. I was not as lucky as I had been with the Settings screen, so I had to work through online tutorials and build it from scratch. The History page is a list of logs in reverse chronological order. I call each log a <strong>Record Card</strong>.</p>

<h2 id="record-card-particulars">Record Card Particulars</h2>

<p>It helps to list the broad data points that each Record card should show:</p>

<ol>
  <li>The time elapsed since the log was created.</li>
  <li>Creation timestamp.</li>
  <li>Settings for tracker record data - edit, copy, delete, share.</li>
  <li>Details of the Tracker - emoji, name, value.</li>
  <li>Note added to the log.</li>
  <li>People mentioned in the log.</li>
  <li>Context added to the log note. Context is something that adds metadata labels to the records. Think of it like hashtags.</li>
</ol>

<p>Nomie’s record card inspired all of these points. I’ll discuss them in detail in the Anatomy section.</p>

<p>The following section talks about layouts in Flutter. Feel free to skip to the <a href="#anatomy-of-the-record-card">anatomy section</a>.</p>

<h2 id="flutter-layout-concepts">Flutter Layout Concepts</h2>

<p>I had to understand all the topics below to build my Record card.</p>

<h3 id="listtile-vs-card">ListTile vs Card</h3>

<p>I found many tutorials that used <code class="language-plaintext highlighter-rouge">ListTile</code> instead of a <code class="language-plaintext highlighter-rouge">Card</code> to build a card-type widget. The <a href="https://api.flutter.dev/flutter/material/ListTile-class.html"><code class="language-plaintext highlighter-rouge">ListTile</code> class</a> might have complicated the card design, so I selected the <a href="https://api.flutter.dev/flutter/material/Card-class.html"><code class="language-plaintext highlighter-rouge">Card</code> class</a>. Besides, the <code class="language-plaintext highlighter-rouge">Card</code> class exists specifically to build cards. I also found a few good articles [<a href="https://blog.logrocket.com/building-a-card-widget-in-flutter/">1</a>, <a href="https://material.io/components/cards/flutter#card">2</a>] to help me create the card.</p>

<h3 id="column--row-combo-is-all-you-need">Column + Row Combo is all you Need</h3>

<p>A Card is a rectangular box whose child widget holds the complete card layout. That layout is built from the <a href="https://api.flutter.dev/flutter/widgets/Column-class.html">Column</a> and <a href="https://api.flutter.dev/flutter/widgets/Row-class.html">Row</a> classes. The child widgets of a <code class="language-plaintext highlighter-rouge">Column</code> are displayed vertically, top to bottom, while those of a <code class="language-plaintext highlighter-rouge">Row</code> are displayed horizontally, left to right. By combining these two classes, you can build quite complex layouts. Here is one such example from <a href="https://flutter.dev/docs/development/ui/layout">Flutter’s documentation on Layouts</a>.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/card-layout1.png" alt="" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Source: <a href="https://flutter.dev/docs/development/ui/layout#lay-out-multiple-widgets-vertically-and-horizontally">Layouts in Flutter</a></figcaption>
</figure>

<p>The outer red box is a <code class="language-plaintext highlighter-rouge">Row</code> widget that contains a <code class="language-plaintext highlighter-rouge">Column</code> (green rectangle) and an image widget. In the image below, this green <code class="language-plaintext highlighter-rouge">Column</code> widget becomes a vertical array of two <code class="language-plaintext highlighter-rouge">Text</code> and two <code class="language-plaintext highlighter-rouge">Row</code> widgets. The two inner <code class="language-plaintext highlighter-rouge">Row</code> widgets break into more <code class="language-plaintext highlighter-rouge">Row</code> and <code class="language-plaintext highlighter-rouge">Column</code> widgets.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/card-layout2.png" alt="" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Source: <a href="https://flutter.dev/docs/development/ui/layout#lay-out-multiple-widgets-vertically-and-horizontally">Layouts in Flutter</a></figcaption>
</figure>

<h3 id="main--cross-axes">Main &amp; Cross Axes</h3>

<p>The <code class="language-plaintext highlighter-rouge">Row</code> and <code class="language-plaintext highlighter-rouge">Column</code> widgets also bring the <code class="language-plaintext highlighter-rouge">Main</code> and <code class="language-plaintext highlighter-rouge">Cross</code> axes, which are convenient for aligning child widgets. The main axis runs in the direction in which the children are laid out (horizontal for a <code class="language-plaintext highlighter-rouge">Row</code>, vertical for a <code class="language-plaintext highlighter-rouge">Column</code>), and the cross axis runs perpendicular to it. Even after reading their description in the documentation, I couldn’t figure out how to use them. The following figure from <a href="https://flutter.dev/docs/development/ui/layout">Flutter’s documentation on Layouts</a> explains how the axes are oriented for each widget.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/column-row-axes.jpg" alt="" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Source: <a href="https://flutter.dev/docs/development/ui/layout#aligning-widgets">Layouts in Flutter</a></figcaption>
</figure>
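
<p>To make the axes concrete, here is a minimal sketch of a <code class="language-plaintext highlighter-rouge">Row</code> that contains a <code class="language-plaintext highlighter-rouge">Column</code>. The widget name <code class="language-plaintext highlighter-rouge">AxesDemo</code> and the placeholder texts are made up for illustration and are not part of the Minutes code.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

// Hypothetical demo widget, only meant to show how the axes behave.
class AxesDemo extends StatelessWidget {
  const AxesDemo({Key? key}) : super(key: key);

  @override
  Widget build(BuildContext context) {
    // In a Row, the main axis is horizontal and the cross axis is vertical.
    return Row(
      mainAxisAlignment: MainAxisAlignment.spaceBetween, // spread children horizontally
      crossAxisAlignment: CrossAxisAlignment.start, // pin children to the top
      children: [
        // In a Column, the axes swap: main is vertical, cross is horizontal.
        Column(
          mainAxisAlignment: MainAxisAlignment.center, // centre children vertically
          crossAxisAlignment: CrossAxisAlignment.start, // pin children to the left
          children: const [
            Text('Title'),
            Text('Subtitle'),
          ],
        ),
        const Icon(Icons.star),
      ],
    );
  }
}

void main() {
  runApp(const MaterialApp(home: Scaffold(body: AxesDemo())));
}
</code></pre></div></div>

<p>Changing only the <code class="language-plaintext highlighter-rouge">mainAxisAlignment</code> of the outer <code class="language-plaintext highlighter-rouge">Row</code> moves the children sideways, while the same parameter on the inner <code class="language-plaintext highlighter-rouge">Column</code> moves them up and down.</p>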

<h2 id="anatomy-of-the-record-card">Anatomy of the Record Card</h2>

<p>I will mark the Record card for every <code class="language-plaintext highlighter-rouge">Row</code> and <code class="language-plaintext highlighter-rouge">Column</code> block used. I will also go into detail about the displayed information and the code.</p>

<h3 id="record-card-layout">Record Card Layout</h3>

<p>The Record card has multiple folds of <code class="language-plaintext highlighter-rouge">Row</code> and <code class="language-plaintext highlighter-rouge">Column</code> widgets. The card itself is a <code class="language-plaintext highlighter-rouge">Column</code> widget that contains more blocks.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-column-fold1.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Each Record card is a single Column.</figcaption>
</figure>

<p>This <code class="language-plaintext highlighter-rouge">Column</code> widget contains five individual <code class="language-plaintext highlighter-rouge">Row</code> widgets stacked vertically as below.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-rows-fold2.png" alt="" width="450" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">The single Column is made up of 5 Row objects</figcaption>
</figure>

<p>The <code class="language-plaintext highlighter-rouge">Row</code> widgets from 3 to 5 are simple widgets that are not divided further. The following figure shows the breakdown of <code class="language-plaintext highlighter-rouge">Row 1</code> into a <code class="language-plaintext highlighter-rouge">Column</code> that contains two more <code class="language-plaintext highlighter-rouge">Row</code> widgets.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/row1-details.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Row 1 further broken into Column and Row.</figcaption>
</figure>

<p>And <code class="language-plaintext highlighter-rouge">Row 2</code> contains an inner <code class="language-plaintext highlighter-rouge">Row</code> that has a single <code class="language-plaintext highlighter-rouge">Column</code>.</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/row2-details.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption>
</figure>

<p>The blank space after the first <code class="language-plaintext highlighter-rouge">Row</code> is for other elements that might come later. I did not want to centre or stretch the card details over the whole <code class="language-plaintext highlighter-rouge">Row</code>.</p>

<h3 id="record-card-particulars---detailed">Record Card Particulars - Detailed</h3>

<h4 id="timestamp-and-settings-popup">Timestamp and Settings popup</h4>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-card-row-1.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto; border: 1px solid #000;" />
    <!-- <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption> -->
</figure>

<p>The first goal was to show the time elapsed since the log was created as a string such as now, 3 hours ago, 1 day ago, or 2 months ago. Instead of writing multiple conditional statements, I used the package called <a href="https://pub.dev/packages/timeago">timeago</a>. It gave me precisely the functionality I wanted. Just pass in the <code class="language-plaintext highlighter-rouge">DateTime</code> object, and it returns the formatted string.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">timeago</span><span class="o">.</span><span class="na">format</span><span class="p">(</span><span class="n">endTime</span><span class="p">);</span>
</code></pre></div></div>

<p>The second goal was to display the timestamp. I had to account for two types of Trackers: Counter and Timer. The Counter will have a creation time, and the Timer will have both the start and the end times.</p>
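
<p>A rough sketch of how the two cases could be told apart is below. The <code class="language-plaintext highlighter-rouge">Record</code> class, its field names, and the <code class="language-plaintext highlighter-rouge">timestampLabel</code> helper are assumptions made for this example; the actual Minutes model may look different.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical record model; the field names are assumptions.
class Record {
  final DateTime? startTime; // set only for Timer trackers
  final DateTime endTime; // creation time for a Counter, end time for a Timer

  const Record({this.startTime, required this.endTime});
}

// Formats a time as zero-padded HH:mm.
String _hhmm(DateTime t) =&gt;
    '${t.hour.toString().padLeft(2, '0')}:${t.minute.toString().padLeft(2, '0')}';

// A Counter shows one time, a Timer shows the start and the end.
String timestampLabel(Record r) {
  final start = r.startTime;
  if (start == null) return _hhmm(r.endTime);
  return '${_hhmm(start)} to ${_hhmm(r.endTime)}';
}

void main() {
  final counter = Record(endTime: DateTime(2021, 8, 15, 9, 5));
  final timer = Record(
    startTime: DateTime(2021, 8, 15, 9, 5),
    endTime: DateTime(2021, 8, 15, 10, 30),
  );
  print(timestampLabel(counter)); // 09:05
  print(timestampLabel(timer)); // 09:05 to 10:30
}
</code></pre></div></div>

<p>Keeping the start time nullable lets a single model cover both Tracker types, at the cost of one null check when building the label.</p>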

<p>The third objective was to have the three horizontal dots settings popup for each card. The popup menu was easy to create using the <a href="https://api.flutter.dev/flutter/material/PopupMenuButton-class.html"><code class="language-plaintext highlighter-rouge">PopupMenuButton</code> class</a>. I read various articles [<a href="https://flutteropen.gitbook.io/flutter-widgets/flutter-widgets-14-flutter-popup-menu-button">1</a>, <a href="https://codesinsider.com/flutter-popup-menu-button/">2</a>, <a href="https://medium.com/flutter-community/a-better-flutter-menu-b1472d24a">3</a>, <a href="https://stackoverflow.com/q/58144948/2650427">4</a>] to design and position the popup menu on clicking. By default, the <code class="language-plaintext highlighter-rouge">PopupMenuButton</code> gives you three vertical dots. Rotating the dots by 90 degrees did not work well. Then I found the <code class="language-plaintext highlighter-rouge">icon</code> parameter in the <code class="language-plaintext highlighter-rouge">PopupMenuButton</code> class and used the <a href="https://fonts.google.com/icons?selected=Material%20Icons%20Outlined%3Amore_horiz%3A">More Horiz</a> icon. Following is the corresponding code:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PopupMenuButton</span><span class="p">(</span>
    <span class="nl">icon:</span> <span class="n">Icon</span><span class="p">(</span><span class="n">Icons</span><span class="o">.</span><span class="na">more_horiz</span><span class="p">),</span>
    <span class="nl">iconSize:</span> <span class="mi">20</span><span class="p">,</span>
    <span class="nl">itemBuilder:</span> <span class="n">PopupMenu</span><span class="p">,</span>
<span class="p">);</span>
</code></pre></div></div>

<h4 id="mini-tracker-widget">Mini Tracker Widget</h4>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-card-row-2.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto; border: 1px solid #000;" />
    <!-- <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption> -->
</figure>

<p>The mini Tracker widget shows information about the Tracker used to create the log. Each Tracker has an emoji and a name, and the log also saves the Tracker’s value. For example, in the Record shown above, the Tracker is named Mood and has a rainbow emoji, and its value in this entry was 10 (on a 10-point scale).</p>

<p>The row looks empty because I plan to add more information to it once the Minutes’ Tracker page is complete.</p>

<h4 id="record-note">Record Note</h4>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-card-row-3.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto; border: 1px solid #000;" />
    <!-- <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption> -->
</figure>

<p>It is a simple Text widget that displays the note added to the record entry. Names mentioned with @ should link to that person’s details. I will add this feature later.</p>

<h4 id="people-mentioned-in-the-record">People mentioned in the Record</h4>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-card-row-4.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto; border: 1px solid #000;" />
    <!-- <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption> -->
</figure>

<p>In the record note, you can add people using the @ prefix. These people will be extracted and saved for later stats. I wanted to highlight the people mentioned in the note.</p>
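
<p>One way to do the extraction is with a regular expression. This is only a sketch: the function name <code class="language-plaintext highlighter-rouge">extractPeople</code> and the rule that a name consists of word characters are my assumptions, not necessarily how Minutes does it.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Collects the unique @mentions in a note, in order of first appearance.
// Assumes a name is a run of letters, digits, or underscores after the @.
List&lt;String&gt; extractPeople(String note) {
  final mention = RegExp(r'@(\w+)');
  final seen = &lt;String&gt;{}; // a set literal keeps insertion order
  for (final match in mention.allMatches(note)) {
    seen.add(match.group(1)!);
  }
  return seen.toList();
}

void main() {
  print(extractPeople('Lunch with @asha and @ravi, then a call with @asha'));
  // [asha, ravi]
}
</code></pre></div></div>

<p>A pattern this simple would also pick up the part after the @ in an email address, so a real implementation would probably need a stricter rule.</p>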

<p>The Flutter term for the widget displaying each person in the above screenshot is a <a href="https://material.io/components/chips">chip</a>. There are multiple kinds of chips. For the <a href="https://material.io/components/chips#action-chips">clickable functionality</a>, I used the <a href="https://api.flutter.dev/flutter/material/ActionChip-class.html"><code class="language-plaintext highlighter-rouge">ActionChip</code> class</a>. This <a href="https://medium.com/aubergine-solutions/flutter-widget-in-focus-chip-know-it-all-1c46217dca9b">medium article</a> has a good explanation of the chips.</p>

<p>I faced some hiccups in spacing, padding and wrapping of the chip widgets. The linked <a href="https://stackoverflow.com/q/57862775/2650427">SO answer</a> showed me how to use <code class="language-plaintext highlighter-rouge">padding</code> and <code class="language-plaintext highlighter-rouge">labelPadding</code> parameters. I also used the <code class="language-plaintext highlighter-rouge">visualDensity</code> parameter. In hindsight, I should have read the documentation first.</p>

<p>I also had to handle the wrapping of the chips onto a new line. The <a href="https://api.flutter.dev/flutter/widgets/Wrap-class.html"><code class="language-plaintext highlighter-rouge">Wrap</code> class</a> does exactly that. The <code class="language-plaintext highlighter-rouge">Wrap</code> was not playing well with the <code class="language-plaintext highlighter-rouge">Row</code> class, and this <a href="https://stackoverflow.com/q/55851918/2650427">answer</a> suggested eliminating the <code class="language-plaintext highlighter-rouge">Row</code> class. I handled the padding between multiline chips using the <code class="language-plaintext highlighter-rouge">runSpacing</code> and <code class="language-plaintext highlighter-rouge">spacing</code> parameters of the <code class="language-plaintext highlighter-rouge">Wrap</code> class. Here is how my code looks:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Wrap</span><span class="p">(</span>
    <span class="nl">runSpacing:</span> <span class="o">-</span><span class="mi">8</span><span class="p">,</span>
    <span class="nl">spacing:</span> <span class="mf">2.0</span><span class="p">,</span>
    <span class="nl">children:</span> <span class="kt">List</span><span class="o">.</span><span class="na">generate</span><span class="p">(</span><span class="n">people</span><span class="o">.</span><span class="na">length</span><span class="p">,</span> <span class="p">(</span><span class="n">index</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">PersonChip</span><span class="p">(</span><span class="nl">person:</span> <span class="n">people</span><span class="p">[</span><span class="n">index</span><span class="p">]);</span>
    <span class="p">}),</span>
<span class="p">);</span>
</code></pre></div></div>

<h4 id="contexts-mentioned-in-the-record">Contexts mentioned in the Record</h4>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-record-card-row-5.png" alt="" width="400" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto; border: 1px solid #000;" />
    <!-- <figcaption style="text-align: center">Row 2 further broken into Row and Column.</figcaption> -->
</figure>

<p>Identical to the last section, I used <code class="language-plaintext highlighter-rouge">Wrap</code> to show multiline context chips using all the required padding and spacing parameters.</p>

<h4 id="other-elements">Other Elements</h4>

<p>I used the <a href="https://api.flutter.dev/flutter/widgets/SizedBox-class.html"><code class="language-plaintext highlighter-rouge">SizedBox</code> class</a> to add blank space between layout elements. It can add space in both the vertical and horizontal directions.</p>

<p>The <a href="https://api.flutter.dev/flutter/material/Divider-class.html"><code class="language-plaintext highlighter-rouge">Divider</code> class</a> came in handy to add the horizontal line between the last two sections. There are, of course, a host of <a href="https://material.io/components/dividers">options</a> available for dividers.</p>

<p>The last thing was to make each record card tappable. Currently, a tap only shows the ripple effect, but I plan to open the record editing window on tap. I learned that I had to wrap my card widget inside a clickable widget. I had a <a href="https://stackoverflow.com/q/49959617/2650427">few options</a> available, out of which I chose the <a href="https://api.flutter.dev/flutter/material/InkWell-class.html"><code class="language-plaintext highlighter-rouge">InkWell</code> class</a>.</p>
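
<p>Here is a minimal sketch of a tappable card. The <code class="language-plaintext highlighter-rouge">TappableCard</code> name and its contents are placeholders. Note that this variation puts the <code class="language-plaintext highlighter-rouge">InkWell</code> inside the <code class="language-plaintext highlighter-rouge">Card</code> rather than around it, because the ripple is painted on the nearest <code class="language-plaintext highlighter-rouge">Material</code> ancestor, which in this arrangement is the card itself.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

// Hypothetical widget; the real Record card has far more content.
class TappableCard extends StatelessWidget {
  const TappableCard({Key? key, required this.onTap}) : super(key: key);

  final VoidCallback onTap;

  @override
  Widget build(BuildContext context) {
    return Card(
      // Clip the ripple to the rounded corners of the card.
      clipBehavior: Clip.antiAlias,
      child: InkWell(
        onTap: onTap, // without an onTap handler, InkWell shows no ripple
        child: const Padding(
          padding: EdgeInsets.all(12),
          child: Text('Record details go here'),
        ),
      ),
    );
  }
}
</code></pre></div></div>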

<h3 id="history-tab">History Tab</h3>

<p>I used the <a href="https://api.flutter.dev/flutter/widgets/ListView-class.html"><code class="language-plaintext highlighter-rouge">ListView</code> class</a> to display all the Records cards vertically in the History tab.</p>
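
<p>A rough sketch of such a list, assuming the records are available as an in-memory list. <code class="language-plaintext highlighter-rouge">HistoryList</code> is a hypothetical name, and plain strings stand in for the real record objects. <code class="language-plaintext highlighter-rouge">ListView.builder</code> builds each card only when it scrolls into view.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

// Hypothetical widget: one card per record, stacked vertically.
class HistoryList extends StatelessWidget {
  const HistoryList({Key? key, required this.records}) : super(key: key);

  // Stand-in for the real record objects.
  final List&lt;String&gt; records;

  @override
  Widget build(BuildContext context) {
    return ListView.builder(
      padding: const EdgeInsets.all(8),
      itemCount: records.length,
      itemBuilder: (context, index) {
        return Card(child: ListTile(title: Text(records[index])));
      },
    );
  }
}
</code></pre></div></div>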

<h2 id="lessons-learned">Lessons Learned</h2>

<p>While coding and debugging the errors, I learned many new things.</p>

<h3 id="flutterdart-syntax">Flutter/Dart Syntax</h3>

<p>Many were syntax related. For example, I need to use <code class="language-plaintext highlighter-rouge">x.runtimeType</code> to print the data type of a variable. I also observed the similarity between the if-else conditional statements in Dart and C programming language. The <a href="https://en.wikipedia.org/wiki/Dart_(programming_language)">wiki page</a> says that Dart has <em>C-style</em> syntax.</p>
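
<p>A small sketch of both points. The <code class="language-plaintext highlighter-rouge">typeName</code> helper is a made-up name for this example.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Returns the runtime type of any value as a string.
String typeName(Object? value) =&gt; value.runtimeType.toString();

void main() {
  final minutes = 42;
  final label = 'Minutes';

  print(typeName(minutes)); // int
  print(typeName(label)); // String

  // The if-else statement reads much like it does in C.
  if (minutes &gt; 30) {
    print('long session');
  } else {
    print('short session');
  }
}
</code></pre></div></div>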

<h3 id="variable-declaration">Variable Declaration</h3>

<p>Another learning was the difference between the keywords like <code class="language-plaintext highlighter-rouge">var</code>, <code class="language-plaintext highlighter-rouge">final</code>, and <code class="language-plaintext highlighter-rouge">const</code> for declaring a variable. If you explicitly assign a datatype (<code class="language-plaintext highlighter-rouge">String</code>, <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">List</code>, etc.) to a variable, its type is fixed: you can not assign a value of another type to it later. The <code class="language-plaintext highlighter-rouge">var</code> keyword does not change that. It only lets Dart infer the type from the initial value, after which the variable is just as strictly typed. To get a variable that can hold any data type throughout the code runtime, just like you can in Python, you declare it as <code class="language-plaintext highlighter-rouge">dynamic</code> (a <code class="language-plaintext highlighter-rouge">var</code> declared without an initial value also ends up <em>dynamic</em>). The <code class="language-plaintext highlighter-rouge">final</code> keyword indicates that the variable can be assigned only once. Finally, the <code class="language-plaintext highlighter-rouge">const</code> keyword makes the variable a compile-time constant, so its value is immutable. [ref: <a href="https://stackoverflow.com/q/12416507/2650427">SO</a>]</p>
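
<p>A short sketch of how these declarations behave in null-safe Dart. The <code class="language-plaintext highlighter-rouge">declarations</code> function and the variable names are invented for illustration; the commented-out lines are the ones the analyzer would reject.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Declares one variable per style and returns their final values.
List&lt;Object?&gt; declarations() {
  var count = 1; // type inferred as int; the value may change, the type may not
  count = 2;
  // count = 'two'; // compile-time error: a String is not an int

  dynamic anything = 1; // dynamic opts out of static type checks
  anything = 'now a string';

  final createdAt = DateTime.now(); // assigned once, computed at runtime
  // createdAt = DateTime.now(); // compile-time error: final is single-assignment

  const maxChips = 5; // compile-time constant

  return [count, anything, createdAt, maxChips];
}

void main() {
  print(declarations());
}
</code></pre></div></div>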

<h3 id="datetime-in-flutter">DateTime in Flutter</h3>

<p>Learning how to manipulate the <a href="https://api.dart.dev/stable/2.13.4/dart-core/DateTime-class.html"><code class="language-plaintext highlighter-rouge">DateTime</code> class</a> objects in Flutter was also fun. There was no built-in method to convert <code class="language-plaintext highlighter-rouge">DateTime</code> objects into custom-formatted strings. I had to install and import the <a href="https://pub.dev/packages/intl">intl package</a>. This package gave me the <a href="https://pub.dev/documentation/intl/latest/intl/DateFormat-class.html"><code class="language-plaintext highlighter-rouge">DateFormat</code> class</a> that had a lot of formatting options available. Python, in comparison, includes this kind of formatting in its core datetime class.</p>
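
<p>A minimal sketch of formatting with <code class="language-plaintext highlighter-rouge">DateFormat</code>, assuming <code class="language-plaintext highlighter-rouge">intl</code> is listed under <code class="language-plaintext highlighter-rouge">dependencies</code> in <code class="language-plaintext highlighter-rouge">pubspec.yaml</code>. The <code class="language-plaintext highlighter-rouge">formatTimestamp</code> helper and the pattern are made up for this example, not the ones used in Minutes.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:intl/intl.dart';

// Formats a timestamp for display, e.g. on a record card.
String formatTimestamp(DateTime timestamp) {
  // d = day of month, MMM = short month name, HH:mm = 24-hour time.
  return DateFormat('d MMM yyyy, HH:mm').format(timestamp);
}

void main() {
  print(formatTimestamp(DateTime(2021, 8, 15, 9, 30)));
}
</code></pre></div></div>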

<p>There was a moment while creating the card where I was puzzled by what was happening. I wanted to display the formatted timestamp on the card, but the timezones were getting mixed up: in some places it was IST, and in others it was UTC. To get a consistent timezone, I went down the timezone rabbit hole. There is no direct support for timezones in Flutter/Dart. There is a timezones package by Google, but using it means adding the timezone database as an asset in your app, and I wanted to avoid that. I realized that the IST dates already had the timezone info, which meant that I had missed giving that information to the remaining dates. Long debugging story short, I had to call <code class="language-plaintext highlighter-rouge">.toLocal()</code> on all the <code class="language-plaintext highlighter-rouge">DateTime</code> objects so that they pick the timezone information from my device.</p>
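
<p>A small sketch of that conversion, using only <code class="language-plaintext highlighter-rouge">dart:core</code>. The <code class="language-plaintext highlighter-rouge">forDisplay</code> helper is a hypothetical name, and the UTC value stands in for a timestamp read from storage.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Converts any timestamp to the device's local time zone before display.
DateTime forDisplay(DateTime timestamp) =&gt; timestamp.toLocal();

void main() {
  final stored = DateTime.utc(2021, 8, 15, 4, 0); // e.g. read back as UTC
  final shown = forDisplay(stored);

  print(stored.isUtc); // true
  print(shown.isUtc); // false
  // Both values describe the same instant; only the presentation differs.
  print(stored.isAtSameMomentAs(shown)); // true
}
</code></pre></div></div>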

<h3 id="hot-reloading">Hot Reloading</h3>

<p>While building the app, I use hot reloading to see the effects of my changes. I have run into two issues with it. The first is that it intermittently stops refreshing. When that happens, I first try restarting the debug session. If that does not solve the problem, I wipe my Android Emulator data and Cold Reboot it.</p>

<p>The second is slowness. At one point, hot reloading took a lot of time to refresh: the usual time of under 1 second increased to 7 seconds. Running <code class="language-plaintext highlighter-rouge">flutter clean</code> solved the problem then, but I am not sure if it is the general solution.</p>

<h3 id="code-organization">Code Organization</h3>

<p>The History tab has led to the creation of a lot of files in my directory. I could not decide between the two possibilities of the code arrangement. There were a lot of articles on this topic. The <a href="https://medium.com/flutter-community/flutter-code-organization-de3a4c219149">one</a> blog post I read gave me some clarity, but I will need to read more about it.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The History tab and the Record card for this tab were a decently large task that kept me busy during the weekend. It taught me a lot, and I am satisfied with the final results.</p>

<figure class="image">
    <!-- <img src="https://trigonaminima.github.io/assets/2021-07/minutes-homepage1.png" alt=""> -->
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-history.gif" alt="" width="300" style="margin: auto;" />
    <figcaption style="text-align: center">How Minutes' History screen looks now.</figcaption>
</figure>

<p>To make the cards look better, I will need to look into <a href="https://material.io/components/cards/flutter#theming-a-card">card themes</a> once the Minutes v1 is in place. The next post is going to be about either the Tracker or the People screen.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><summary type="html"><![CDATA[In the last two posts [1, 2], I had described the app I want for personal tracking and my progress on the app. I have to create five tabs: History, Dashboard, Tracker, People, and Settings. I have finished designing the Settings page. The Tracker page is in the making. Today, I am going to talk about the design of the History page.]]></summary></entry><entry><title type="html">Minutes - Building the Settings Page</title><link href="https://trigonaminima.github.io/2021/08/flutter_app_2/" rel="alternate" type="text/html" title="Minutes - Building the Settings Page" /><published>2021-08-08T00:00:00+00:00</published><updated>2021-08-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2021/08/flutter_app_2</id><content type="html" xml:base="https://trigonaminima.github.io/2021/08/flutter_app_2/"><![CDATA[<p>In the <a href="/2021/07/flutter_app_1/">previous post</a>, I had talked about the Quantified Self and how I started working on an app for that. I had shown the bottom navigation with the following five screens on the homepage:</p>

<ol>
  <li>History</li>
  <li>Dash</li>
  <li>Track</li>
  <li>People</li>
  <li>Settings</li>
</ol>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-screens.png" alt="" width="300" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">5 basic screens in Minutes</figcaption>
</figure>

<p>The screens did not contain any content. Tapping on each icon just switched to that screen (as the gif at the end of the <a href="/2021/07/flutter_app_1/">previous post</a> showed).</p>

<p>I had decided to work on the Tracker page as it is the primary function of the app. After reading my previous post, <a href="https://github.com/nickedes">Nikhil</a> wanted to help me out with the app. So we started working on the tracker screen together. He created a simple tracker card. I added on top of it. But soon, we were getting a lot of merge conflicts while rebasing our git branches. So I decided to work on the Settings screen while Nikhil finishes his part on the tracker screen. I will talk about the trackers when we have completed the first draft of it.</p>

<p>In this post, I will discuss the Settings screen.</p>

<h2 id="minutes---settings">Minutes - Settings</h2>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-07/minutes-settings.png" alt="" width="300" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">Blank Settings screen</figcaption>
</figure>

<p>I had a completely blank Settings screen. Since I was replicating the Settings page of Nomie, I knew various toggles and setting options I wanted to add to Minutes. I googled for a tutorial to get started and instead found <a href="https://pub.dev/packages/settings_ui">Settings UI for Flutter</a> package. This package has abstracted out everything and provided me with <code class="language-plaintext highlighter-rouge">SettingsList</code>, <code class="language-plaintext highlighter-rouge">SettingsSection</code>, <code class="language-plaintext highlighter-rouge">SettingsTile</code>, and <code class="language-plaintext highlighter-rouge">CustomSection</code> classes to build the page.</p>

<p>The <code class="language-plaintext highlighter-rouge">SettingsList</code> combines various <code class="language-plaintext highlighter-rouge">SettingsSection</code> widgets, each of which contains many <code class="language-plaintext highlighter-rouge">SettingsTile</code> widgets. If there is something you can not do using <code class="language-plaintext highlighter-rouge">SettingsSection</code>, you also have <code class="language-plaintext highlighter-rouge">CustomSection</code>, using which you can build anything your creative mind comes up with.</p>
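
<p>A rough sketch of that nesting. It assumes the 1.x API of <code class="language-plaintext highlighter-rouge">settings_ui</code>, in which titles are plain strings and tiles take <code class="language-plaintext highlighter-rouge">onPressed</code> and <code class="language-plaintext highlighter-rouge">onToggle</code> callbacks; parameter names may differ in other versions, so check the example app of the version you install. <code class="language-plaintext highlighter-rouge">SettingsSketch</code> and the tile contents are invented for illustration.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';
import 'package:settings_ui/settings_ui.dart';

class SettingsSketch extends StatefulWidget {
  @override
  _SettingsSketchState createState() =&gt; _SettingsSketchState();
}

class _SettingsSketchState extends State&lt;SettingsSketch&gt; {
  bool use24HourClock = false;

  @override
  Widget build(BuildContext context) {
    // SettingsList -&gt; SettingsSection -&gt; SettingsTile
    return SettingsList(
      sections: [
        SettingsSection(
          title: 'Locale', // assumed to be a plain String in 1.x
          tiles: [
            SettingsTile.switchTile(
              title: '24 hr clock',
              leading: Icon(Icons.access_time),
              switchValue: use24HourClock,
              onToggle: (bool value) {
                setState(() =&gt; use24HourClock = value);
              },
            ),
            SettingsTile(
              title: 'Week start day',
              subtitle: 'Monday',
              leading: Icon(Icons.calendar_today),
              onPressed: (BuildContext context) {},
            ),
          ],
        ),
      ],
    );
  }
}
</code></pre></div></div>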

<p>The documentation for the package is non-existent. There is only the <a href="https://github.com/yako-dev/flutter-settings-ui/tree/master/example">sample Flutter app</a> in its GitHub repo, which gets you started. I still had to dig into the package code in search of a few options for <code class="language-plaintext highlighter-rouge">SettingsTile</code>.</p>

<h3 id="using-the-package">Using the Package</h3>

<p>It needs to be added to the <code class="language-plaintext highlighter-rouge">pubspec.yaml</code> under the <code class="language-plaintext highlighter-rouge">dependencies</code> as follows:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">dependencies</span><span class="pi">:</span>
  <span class="na">settings_ui</span><span class="pi">:</span> <span class="s">^1.0.0</span>
</code></pre></div></div>

<p>When you save this file, the Dart VS Code extension fetches the package and adds it to the <code class="language-plaintext highlighter-rouge">pubspec.lock</code>. After installing the package, I imported it like any other library.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="s">'package:settings_ui/settings_ui.dart'</span><span class="o">;</span>
</code></pre></div></div>

<h3 id="settings-sections">Settings Sections</h3>

<p>I followed the sample <a href="https://github.com/yako-dev/flutter-settings-ui/tree/master/example">example</a> to build the working settings page for me and then added all the settings related to Minutes. These settings belong to the following sections:</p>

<ol>
  <li>Tracking: Settings related to trackers reside here.</li>
  <li>Locale: Settings like 24 hr clock or week start day.</li>
  <li>Security: Enabling fingerprint unlocking or locking app in the background</li>
  <li>Notifications: Various settings related to notifications</li>
  <li>Storage: Locations where you can store the tracking data - local, Dropbox, Google Drive, Remote server, etc.</li>
  <li>Import Data: Things related to importing data in Minutes</li>
  <li>Export Data: How to export data from Minutes</li>
  <li>Usage Stats: App usage stats. This section might move to the Dash screen later.</li>
  <li>Danger Zone: Deleting all the data</li>
</ol>

<p>These nine sections will likely evolve into something else as we develop the app further.</p>

<p>Currently, I have defined all the sections in a single dart file. I wanted to put each one in a separate file and then use it directly in the main file, but I couldn’t make it work. So that is a to-do for me as a part of code refactoring in the future.</p>

<h3 id="adding-image-assets-to-the-app">Adding Image Assets to the App</h3>

<p>In the example app from the <code class="language-plaintext highlighter-rouge">settings-ui</code> package, there is an image displayed at the footer. I wanted to add it to my Settings page as well. No matter what I did, it was not working. The code always threw an exception as it was not able to find the image. After spending 5 minutes on it, I searched for how to include image assets in a Flutter app. It turns out I needed to declare the image assets in the <code class="language-plaintext highlighter-rouge">pubspec.yaml</code> under <code class="language-plaintext highlighter-rouge">assets</code>, which sits inside the <code class="language-plaintext highlighter-rouge">flutter</code> section. Here is how to do it:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">flutter</span><span class="pi">:</span>
  <span class="na">assets</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">assets/settings.png</span>
</code></pre></div></div>
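
<p>Once the asset is declared, it can be loaded by its path. A minimal sketch, where <code class="language-plaintext highlighter-rouge">SettingsFooter</code> is a hypothetical widget name and the padding and height are arbitrary:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

// Hypothetical footer widget showing the declared asset.
class SettingsFooter extends StatelessWidget {
  @override
  Widget build(BuildContext context) {
    return Padding(
      padding: const EdgeInsets.all(16),
      // The path must match the entry listed under assets in pubspec.yaml.
      child: Image.asset('assets/settings.png', height: 50),
    );
  }
}
</code></pre></div></div>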

<h2 id="darklight-theme">Dark/Light Theme</h2>

<p>While searching for refactoring the sections code, I found a snippet where the author added the Dark/Light theme support to their app. I hadn’t even looked for it before because I thought it would take a lot of time to add that support. Surprisingly, it was just a short piece of code change that did the whole thing. In the <code class="language-plaintext highlighter-rouge">main.dart</code> file, within the <code class="language-plaintext highlighter-rouge">MaterialApp</code>, I had to add the following part:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="nl">theme:</span> <span class="n">ThemeData</span><span class="p">(</span>
        <span class="nl">brightness:</span> <span class="n">Brightness</span><span class="o">.</span><span class="na">light</span><span class="p">,</span>
      <span class="p">),</span>
      <span class="nl">darkTheme:</span> <span class="n">ThemeData</span><span class="p">(</span>
        <span class="nl">brightness:</span> <span class="n">Brightness</span><span class="o">.</span><span class="na">dark</span><span class="p">,</span>
      <span class="p">),</span>
</code></pre></div></div>

<p>The complete Minutes main app widget looks like this:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">Minutes</span> <span class="kd">extends</span> <span class="n">StatelessWidget</span> <span class="p">{</span>
  <span class="kd">static</span> <span class="kd">const</span> <span class="kt">String</span> <span class="n">_title</span> <span class="o">=</span> <span class="s">'Minutes'</span><span class="p">;</span>

  <span class="nd">@override</span>
  <span class="n">Widget</span> <span class="n">build</span><span class="p">(</span><span class="n">BuildContext</span> <span class="n">context</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">MaterialApp</span><span class="p">(</span>
      <span class="nl">title:</span> <span class="n">_title</span><span class="p">,</span>
      <span class="nl">theme:</span> <span class="n">ThemeData</span><span class="p">(</span>
        <span class="nl">brightness:</span> <span class="n">Brightness</span><span class="o">.</span><span class="na">light</span><span class="p">,</span>
      <span class="p">),</span>
      <span class="nl">darkTheme:</span> <span class="n">ThemeData</span><span class="p">(</span>
        <span class="nl">brightness:</span> <span class="n">Brightness</span><span class="o">.</span><span class="na">dark</span><span class="p">,</span>
      <span class="p">),</span>
      <span class="nl">home:</span> <span class="n">Home</span><span class="p">(),</span>
    <span class="p">);</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
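
<p>With both <code class="language-plaintext highlighter-rouge">theme</code> and <code class="language-plaintext highlighter-rouge">darkTheme</code> set, <code class="language-plaintext highlighter-rouge">MaterialApp</code> chooses between them based on its <code class="language-plaintext highlighter-rouge">themeMode</code>, which follows the device setting by default. A small sketch that spells this out; <code class="language-plaintext highlighter-rouge">ThemedApp</code> and its placeholder home screen are made up for this example:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

void main() =&gt; runApp(ThemedApp());

class ThemedApp extends StatelessWidget {
  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      theme: ThemeData(brightness: Brightness.light),
      darkTheme: ThemeData(brightness: Brightness.dark),
      // ThemeMode.system follows the device setting (the default).
      // ThemeMode.light or ThemeMode.dark would force one theme.
      themeMode: ThemeMode.system,
      home: Scaffold(body: Center(child: Text('Minutes'))),
    );
  }
}
</code></pre></div></div>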

<h2 id="conclusion">Conclusion</h2>

<p>In quite a short amount of time, the Settings page went from looking all yellow to what you see in the gif below.</p>

<figure class="image">
    <!-- <img src="https://trigonaminima.github.io/assets/2021-07/minutes-homepage1.png" alt=""> -->
    <img src="https://trigonaminima.github.io/assets/2021-08/minutes-settings.gif" alt="" width="300" style="margin: auto;" />
    <figcaption style="text-align: center">How Minutes Settings screen looks now</figcaption>
</figure>

<p>Nikhil should be close to completing his part on the tracker page. So next time we should hopefully look at the tracker page.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><summary type="html"><![CDATA[In the previous post, I had talked about the Quantified Self and how I started working on an app for that. I had shown the bottom navigation with the following five screens on the homepage:]]></summary></entry><entry><title type="html">Minutes - A Quantified Self App for Myself</title><link href="https://trigonaminima.github.io/2021/07/flutter_app_1/" rel="alternate" type="text/html" title="Minutes - A Quantified Self App for Myself" /><published>2021-07-25T00:00:00+00:00</published><updated>2021-07-25T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2021/07/flutter_app_1</id><content type="html" xml:base="https://trigonaminima.github.io/2021/07/flutter_app_1/"><![CDATA[<p>I’ve long since <a href="/2014/11/gamification-of-life/">2014</a> engaged in some form of <a href="https://en.wikipedia.org/wiki/Quantified_self">Quantified Self</a>. It started with a simple Libre Office worksheet. I tracked my finances, sleep, time spent on entertainment, walking steps, and many other things. Maintaining a worksheet had a lot of flaws. I didn’t want to put my data in a Google worksheet, so it always remained on my laptop. Sometimes, I forgot to enter the data, and other times, I couldn’t access the device. There was also a limit on what I could track through it. I wanted to know where my time was going within a day. To manually enter everything takes a lot of time. I wanted something automated. I found many useful tools while trying to automate these things.</p>

<p><a href="https://activitywatch.net/">ActivityWatch</a> is used to track everything you do on your laptop. It has quite an extensive set of features, but it was buggy. So I used it on and off.</p>

<p>I use <a href="https://sleep.urbandroid.org/">Sleep as Android</a> [<a href="https://play.google.com/store/apps/details?id=com.urbandroid.sleep&amp;hl=en_IN&amp;gl=US">Play Store</a>] to track my sleep. With the help of this app, I found out how many hours of sleep I require to feel fresh. I still use this app.</p>

<p>I use the <a href="https://play.google.com/store/apps/details?id=com.onetwoapps.mh">My Budget Book</a> app to track my expenses. It’s only available for the smartphone, so all the tracking happens within the phone. I manually enter all my expenses in this app right when I’ve spent the money. It works for all things except investments. I didn’t find a suitable solution for this. <a href="https://www.gnucash.org/">GnuCash</a> seems close, but the account setup and tracking are a bit involved.</p>

<p><a href="https://nomie.app/">Nomie</a> is a simple and quite helpful app. You create different “trackers” for things you want to track and just get started with tracking. Trackers can be count-based, time-based, or text-based. I’ve been using it for almost a year now. It gives a decent view of where I spent my day. It also has a rudimentary dashboard where you can add different graphs based on the trackers you’ve created. Notwithstanding the features, it’s slow, and I couldn’t agree with some of the design decisions of the app.</p>

<p><a href="https://ifttt.com/">IFTTT</a> is a tool to collect data from a phone and push it somewhere. I used it to track the time spent on calls and upload the records to Nomie (yes, Nomie also has an API). It used to be free before, but now it’s subscription-based, and their subscription model is costly. It also sometimes refused to work on my device. So I had to give up on it.</p>

<p>I didn’t stick with a lot of other apps because of a multitude of reasons. Some didn’t have data export functionality, meaning my data will always remain locked with them. Others didn’t let me use the app without an account. I am always sceptical about their security and data breaches, so I skipped those. Many others didn’t have cross-platform support. I either have to be stuck on my smartphone or my laptop.</p>

<p>After trying out so many tools, I know what exactly I want to track and how I want the app to look and function. So now, I’ve started working on an app for personal usage. The starting point for it is going to be Nomie. I’ll skip all the things I don’t want. I am going to call this app <strong>Minutes</strong>.</p>

<p>I am going to use <a href="https://flutter.dev/">Flutter</a> for this project. Flutter will give me native Android and web apps. That handles the cross-platform support I need for the app. I am going to document my process of building this app. I’ll discuss all the design decisions along with the final output. I’ve never created a smartphone app before, so this blog series is also going to be a record of my learnings. Flutter uses the <a href="https://en.wikipedia.org/wiki/Dart_(programming_language)">Dart programming language</a>, so this is also an opportunity to learn Dart. Yay.</p>

<p>Let’s get started with the first part.</p>

<h2 id="flutter-documentation">Flutter Documentation</h2>

<p>The <a href="https://flutter.dev/docs/get-started">getting started</a> page <em>really</em> gets you started with everything required. It takes time to set up everything as it downloads many things. I did the setup on Linux and macOS. On Linux I had to sort out a few path issues, but macOS was straightforward. The guide helps you set up your IDE as well. It covers three editor setups: Android Studio and IntelliJ, VS Code, and Emacs. I am sure there’d be support for other editors as well. I use VS Code, so everything was smooth for me.</p>

<p>Building the first app was also very effortless because of this well-documented <a href="https://flutter.dev/docs/get-started/codelab">tutorial</a>. The dopamine hit of installing Flutter and creating a hello world app in a single sitting was satisfying.</p>

<p>I understood the basics of <code class="language-plaintext highlighter-rouge">Widget</code>, <code class="language-plaintext highlighter-rouge">StatelessWidget</code>, and <code class="language-plaintext highlighter-rouge">StatefulWidget</code>. While creating the <code class="language-plaintext highlighter-rouge">State</code> class corresponding to the <code class="language-plaintext highlighter-rouge">StatefulWidget</code>, the class name is prefixed, by default, with an underscore (<code class="language-plaintext highlighter-rouge">_</code>). In Dart, that makes the class private to its library. It’s a bit similar to Python, where class methods prefixed with a single underscore are informally considered non-public methods. There’s also a good collection of <a href="https://fonts.google.com/icons">icons</a> available in Google’s Material design features, which you can directly access using <code class="language-plaintext highlighter-rouge">Icons.&lt;icon_id&gt;</code>. There were other things that I didn’t understand and just copied from the tutorial.</p>
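
<p>A minimal sketch of the widget and state pairing described above. The <code class="language-plaintext highlighter-rouge">Counter</code> widget is a made-up example, not part of Minutes or the tutorial.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

// The widget itself is immutable; it only creates its State object.
class Counter extends StatefulWidget {
  @override
  _CounterState createState() =&gt; _CounterState();
}

// The leading underscore makes this class private to its library,
// which is usually a single file.
class _CounterState extends State&lt;Counter&gt; {
  int _count = 0;

  @override
  Widget build(BuildContext context) {
    return TextButton.icon(
      icon: Icon(Icons.add), // icons are accessed as Icons.&lt;icon_id&gt;
      label: Text('Tapped $_count times'),
      // setState tells Flutter to rebuild this widget with the new count.
      onPressed: () =&gt; setState(() =&gt; _count++),
    );
  }
}
</code></pre></div></div>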

<p>Hot reloading is also a cool feature. It hugely reduces the development time.</p>

<h2 id="minutes-homepage">Minutes Homepage</h2>

<p>Let’s discuss the first looks of Minutes. After working through the basics, I started cloning Nomie. I first want to replicate the tracker tab. That is how it looks in Nomie:</p>

<figure class="image">
    <img src="https://trigonaminima.github.io/assets/2021-07/nomie-tracker-tab.jpeg" alt="" width="300" style="text-align: center; margin: auto" />
    <figcaption style="text-align: center">Nomie Tracker Page, copied from <a href="https://nomie.app/">Nomie website</a>.</figcaption>
</figure>

<p>The first thing from this page I am going to copy is the bottom navigation. I followed this 2018 <a href="https://willowtreeapps.com/ideas/how-to-use-flutter-to-build-an-app-with-bottom-navigation">article</a>. It almost got me to the final thing. There were a few hiccups.</p>

<p>First, I had to remove the deprecated parts of the code while building each tab in the bottom navigation bar. The original snippet was this:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BottomNavigationBarItem</span><span class="p">(</span>
    <span class="nl">icon:</span> <span class="n">Icon</span><span class="p">(</span><span class="n">Icons</span><span class="o">.</span><span class="na">home</span><span class="p">),</span>
    <span class="nl">title:</span> <span class="n">Text</span><span class="p">(</span><span class="s">'Home'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>With the help of the <a href="https://api.flutter.dev/flutter/material/BottomNavigationBar-class.html">flutter dev</a> documentation I converted it to this:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BottomNavigationBarItem</span><span class="p">(</span>
    <span class="nl">icon:</span> <span class="n">Icon</span><span class="p">(</span><span class="n">Icons</span><span class="o">.</span><span class="na">home</span><span class="p">),</span>
    <span class="nl">label:</span> <span class="s">'Home'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The second problem was much more worrying. When there were three buttons (as in the blog post), everything worked fine. However, when I added two more, all the buttons except the active one became invisible. You can see how weird it looks:</p>

<figure>
    <img src="https://trigonaminima.github.io/assets/2021-07/navbar_issue1.png" alt="" width="300" style="text-align: center; object-fit: fill || contain || cover || none || scale-down; margin: auto" />
    <figcaption style="text-align: center">All 4 icons are invisible except the active tab.</figcaption>
</figure>

<p>I looked around and found a GitHub <a href="https://github.com/flutter/flutter/issues/13642">issue</a> on the Flutter project. When there are more than three <code class="language-plaintext highlighter-rouge">BottomNavigationBarItem</code> items and the type is not specified, the type of the <code class="language-plaintext highlighter-rouge">BottomNavigationBar</code> changes from <code class="language-plaintext highlighter-rouge">fixed</code> to <code class="language-plaintext highlighter-rouge">shifting</code>. This change makes the labels and icons render in white. One of the comments explains the <a href="https://github.com/flutter/flutter/issues/13642#issuecomment-371875044">reasoning</a> behind it. So, it’s a feature, not a bug. Explicitly specifying the type as <code class="language-plaintext highlighter-rouge">fixed</code> fixed the problem:</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bottomNavigationBar:</span> <span class="n">BottomNavigationBar</span><span class="p">(</span>
    <span class="nl">type:</span> <span class="n">BottomNavigationBarType</span><span class="o">.</span><span class="na">fixed</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">)</span>
</code></pre></div></div>
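
<p>For context, here is a fuller sketch of a five-tab bar with the type set explicitly. The icons, the starting index, and the placeholder body are assumptions made for this example rather than the actual Minutes code.</p>

<div class="language-dart highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import 'package:flutter/material.dart';

class Home extends StatefulWidget {
  @override
  _HomeState createState() =&gt; _HomeState();
}

class _HomeState extends State&lt;Home&gt; {
  int _currentIndex = 2; // start on the Track tab

  static const _labels = ['History', 'Dash', 'Track', 'People', 'Settings'];
  static const _icons = [
    Icons.history,
    Icons.dashboard,
    Icons.timer,
    Icons.people,
    Icons.settings,
  ];

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      // Placeholder body: shows the name of the selected tab.
      body: Center(child: Text(_labels[_currentIndex])),
      bottomNavigationBar: BottomNavigationBar(
        // Without this, four or more items switch the bar to shifting.
        type: BottomNavigationBarType.fixed,
        currentIndex: _currentIndex,
        onTap: (index) =&gt; setState(() =&gt; _currentIndex = index),
        items: [
          for (var i = 0; i &lt; _labels.length; i++)
            BottomNavigationBarItem(icon: Icon(_icons[i]), label: _labels[i]),
        ],
      ),
    );
  }
}
</code></pre></div></div>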

<p>And this is how the Minutes Homepage tabs look right now. When you open the app, the first tab is the “Track” tab. The colours are only there to show some dummy action on a button tap.</p>

<figure class="image">
    <!-- <img src="https://trigonaminima.github.io/assets/2021-07/minutes-homepage1.png" alt=""> -->
    <img src="https://trigonaminima.github.io/assets/2021-07/minutes-homepage.gif" alt="" width="300" style="margin: auto;" />
    <figcaption style="text-align: center">Minutes Homepage tabs.</figcaption>
</figure>

<p>The next step is to add trackers to the Track tab.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><summary type="html"><![CDATA[I’ve long since 2014 engaged in some form of Quantified Self. It started with a simple Libre Office worksheet. I tracked my finances, sleep, time spent on entertainment, walking steps, and many other things. Maintaining a worksheet had a lot of flaws. I didn’t want to put my data in a Google worksheet, so it always remained on my laptop. Sometimes, I forgot to enter the data, and other times, I couldn’t access the device. There was also a limit on what I could track through it. I wanted to know where my time was going within a day. To manually enter everything takes a lot of time. I wanted something automated. I found many useful tools while trying to automate these things.]]></summary></entry><entry><title type="html">Building and Hosting a Podcast Website for Free</title><link href="https://trigonaminima.github.io/2021/02/dspods_website/" rel="alternate" type="text/html" title="Building and Hosting a Podcast Website for Free" /><published>2021-02-18T00:00:00+00:00</published><updated>2021-02-18T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2021/02/dspods_website</id><content type="html" xml:base="https://trigonaminima.github.io/2021/02/dspods_website/"><![CDATA[<p>A few months back, I was looking for some good Data Science podcasts, and the only sources I found were blog posts with reviews about a few podcasts. Many of those podcasts were no longer active. A few hosts explicitly mentioned that on their homepages, and some just stopped creating new episodes. Apple Podcasts, Spotify, Pocketcasts - where you can subscribe to a podcast - none of them showed if a podcast was inactive. You’ve to deduce it on your own. These services also don’t tell you if a podcast is about Data Science, Machine Learning, or ML Engineering.</p>

<p>So I decided to create a website having a regularly updated list of all such podcasts. I enjoyed developing this website. This is coming from someone who has never done any decent web development work except tweaking some HTML or CSS here and there. Here’s how I made this free website.</p>

<h2 id="guiding-principles">Guiding Principles</h2>

<p>I wanted my whole setup to be uncomplicated and quick. Once the design is complete, I don’t want to touch HTML/CSS/JS files again unless for a tweak or update, and even then, it should be minimal. It should be easy to add a new podcast. Most importantly, I wanted most of the things to be automated.</p>

<h2 id="static-pages">Static Pages</h2>

<p>I have no experience with JavaScript, so I skipped any JS Frameworks. I decided to go ahead with a vanilla static site. I used <a href="https://jekyllrb.com/">Jekyll</a> to generate static pages that one can host anywhere. I am familiar with Jekyll because I use it to build this blog. Another advantage is that GitHub makes it frictionless to build and host a static website using <a href="https://pages.github.com/">GitHub pages</a>. Lastly, adding a new blog entry is straightforward: add a simple text file with YAML front matter and boom.</p>

<h2 id="website-design">Website Design</h2>

<p>I rarely start working on things from scratch. I begin by copying something that exists and then take it beyond the initial design. Jekyll provides a lot of <a href="https://jekyllthemes.io/free">free themes</a> created by others. I started with <a href="https://www.wowthemes.net/mediumish-free-jekyll-template/">Mediumish Jekyll Theme</a> as my first step. It used <a href="https://getbootstrap.com/">Bootstrap</a> to design the interface, which was fortunate for me, as I had a tiny bit of experience in it.</p>

<p>I made significant modifications to make the website look the way I had imagined. I knew a large chunk of visitors would visit the website from their mobile devices, so I had to make the website behave properly with small screens, a.k.a. design it responsively. Making that homepage grid behave appropriately with different widths was a bunch of work. Firefox’s <a href="https://developer.mozilla.org/en-US/docs/Tools/Responsive_Design_Mode">Responsive Design Mode</a> helped a lot. With time I figured out all the parts. I learned many new things: <a href="https://developer.mozilla.org/en-US/docs/Glossary/Viewport">Viewport</a>, <a href="https://www.youtube.com/watch?v=fYq5PXgSsbE&amp;t">Flexbox</a>, the background of the <a href="https://www.smashingmagazine.com/2011/01/guidelines-for-responsive-web-design/">Responsive Design from Smashing Magazine</a>, about Media Queries, and more. I also connected with a friend from my previous company for a Flexbox issue I was facing. In the end, I was satisfied with the result.</p>

<p>Designing the website was an involved process for me. How do you ensure that it looks how you intend on all devices? Is there something like <em>unit-tests</em> for the front-end or design of a website? I know there are tools like Selenium and others which can open a website in a headless browser and aid in testing its functionality. I found this Stack Exchange question on the topic - <a href="https://sqa.stackexchange.com/questions/32837/what-are-general-tips-to-test-a-static-website">What are general tips to test a static website?</a>. Most of the answers were either manual checklists or websites which evaluate your website on a few metrics.</p>

<p>After I had hosted the website, I evaluated my website through the following links and made a multitude of changes to improve the score:</p>

<ul>
  <li><a href="https://wave.webaim.org/">WAVE Web Accessibility Evaluation Tool</a></li>
  <li><a href="https://web.dev/measure/">Google’s Measure Tool</a> for performance, accessibility, use of best practices, and SEO.</li>
</ul>

<p>For each podcast, I display a card on the homepage with its image and other details. I shrank the large images with <a href="https://tinypng.com/">tinypng</a> and <a href="https://www.gimp.org/">GIMP</a> (and learned the basics of GIMP along the way). I minified the CSS and JS files, so they take less time to download. I added alt text for images and made most of the suggested accessibility changes. In the process of removing these latencies, I discovered the <a href="https://developers.google.com/web/tools/chrome-devtools/coverage">Coverage Tab in Chrome DevTools</a>. I was using Firefox developer tools, so I didn’t know Chrome had something like this. It was cool.</p>

<h2 id="hosting">Hosting</h2>

<p>As I mentioned earlier, I used <a href="https://pages.github.com/">GitHub Pages</a> to host my website. GitHub supports Jekyll right out of the box: I didn’t even have to upload the final HTML files. GitHub builds the site on its own on every push to the remote repository. One problem was that the hosted website would be available at <a href="https://shivamrana.me/dspods/">shivamrana.me/dspods</a>, and I didn’t want that URL. <a href="https://www.netlify.com/">Netlify</a> to the rescue. Netlify hosts a static website and gives you a domain like <code class="language-plaintext highlighter-rouge">dspods.netlify.app</code>, where we can put anything in place of <code class="language-plaintext highlighter-rouge">dspods</code>. It also links directly with a GitHub repository. Give it build instructions and permission to pull the repository. It’ll pull the code on every push, build it, and host it at the specified domain. Of course, you can use your own domain name with both GitHub and Netlify.</p>

<p>So now I’ve automated the continuous deployment pipeline. Update the code, and push it to the GitHub repository. GitHub will build it and host it at the <a href="https://shivamrana.me/dspods/">shivamrana.me/dspods</a> URL. Netlify will also build it and host it at <a href="https://dspods.netlify.app/">dspods.netlify.app</a>. The GitHub build is redundant, but I don’t know of a way to stop it. Thus, the whole pipeline is fully automated. Pretty neat, huh?</p>

<h2 id="regular-podcast-updates">Regular Podcast Updates</h2>

<p>The only thing remaining is the daily podcast updates. I can’t manually update and push the changes daily. It needs to be automated. <a href="https://docs.github.com/en/actions">GitHub Actions</a> FTW. It was a complete game-changer. Without this, I couldn’t have made this website free.</p>

<p>How does it work? Create a workflow with the instructions to get the updates, commit them, and push them to the repository. A Python script parses the RSS feed of each podcast for updates. The workflow runs daily at midnight. Once the commit is pushed, GitHub and Netlify then get on with their jobs to build and deploy the website.</p>
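
<p>The update check can be sketched roughly as follows. This is a minimal illustration, assuming each podcast exposes a standard RSS 2.0 feed whose <code class="language-plaintext highlighter-rouge">item</code> elements carry <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">pubDate</code>. The function names and the toy feed are made up for this example; the actual script in the repository may work differently.</p>

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def build_sample_feed():
    """Build a tiny RSS 2.0 document (hypothetical data) to stand in for a real feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    episodes = [
        ("Episode 1", "Mon, 04 Jan 2021 00:00:00 +0000"),
        ("Episode 2", "Mon, 15 Feb 2021 00:00:00 +0000"),
    ]
    for title, pub_date in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "pubDate").text = pub_date
    return ET.tostring(rss, encoding="unicode")


def latest_episode(feed_xml):
    """Return (title, published datetime) of the newest item, or None if there is none."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date:
            # RSS dates use the RFC 822 format, which email.utils can parse.
            published = parsedate_to_datetime(pub_date)
            episodes.append((published, item.findtext("title", default="")))
    if not episodes:
        return None
    published, title = max(episodes)
    return title, published


if __name__ == "__main__":
    print(latest_episode(build_sample_feed()))
```

<p>A scheduled workflow could run a script like this for every podcast, write the newest episode details into the podcast’s data file, and commit only when something changed.</p>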

<p>The whole pipeline is automated now.</p>

<h2 id="analytics">Analytics</h2>

<p>I’ve also added <a href="https://analytics.google.com/analytics/web/">Google Analytics</a> to see the kind of response there is for the website. You add a small snippet to your website, and it handles all the analytics of your website.</p>

<h2 id="reception-and-whats-next">Reception and What’s Next?</h2>

<p>When I posted the link on <a href="https://www.reddit.com/r/MachineLearning/">r/MachineLearning</a>, people liked the website and gave me a lot more podcasts to add. You can read the discussion <a href="https://www.reddit.com/r/MachineLearning/comments/liz35x/d_podcasts_about_machine_learning_and_data_science/">here</a>.</p>

<p>The next thing I want to do is to make the podcast page a bit more informative. I want to provide some insights about the podcasts to the users. The users should understand what kind of things the podcast has discussed in the past.</p>

<p>I also want to work on the discoverability of the podcasts on the homepage. How can a user narrow down the podcast she may want to listen to? If you have any suggestions regarding this, I am happy to hear them.</p>

<p>My hope for this website is to become a go-to place where people find the next Data Science podcast they want to listen to.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I wanted to write about how easy it is to build a free website today. This post was my attempt at that. I hope this will be useful to you. Here’s a laundry list of the tools I’ve used to build this website:</p>

<ul>
  <li>Free static site generator: <a href="https://jekyllrb.com/">Jekyll</a></li>
  <li>Free hosting: <a href="https://pages.github.com/">GitHub Pages</a> and <a href="https://www.netlify.com/">Netlify</a></li>
  <li>Free domain: <a href="https://www.netlify.com/">Netlify</a></li>
  <li>Free website evaluation: <a href="https://wave.webaim.org/">WAVE Web Accessibility Evaluation Tool</a>, <a href="https://web.dev/measure/">Google’s Measure Tool</a></li>
  <li>Free automated website updates: <a href="https://docs.github.com/en/actions">GitHub Actions</a></li>
  <li>Free Analytics: <a href="https://analytics.google.com/analytics/web/">Google Analytics</a></li>
</ul>

<p>The open-source website having a collection of podcasts about Machine Learning, Data Science, and ML Engineering is available at - <a href="https://dspods.netlify.app/">DSPods</a>. The website offers filtering and search options to find podcasts according to your needs. The source code is available at the <a href="https://github.com/TrigonaMinima/dspods">GitHub Repo</a>. Adding a new podcast is easy: add a simple text file with YAML front matter.</p>]]></content><author><name>Shivam Rana</name></author><category term="Podcast" /><summary type="html"><![CDATA[A few months back, I was looking for some good Data Science podcasts, and the only sources I found were blog posts with reviews about a few podcasts. Many of those podcasts were no longer active. A few hosts explicitly mentioned that on their homepages, and some just stopped creating new episodes. Apple Podcasts, Spotify, Pocketcasts - where you can subscribe to a podcast - none of them showed if a podcast was inactive. You’ve to deduce it on your own. These services also don’t tell you if a podcast is about Data Science, Machine Learning, or ML Engineering.]]></summary></entry><entry><title type="html">Lang. Identification in Code Mixed Text - Lit. Review 3</title><link href="https://trigonaminima.github.io/2020/11/lang-id-3/" rel="alternate" type="text/html" title="Lang. Identification in Code Mixed Text - Lit. Review 3" /><published>2020-11-06T00:00:00+00:00</published><updated>2020-11-06T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/11/lang-id-3</id><content type="html" xml:base="https://trigonaminima.github.io/2020/11/lang-id-3/"><![CDATA[<p>I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. 
After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.</p>

<p><strong>What is Language Identification, you ask?</strong></p>

<p><a href="https://en.wikipedia.org/wiki/Code-switching">Code-switching</a> (and code-mixing) is the use of two or more languages in a conversation often employed by multilingual users in informal media like personal chats, Twitter, Facebook, Reddit. Language identification in the code-mixed text is the process of labeling each word with the language it belongs to. For example, in a <a href="/2018/06/hinglish-and-transliteration/">Hinglish text</a> (code-switching between Hindi and English), a sentence like <code class="language-plaintext highlighter-rouge">Hindi ke liye it makes no sense</code> should be labeled as <code class="language-plaintext highlighter-rouge">Hindi\Hi ke\Hi liye\Hi it\En makes\En no\En sense\En</code>.</p>
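
<p>To make the labeling format concrete, here is a small sketch that pairs each token with a language tag and prints it in the <code class="language-plaintext highlighter-rouge">token\Tag</code> style used above. The helper name <code class="language-plaintext highlighter-rouge">format_labels</code> is an assumption for illustration, and the tags are supplied by hand rather than predicted by a model.</p>

```python
def format_labels(tokens, labels):
    """Join each token with its language tag using a backslash separator."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(f"{tok}\\{lab}" for tok, lab in zip(tokens, labels))


sentence = "Hindi ke liye it makes no sense".split()
# Hand-written gold labels: Hi = Hindi, En = English.
labels = ["Hi", "Hi", "Hi", "En", "En", "En", "En"]

if __name__ == "__main__":
    print(format_labels(sentence, labels))
```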

<p><br /></p>
<hr />

<p><br /></p>

<ol>
  <li>2013 - <a href="/2020/10/lang-id-1/">Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods</a></li>
  <li>2013 - <a href="/2020/10/lang-id-2/">Query word labeling and Back Transliteration for Indian Languages: Shared task system description</a></li>
  <li>2014 - Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System (*this post*)</li>
</ol>

<p><br /></p>
<hr />

<p><br /></p>

<h3 id="publication">Publication</h3>

<p><a href="https://www.aclweb.org/anthology/W14-3908/">Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System</a></p>

<h3 id="summary">Summary</h3>

<ul>
  <li>The objective is to <strong>identify the individual language</strong> of each word in a code-mixed text for the following four language pairs: English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic Dialects (Ar-Ar).</li>
  <li>They proposed a <strong>CRF-based approach</strong> inspired by the performance of CRFs in the previous work (discussed <a href="/2020/10/lang-id-1/">here</a>.)</li>
  <li>The methods developed use various <strong>token-based features</strong> which can be easily replicated across languages as they are not language-specific.</li>
  <li>The system relies on annotated data for <strong>supervised training</strong>, and also on language lexicons, if available.</li>
  <li>The system achieves <strong>accuracy</strong> ranging from <strong>80%-95%</strong> across the four language pairs.</li>
</ul>

<h3 id="key-challenges-addressed">Key Challenges Addressed</h3>

<ul>
  <li><strong>Easy Replication</strong>: The features used for training the model are computed using the tokens themselves as they are based on the token context, language lexicon, other character features, and token n-grams. This simplicity enables the easy replication of the methods across various languages.</li>
</ul>

<h3 id="constraintsassumptions">Constraints/Assumptions</h3>

<ul>
  <li>Since the two languages present in the documents are known, models are aware of the <strong>two languages a priori.</strong></li>
  <li>The system assumes that the <strong>annotated training data</strong> is available for each language pair. In this work, the shared task organizers provided the annotated tweets.</li>
</ul>

<h3 id="dataset">Dataset</h3>

<p>They considered the following four language pairs:</p>

<ol>
  <li>English-Spanish</li>
  <li>English-Nepali</li>
  <li>English-Mandarin</li>
  <li>Standard Arabic-Arabic Dialects</li>
</ol>

<ul>
  <li><strong>Released Data (Training and Testing)</strong>
    <ul>
      <li>
        <p>Twitter data downloaded using the provided Ruby script. The following table summarizes the number of tweets downloaded for each language.</p>

        <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_released.png" alt="" style="display:block;text-align:center" />
  <!-- <figcaption style="text-align: center">Figure 1:</figcaption> -->
  </figure>
      </li>
      <li>
        <p>Pre-processing</p>
        <ul>
          <li>They excluded the deleted or private tweets from the training set.</li>
          <li>They fixed a few space tokenization errors by replacing them with an underscore.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong>External Training Data</strong>
    <ul>
      <li>Named Entities
        <ul>
          <li>English from <a href="https://wiki.dbpedia.org/">DBpedia</a> instance types - Agent, Award, Device, Holiday, Language, MeansOfTransportation, Name, PersonFunction, Place, and Work</li>
          <li>Spanish from <a href="https://wiki.dbpedia.org/">DBpedia</a> instance types - Agent, Award, Device, Holiday, Language, MeansOfTransportation, Name, PersonFunction, Place, and Work</li>
        </ul>
      </li>
      <li>Word frequency lists from <a href="https://www.aclweb.org/anthology/L06-1396/">Corpus Portal for Search in Monolingual Corpora</a>
        <ul>
          <li>Available only for English and Spanish languages.</li>
          <li>Pre-processing: they removed the words containing special characters and numbers from the list.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h3 id="methodology">Methodology</h3>

<h4 id="character-n-grams-based-feature-engineering">Character n-grams based Feature Engineering</h4>

<p>The character n-gram classifiers are used as features in the final system.</p>

<ul>
  <li><strong>Objective</strong>: Two character n-grams classifiers for each language-pair</li>
  <li><strong>Classes</strong>: For each language in the pair, they trained a model with binary class classification into two categories: <code class="language-plaintext highlighter-rouge">lang1</code> and <code class="language-plaintext highlighter-rouge">others</code>.</li>
  <li><strong>Training data</strong>:
    <ol>
      <li>6000 +ve examples randomly sampled from the training set for <code class="language-plaintext highlighter-rouge">lang1</code>.</li>
      <li>6000 -ve samples randomly sampled from both the training set and word lists of multiple languages.</li>
    </ol>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_train_data.png" alt="" style="display:block;text-align:center" />
  <figcaption style="text-align: center">Data to train character n-gram classifiers</figcaption>
  </figure>
  </li>
  <li><strong>Eval data</strong>: Nothing mentioned</li>
  <li><strong>Feature Engineering</strong>:
    <ol>
      <li>Character unigrams</li>
      <li>Character bigrams</li>
      <li>Character trigrams</li>
      <li>Character 4-grams</li>
      <li>Character 5-grams</li>
      <li>Full word</li>
    </ol>
  </li>
  <li><strong>Feature Selection</strong>:
    <ul>
      <li>All the features were selected, as <a href="/2020/10/lang-id-1/">past research</a> showed them to be effective.</li>
    </ul>
  </li>
  <li><strong>Model Selection</strong>:
    <ul>
      <li>The model considered based on <a href="/2020/10/lang-id-1/">previous research</a>: Maximum Entropy (Logistic Regression)</li>
      <li><a href="http://mallet.cs.umass.edu/index.php">MALLET</a> was used to train all the classifiers.</li>
    </ul>
  </li>
</ul>
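
<p>The feature list above can be sketched in a few lines of Python. This is only an illustration of character n-gram extraction (unigrams through 5-grams plus the full word); the function name and the string format of the features are assumptions, no word-boundary padding is applied, and the paper’s classifiers were trained with MALLET rather than with this code.</p>

```python
def char_ngram_features(word, max_n=5):
    """Return character n-grams of length 1..max_n, followed by the full word."""
    features = []
    for n in range(1, max_n + 1):
        # Words shorter than n simply contribute no n-grams of that length.
        for start in range(len(word) - n + 1):
            features.append(f"{n}gram={word[start:start + n]}")
    features.append(f"word={word}")
    return features


if __name__ == "__main__":
    print(char_ngram_features("liye"))
```

<p>Each word’s feature list would then be fed to a binary classifier (here, Maximum Entropy) that decides between <code class="language-plaintext highlighter-rouge">lang1</code> and <code class="language-plaintext highlighter-rouge">others</code>.</p>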

<h4 id="final-labeling-system">Final Labeling System</h4>

<ul>
  <li><strong>Objective</strong>: Training a language identification system for each language pair.</li>
  <li><strong>Classes</strong>: For each language pair, the labeling system classifies each token into the following six classes:
    <ol>
      <li><code class="language-plaintext highlighter-rouge">lang1</code></li>
      <li><code class="language-plaintext highlighter-rouge">lang2</code></li>
      <li><code class="language-plaintext highlighter-rouge">mixed</code>- tokens with morphemes from both lang1 and lang2.</li>
      <li><code class="language-plaintext highlighter-rouge">ne</code>- named entities</li>
      <li><code class="language-plaintext highlighter-rouge">ambiguous</code>- a word whose label the model cannot determine with certainty in the given context</li>
      <li><code class="language-plaintext highlighter-rouge">others</code>- smileys, punctuations, etc.</li>
    </ol>
  </li>
  <li><strong>Training Data</strong>
    <ul>
      <li>Released data from the shared task (4 language pairs)</li>
      <li>Named entities dataset</li>
      <li>Word frequencies</li>
    </ul>
  </li>
  <li><strong>Eval data</strong>: Released data from the shared task (4 language pairs)</li>
  <li>
    <p><strong>Feature Engineering</strong>: The following table shows the list of features created and finally selected for each language pair.</p>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_features.png" alt="" style="display:block;text-align:center" />
  <!-- <figcaption style="text-align: center">Figure 1:</figcaption> -->
  </figure>

    <p>NA: features not applicable or not available.
  B/U: Bigram/Unigram feature, including the current token.</p>
  </li>
  <li>
    <p><strong>Feature Selection</strong>: Authors used the 3-fold cross-validation on released training sets to come up with the optimal features reported in the above table for each language pair.</p>

    <ol>
      <li>Using all the features of the previous tokens in the bigram context hurt the performance.</li>
      <li>The context feature of the previous three and next three tokens was useful.</li>
      <li>For En-Es, the character n-gram classifier feature was useful.</li>
      <li>For En-Cn, special character features were useful.</li>
      <li>For En-Ne, no particular feature set influenced the classification.</li>
      <li>Lexicon features were only available for English and Spanish.</li>
    </ol>
  </li>
  <li><strong>Model Selection</strong>:
    <ul>
      <li>Based on the <a href="/2020/10/lang-id-1/">previous research</a> in language identification, they considered the CRF++ model.</li>
      <li><a href="http://mallet.cs.umass.edu/index.php">MALLET</a> was used to train all the classifiers.</li>
    </ul>
  </li>
  <li>
    <p><strong>Model Evaluation</strong>: The evaluation metric is the <strong>accuracy</strong> in a 3-fold cross-validation on the training sets. The following table gives the final accuracies for various combinations of features.</p>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_model_eval.png" alt="" style="display:block;text-align:center" />
  <!-- <figcaption style="text-align: center">Figure 1:</figcaption> -->
  </figure>

    <ol>
      <li>Lower accuracy of the Ar-Ar system
        <ul>
          <li>Lexicon features of dialectal Arabic were unavailable.</li>
          <li>Both dialects use the same script.</li>
          <li>Character n-gram classifier features reduced the accuracy. The hypothesis is that the dialects may not show a drastic difference in their character n-gram distributions.</li>
        </ul>
      </li>
      <li>The En-Cn dataset had En words written in Roman script and Cn words in Chinese script. The feature <code class="language-plaintext highlighter-rouge">CHR0</code> (is English alphabet word?) modeled this signal present in the script.</li>
    </ol>
  </li>
  <li>
    <p><strong>Error Analysis</strong>:</p>

    <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_fscores.png" alt="" style="display:block;text-align:center" />
  <!-- <figcaption style="text-align: center">Figure 1:</figcaption> -->
  </figure>

    <ol>
      <li>Named Entities (NEs): The F-score of named entities is much lower than the F-scores of lang1 and lang2. Reasons for the discrepancy are:
        <ul>
          <li>Lack of accurate NE identification systems.</li>
          <li>Lexicon features only available for English and Spanish.</li>
          <li>Informal nature of the sentences - not capitalized or spelled properly.</li>
        </ul>
      </li>
      <li>We can ignore the <code class="language-plaintext highlighter-rouge">ambiguous</code> and <code class="language-plaintext highlighter-rouge">mixed</code> class errors because of their rarity in the datasets. They don’t contribute much to the accuracy of the system.</li>
      <li>Reasons for the lower accuracy of Ar-Ar pair (<code class="language-plaintext highlighter-rouge">lang1</code>: Arabic; <code class="language-plaintext highlighter-rouge">lang2</code>: Dialectal Arabic):
        <ul>
          <li>The model was only trained on context and word features (and not on lexicon or character n-grams.)</li>
          <li>Fewer words of dialectal Arabic in both training and testing: couldn’t train a reliable model.</li>
          <li>Due to distributional skew, the model learned to label the tokens as <code class="language-plaintext highlighter-rouge">lang1</code> with a high probability.</li>
          <li>F-score for <code class="language-plaintext highlighter-rouge">lang2</code> is 15.8%, and that of <code class="language-plaintext highlighter-rouge">lang1</code> is 94.2%: this shows that the identification of <code class="language-plaintext highlighter-rouge">lang2</code> was full of errors.
            <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-11/post3_ar_ar.png" alt="" style="display:block;text-align:center" />
  <figcaption style="text-align: center">Distribution of classes in Ar-Ar released data sets.</figcaption>
  </figure>
          </li>
        </ul>
      </li>
      <li>There was a drop in accuracy for En-Ne pair (<code class="language-plaintext highlighter-rouge">lang1</code>: English; <code class="language-plaintext highlighter-rouge">lang2</code>: Nepali) in surprise data when compared to training and test data. Possible reasons are:
        <ul>
          <li>The difference in class distribution</li>
          <li>Genre/style of two datasets: surprise data contained song titles of Nepali songs. Many words were labeled as <code class="language-plaintext highlighter-rouge">lang2</code> by the system, but gold labels identified them as NEs. This classification error was debatable.</li>
          <li>Results were obtained on only 1,087 tokens, which can’t be used to make any strong claims or conclusions.</li>
        </ul>
      </li>
    </ol>
  </li>
</ul>

<h3 id="insightsthoughts-on-the-paper">Insights/Thoughts on the paper</h3>

<ol>
  <li>They formulated the problem as a <strong>sequence labeling problem</strong>. Can it be defined in some other way?</li>
  <li>The problem becomes <strong>trivial</strong> if languages present in the document do not share the character set. I discussed this in the first paper in this series, and we can see it in action in the current work. For the pair En-Cn, because the Chinese words are present in the original script, it became effortless for the model to identify the language.</li>
  <li>The authors performed a lot of <strong>feature engineering</strong> (token context, lexicon, derived from the tokens, n-grams) and exhaustively evaluated the models on those features for each of the language pairs. They picked the feature set utilized in the previous research work and built more on top of it. They found the importance of different features for different languages.</li>
  <li>The employed model is not complex, and the features are simple and easily derived from the tokens themselves. The training data used was also small. As the authors pointed out, the developed methods are <strong>easily replicable</strong> for other languages. They are <strong>scalable</strong> to label large amounts of data in production systems too.</li>
  <li>The system should have <strong>high throughput</strong>, as the classification is built with simple features and compute-efficient models.</li>
  <li>The described process is <strong>supervised</strong> because:
    <ul>
      <li>The training data employed were annotated.</li>
      <li>It also relies on the lexicon of languages and is subject to the availability of those lexicon resources. Wherever the lexicon resources were not available, it hurt the performance of the classifiers.</li>
    </ul>

    <p>There was little data available for training. Also, many of the considered languages have many speakers and thus have a lot of language resources available. What about low-resource languages? In the last two papers, authors had used monolingual corpora to train their models, and hence there was no need for annotated data, but here the situation is different. Annotating data for training purposes is prohibitively expensive. <strong>Can we cheaply create the annotated data?</strong></p>
  </li>
  <li>Detecting <strong>Named Entities (NEs)</strong> was the most challenging problem faced by researchers. The authors in the <a href="/2020/10/lang-id-1/">first paper</a> suggested that if we create a separate class for NEs, then the accuracy of the system should improve. In the current work, the authors did this. It seems like the accuracy has improved, but there were still errors in differentiating between a language token and a NE. Here also, the authors suggest that a <strong>better NE detection system</strong> will improve the system.</li>
  <li>The researchers reported their <strong>system accuracies</strong> for each of the language pairs in clear terms. There was no ambiguity in the language used to describe the methods and the results.</li>
  <li>The authors <strong>only considered the five languages</strong> which were part of the task. Since the proposed methods are language agnostic, we can extend them for other languages as well. I think it is easier said than done for the following reasons:
    <ul>
      <li>There should be annotated data available for the other languages, which is not always guaranteed as it’s expensive to do the annotation.</li>
      <li>The lexicon resources are also not guaranteed to be available. We saw the effect of this on the classification accuracy of the Ar-Ar language pair. It would have impacted the accuracy of En-Cn as well if the script had been the same.</li>
      <li>If the authors had considered transliterated code-switched text, they would never have found the lexicon resources for those languages.</li>
    </ul>
  </li>
  <li>Since the shared task description aimed to label the tweets, this methodology is suitable for <strong>short-text-like content</strong> generated on social media. However, the researchers didn’t discuss or work on challenges like word normalization (misspellings or acronyms) that often pervade social media text.</li>
  <li>The data and language pairs considered by the authors contained <strong>code-mixed data, but not transliterated data</strong> except maybe the En-Ne language pair. The presented methods have worked well for the code-switched data without adding any other complexities in the mix.</li>
  <li>Any discussions on the <strong>transliteration</strong> were completely skipped.</li>
</ol>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="Publication" /><summary type="html"><![CDATA[I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.]]></summary></entry><entry><title type="html">Lang. Identification in Code Mixed Text - Lit. Review 2</title><link href="https://trigonaminima.github.io/2020/10/lang-id-2/" rel="alternate" type="text/html" title="Lang. Identification in Code Mixed Text - Lit. Review 2" /><published>2020-10-23T00:00:00+00:00</published><updated>2020-10-23T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/10/lang-id-2</id><content type="html" xml:base="https://trigonaminima.github.io/2020/10/lang-id-2/"><![CDATA[<p>I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.</p>

<p><strong>What is Language Identification, you ask?</strong></p>

<p><a href="https://en.wikipedia.org/wiki/Code-switching">Code-switching</a> (and code-mixing) is the use of two or more languages in a conversation often employed by multilingual users in informal media like personal chats, Twitter, Facebook, Reddit. Language identification in the code-mixed text is the process of labeling each word with the language it belongs to. For example, in a <a href="/2018/06/hinglish-and-transliteration/">Hinglish text</a> (code-switching between Hindi and English), a sentence like <code class="language-plaintext highlighter-rouge">Hindi ke liye it makes no sense</code> should be labeled as <code class="language-plaintext highlighter-rouge">Hindi\Hi ke\Hi liye\Hi it\En makes\En no\En sense\En</code>.</p>

<p><br /></p>
<hr />

<p><br /></p>

<ol>
  <li>2013 - <a href="/2020/10/lang-id-1/">Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods</a></li>
  <li>2013 - Query word labeling and Back Transliteration for Indian Languages: Shared task system description (*this post*)</li>
  <li>2014 - <a href="/2020/11/lang-id-3/">Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System</a></li>
</ol>

<p><br /></p>
<hr />

<p><br /></p>

<h3 id="publication">Publication</h3>

<p><a href="https://www.microsoft.com/en-us/research/publication/query-word-labeling-and-back-transliteration-for-indian-languages-shared-task-system-description/">Query word labeling and Back Transliteration for Indian Languages: Shared task system description</a></p>

<h3 id="summary">Summary</h3>

<ul>
  <li>The objective is to identify the <strong>word level language</strong> for Indian languages written in Roman script mixed with the English language.</li>
  <li>They have <strong>back-transliterated</strong> Indian language words into the native Indic scripts.</li>
  <li>The methodology is to build <strong>weakly-supervised models</strong> with monolingual samples together with word frequency and context switching probability from Indian language (IL) to English (Eng).</li>
  <li><a href="http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/">FIRE-2013 shared task on Transliterated Search</a> targets labeling the individual words of a query with their original language and then also asks to back-transliterate non-English words to their native scripts. The proposed system achieved the <strong>best performing results</strong> in this shared task.</li>
</ul>

<h3 id="key-challenges-addressed">Key Challenges Addressed</h3>

<ul>
  <li><strong>Indic languages</strong>: There is no previous work on building systems for query labeling of text written in an Indian language code-mixed with English.</li>
  <li><strong>Transliterated data</strong>: The real challenge comes when the text is in Roman transliterated form, and we are required to label and back-transliterate it. To my knowledge, this is the first paper that tries to tackle both the challenges and create one end-to-end system.</li>
  <li><strong>Sparse training data</strong>: Getting the annotated multi-lingual data is expensive and hence not readily available to the researchers. This work tries to solve that by creating models from small mono-lingual transliterated datasets and then using them on the actual multi-lingual data.</li>
</ul>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2020-10/post2_objective.png" alt="" style="display:block;text-align:center" />
</figure>

<h3 id="constraintsassumptions">Constraints/Assumptions</h3>

<ul>
  <li>All of the languages considered in the publication - Hindi, Gujarati, Bangla - belong to the Indo-Aryan family. This leads to the assumption that the words are <strong>pronounced similarly</strong> in all three languages.</li>
  <li>Since the model training requires <strong>monolingual data in the transliterated form</strong>, we assume we have that kind of training data. In this work, task organizers provided this data.</li>
  <li>Since the two languages present in the documents are known, models are aware of the <strong>two languages a priori.</strong></li>
</ul>

<h3 id="dataset">Dataset</h3>

<p>The three Asian languages covered by the researchers were - Hindi, Gujarati, Bangla.</p>

<ul>
  <li><strong>Training Data</strong>
    <ul>
      <li>Monolingual samples</li>
      <li>Word frequency for all three languages</li>
      <li>Roman transliterations provided as a part of FIRE-2013 shared task</li>
    </ul>
  </li>
  <li><strong>Evaluation Data</strong>
    <ul>
      <li>Labeled code-mixed queries for each language provided in FIRE-2013 shared task</li>
    </ul>
  </li>
</ul>

<h3 id="methodology">Methodology</h3>

<h4 id="baseline-setup">Baseline Setup</h4>

<ul>
  <li><strong>Objective</strong>: Language identification of each word in a query</li>
  <li><strong>Classes</strong>: They trained a separate binary classifier for each of the three Indian languages, with two classes: <code class="language-plaintext highlighter-rouge">LI</code> (Indian Language) and <code class="language-plaintext highlighter-rouge">Eng</code>.</li>
  <li><strong>Training Data</strong>: Monolingual samples of the languages</li>
  <li><strong>Feature Engineering</strong>:
    <ol>
      <li>Character unigrams</li>
      <li>Character bigrams</li>
      <li>Character trigrams</li>
      <li>Character 4-grams</li>
      <li>Character 5-grams</li>
      <li>Full word</li>
    </ol>
  </li>
  <li><strong>Feature Selection</strong>:
    <ul>
      <li>Using all available features - {1,2,3,4,5}-grams, and the complete word - gave the best accuracy.</li>
      <li>For brevity, they didn’t present the evaluation results.</li>
    </ul>
  </li>
  <li><strong>Model Selection</strong>:
    <ul>
      <li>Models considered: Naive Bayes, Maximum Entropy (Logistic Regression), and Decision tree.</li>
      <li><a href="http://mallet.cs.umass.edu/index.php">MALLET</a> was used to train all the three classifiers on the training data for each Indian language using the best set of features determined in the feature selection step.</li>
    </ul>
  </li>
  <li><strong>Model Evaluation</strong>:
    <ul>
      <li>The evaluation metric is the <strong>accuracy</strong>.</li>
      <li>Models were evaluated for varying training sizes from 100 to 5000.</li>
      <li>For Hindi and Gujarati, the best model was <strong>Maximum Entropy</strong>.</li>
      <li><strong>Naive Bayes</strong> performed better for Bangla.</li>
      <li>
        <p>Shown below is the plot of learning curves for all three models as the training size changes.</p>

        <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-10/post2_lid1.png" alt="Learning curves for NaiveBayes, MaxEnt and DecisionTree on word labeling for Hindi, Gujarati and Bangla language on development data" style="display:block;text-align:center" />
  </figure>
      </li>
    </ul>
  </li>
</ul>
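
<p>The feature set above is simple enough to sketch in a few lines of Python. This is only an illustration: the function name <code class="language-plaintext highlighter-rouge">char_ngram_features</code> and the feature-string format are my own, and the paper itself relied on MALLET rather than hand-written code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def char_ngram_features(word, max_n=5):
    """Return character 1..max_n-grams plus the full word as string features."""
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the word; a word shorter than n
        # contributes no n-grams of that size.
        for i in range(len(word) - n + 1):
            features.append(f"{n}gram={word[i:i + n]}")
    # The complete word is the sixth feature type.
    features.append(f"word={word}")
    return features


print(char_ngram_features("liye"))
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">liye</code>, this yields four unigrams, three bigrams, two trigrams, one 4-gram, no 5-gram, and the word itself.</p>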

<h4 id="improving-baseline">Improving Baseline</h4>

<ul>
  <li><strong>Feature Engineering</strong>: In addition to the {1,2,3,4,5}-grams and the whole word as features, they tested the effect of the following two factors on the classification accuracy of the baseline:
    <ol>
      <li>Context switch probability</li>
      <li>Monolingual frequency factor</li>
    </ol>
  </li>
  <li><strong>Frequency-based Classification</strong>: After classifying the token language using the baseline method, get the final label using the below rules-
    <ol>
      <li>Label as <code class="language-plaintext highlighter-rouge">LI</code>, if the classification was <code class="language-plaintext highlighter-rouge">Eng</code>, the confidence of word being English is greater than or equal to \(\max({0.98p, 0.98})\) (\(p\) is context switching probability) and the word frequency is less than 20.</li>
      <li>Label as <code class="language-plaintext highlighter-rouge">LI</code>, if the classification was <code class="language-plaintext highlighter-rouge">Eng</code>, the length of the word is 2 and the word frequency is less than 50.</li>
      <li>Label as <code class="language-plaintext highlighter-rouge">Eng</code> if the word contains special characters (e.g. &amp;), numerals (e.g. 3D) or the word is all in capitals (e.g. MBA).</li>
      <li>For all the remaining cases, the label stays the same as what the baseline classifier gave.</li>
    </ol>

\[\begin{equation}
          C'(w) = \begin{cases}
                  \mathrm{LI}, &amp; C(w) = \mathrm{Eng} \text{ and } \mathrm{conf}(E, w) \ge \max(0.98p, 0.98) \text{ and } \mathrm{freq}(w) &lt; 20\\
                  \mathrm{LI}, &amp; C(w) = \mathrm{Eng} \text{ and } \mathrm{length}(w) = 2 \text{ and } \mathrm{freq}(w) &lt; 50\\
                  \mathrm{Eng}, &amp; w \in S \cup N \cup A \\
                  C(w), &amp; \text{otherwise}
          \end{cases}
      \end{equation}\]

    <p>Here, \(C'(w)\) is the updated classifier, \(C(w)\) is the baseline classifier, \(\mathrm{conf}(E, w)\) is the confidence of word \(w\) being English, \(p\) is the context switching probability, \(S\) is the set of words containing special characters, \(N\) is the set of words containing numerals, and \(A\) is the set of words written in all capital letters.</p>

    <p>There was no explanation in the paper as to how they selected the threshold values and the various conditions.</p>
  </li>
  <li><strong>Hyper-parameter Tuning</strong>:
    <ul>
      <li>By varying the context-switching probability \(p\), one can also calculate the optimal context switching probabilities for the model. In the publication, they also show the changes in model accuracies when \(p\) changes from 0.6 to 0.9 for each language.</li>
    </ul>
  </li>
</ul>
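
<p>The four relabeling rules can also be written as a small function. This is a sketch under assumptions: the function and argument names are hypothetical, the rules are tried in the order listed, and I treat any non-alphabetic character as a special character or numeral. The thresholds are the ones quoted from the paper.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def relabel(word, base_label, eng_conf, freq, p):
    """Apply the frequency-based rules to a single token.

    word       -- the token as written in the query
    base_label -- "Eng" or "LI", as predicted by the baseline classifier
    eng_conf   -- classifier confidence that the token is English
    freq       -- word frequency (which frequency list is used is an assumption)
    p          -- context switching probability
    """
    # As written, this threshold equals 0.98 for any p up to 1.
    threshold = max(0.98 * p, 0.98)
    # Rule 1: confident English prediction, but a rare word.
    if base_label == "Eng" and eng_conf &gt;= threshold and freq &lt; 20:
        return "LI"
    # Rule 2: two-letter word that is not very frequent.
    if base_label == "Eng" and len(word) == 2 and freq &lt; 50:
        return "LI"
    # Rule 3: special characters, numerals, or an all-capitals word.
    if any(not ch.isalpha() for ch in word) or word.isupper():
        return "Eng"
    # Rule 4: keep the baseline label.
    return base_label


print(relabel("ke", "Eng", 0.60, 10, 0.8))
print(relabel("zara2", "LI", 0.10, 3, 0.8))
</code></pre></div></div>

<p>Note that a token like <code class="language-plaintext highlighter-rouge">zara2</code> falls under the third rule and comes out as <code class="language-plaintext highlighter-rouge">Eng</code>, which is the kind of labeling error discussed in the error analysis further down.</p>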

<h4 id="back-transliteration">Back-transliteration</h4>

<ul>
  <li><strong>Objective</strong>: Back-transliterate the non-English words of the query to their original script.</li>
  <li><strong>Training Data</strong>
    <ul>
      <li><strong>Hindi and Gujarati</strong>: Hindi word-list created from monolingual data.</li>
      <li><strong>Bangla</strong>: Hindi word-list concatenated with Bangla word-list converted to Hindi using Indic character mapping.</li>
      <li>They couldn’t do the same for Gujarati because very few words were available.</li>
    </ul>
  </li>
  <li><strong>Model Selection</strong>:
    <ul>
      <li><strong>MSRI Name Search tool</strong> based on hash-functions for similarity search across domains. [<a href="https://www.microsoft.com/en-us/research/publication/learning-hash-functions-for-cross-view-similarity-search/">described in the linked research</a>]</li>
    </ul>
  </li>
  <li><strong>Inference</strong>:
    <ul>
      <li><strong>Hindi</strong>:
        <ul>
          <li>The MSRI tool directly gives the Hindi transliteration.</li>
        </ul>
      </li>
      <li><strong>Gujarati</strong>:
        <ul>
          <li>Get the Hindi transliteration using the MSRI tool.</li>
          <li>Convert Hindi transliteration to Gujarati using Indic character mapping.</li>
        </ul>
      </li>
      <li><strong>Bangla</strong>:
        <ul>
          <li>Get the Hindi transliteration using the MSRI tool.</li>
          <li>Convert Hindi transliteration to Bangla using Indic character mapping.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
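
<p>The paper doesn’t spell out how the Indic character mapping works. One plausible approximation, shown here purely as an assumption, relies on the fact that the Devanagari, Bengali, and Gujarati Unicode blocks follow a largely parallel layout, so most letters can be moved between scripts by shifting the code point by the block offset. Some positions have no counterpart in the target block, so a real system would need an exception table on top of this. The function and constant names are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># First code point of each script's Unicode block (each block is 0x80 wide).
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gujarati": 0x0A80}
BLOCK_SIZE = 0x80


def map_from_devanagari(text, target):
    """Shift Devanagari characters into the target script's Unicode block."""
    source_start = BLOCK_START["devanagari"]
    offset = BLOCK_START[target] - source_start
    out = []
    for ch in text:
        if ord(ch) - source_start in range(BLOCK_SIZE):
            # Same relative position, different block.
            out.append(chr(ord(ch) + offset))
        else:
            # Leave spaces, digits, and Latin letters untouched.
            out.append(ch)
    return "".join(out)


print(map_from_devanagari("नमस्ते", "gujarati"))
print(map_from_devanagari("नमस्ते", "bengali"))
</code></pre></div></div>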

<h3 id="final-systems">Final Systems</h3>

<h4 id="msri-1">MSRI-1</h4>

<ul>
  <li><strong>Language Label Inference</strong>: classifier output + frequency factor</li>
  <li><strong>Back-transliteration</strong>: trained on Hindi data</li>
</ul>

<h4 id="msri-2-best-performer">MSRI-2 (Best Performer)</h4>

<ul>
  <li><strong>Language Label Inference</strong>: classifier output + frequency factor + context-switch probability</li>
  <li><strong>Back-transliteration</strong>: trained on Hindi data</li>
</ul>

<h4 id="msri-3">MSRI-3</h4>

<ul>
  <li><strong>Language Label Inference</strong>: classifier output + frequency factor + context-switch probability</li>
  <li><strong>Back-transliteration</strong>:
    <ul>
      <li>For Hindi and Gujarati, the model is trained on Hindi data</li>
      <li>For Bangla, the model is trained on Hindi+Bangla data</li>
    </ul>
  </li>
</ul>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2020-10/post2_dev_res.png" alt="Experiments on the development set" style="display:block;text-align:center" />
<figcaption style="text-align: center">Experiments on Development Set</figcaption>
</figure>

<figure class="image">
<img src="https://trigonaminima.github.io/assets/2020-10/post2_test_res.png" alt="Results on the test set" style="display:block;text-align:center" />
<figcaption style="text-align: center">Results on Test Set</figcaption>
</figure>

<ul>
  <li>LA: Labeling accuracy</li>
  <li>Prob: Context probability parameter</li>
  <li>TF: Transliteration F-score</li>
  <li>TQM: % of queries that had exact labeling</li>
</ul>

<h4 id="error-analysis">Error Analysis</h4>

<p><strong>Labeling Errors</strong></p>

<ul>
  <li>Not enough n-gram information
    <ul>
      <li>Short words (I, ve);</li>
      <li>Ambiguous words (the, ate);</li>
      <li>Erroneous words (emosal).</li>
    </ul>
  </li>
  <li>Arbitrary rule
    <ul>
      <li>Treating all mixed-numeral tokens as English words (zara2, duwan2)</li>
    </ul>
  </li>
</ul>

<p><strong>Back-transliteration Errors</strong></p>

<ul>
  <li>Phonological variations exhibited by Gujarati and Bangla when compared to Hindi
    <ul>
      <li>In Bangla, <code class="language-plaintext highlighter-rouge">a</code> is frequently pronounced as <code class="language-plaintext highlighter-rouge">o</code>.</li>
      <li>In Gujarati, <code class="language-plaintext highlighter-rouge">na</code> at the end of a word is sometimes pronounced as <code class="language-plaintext highlighter-rouge">nna</code>.</li>
      <li>Here, the model trained on Hindi data fails due to slight phonetic differences between Hindi and other languages.</li>
    </ul>
  </li>
  <li>Errors in the development set</li>
</ul>

<h3 id="insightsthoughts-on-the-paper">Insights/Thoughts on the paper</h3>

<ol>
  <li>The problem was formulated as a <strong>sequence labeling problem</strong>. Can it be defined in some other way?</li>
  <li>The problem becomes <strong>trivial</strong> if languages present in the document do not share the character set. In the work, the researchers have tackled the non-trivial problem where, even though the languages’ original scripts are different, the use of transliteration while code-switching makes the script the same.</li>
  <li>The authors didn’t conduct a lot of <strong>feature engineering</strong>. They just selected the features proposed in the previous research work and built their systems.</li>
  <li>The model achieved a decent accuracy with small-sized training data. I think the system should be <strong>easily scalable</strong> in the production environment for multiple languages.</li>
  <li>The system should have <strong>high throughput</strong>, as the classification is built with simple and compute efficient models, and the back-transliteration system is based on the hash function based similarity search.</li>
  <li>The described process is <strong>weakly-supervised</strong>:
    <ul>
      <li>Models are trained on two <strong>monolingual</strong> example texts, thus only learning to classify a word into one of the two languages.</li>
      <li>Any sequential dependencies between words must be learned by the model on its own because of the lack of any particular features which might inform the model about the dependencies.</li>
      <li>The inference is performed on the sequences from a multilingual document.</li>
    </ul>
  </li>
  <li>Handling <strong>Named Entities (NEs)</strong>: In the previous paper, authors observed many errors because of the wrong labeling of NEs. However, in the current research, the authors haven’t mentioned this issue. Possible reasons-
    <ul>
      <li>Since Indic names have associated meanings and are a part of the vocabulary, the monolingual training data might contain many of the NEs present in the development set provided by the task organizers.</li>
      <li>In the development set provided, all the language-specific names (Indian names) may have been labeled as Indic language words, and thus, the model also learned this pattern.</li>
      <li>The data may not contain many non-Indic NEs like foreign names and places.</li>
    </ul>
  </li>
  <li>The researchers reported the <strong>model performance</strong> results for <strong>each language</strong>, making the adopted methodology lucid. It was not the case in the previous paper I reviewed.</li>
  <li>The authors <strong>only covered the three languages</strong> which were part of the task. They skipped the discussion of extending the methods for multiple languages where languages might not be similar in pronunciation or not have enough data.</li>
  <li>Since the shared task description aimed to label the search queries, this methodology is suitable for <strong>short-text-like content</strong> generated on social media. Although the language distribution might differ between these queries and social media text, the methods presented here should still work, as the models are weakly-supervised, and there is no dependency on the distribution or the context of words. The researchers didn’t discuss or work on challenges like word normalization (misspellings or acronyms) that often pervade search queries and other such short-form text.</li>
  <li>Queries (or documents) in the dataset given by the shared task organizers only focused on <strong>code-mixed data</strong> in Latin script. In the previous paper review, I had my doubts about whether that method would work properly on code-mixed text. This work employs the same method and, with a few modifications, it was quite successful on code-mixed text.</li>
  <li>The <strong>evaluation of the back-transliteration</strong> method wasn’t discussed. With the assumption that there is a pronunciation similarity between the languages considered in this work, the process of back-transliteration became a bit easier: Hindi is the base language, and the other two languages are back-transliterated from the Hindi back-transliterations. This process can’t be adopted when other languages vary a lot in pronunciation or vocabulary. We’ll need to create new methods of doing the back-transliteration. The authors reported the errors the MSRI name search back-transliteration system made, but they didn’t discuss how accurate the system was.</li>
</ol>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="Publication" /><summary type="html"><![CDATA[I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.]]></summary></entry><entry><title type="html">Lang. Identification in Code Mixed Text - Lit. Review 1</title><link href="https://trigonaminima.github.io/2020/10/lang-id-1/" rel="alternate" type="text/html" title="Lang. Identification in Code Mixed Text - Lit. Review 1" /><published>2020-10-08T00:00:00+00:00</published><updated>2020-10-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/10/lang-id-1</id><content type="html" xml:base="https://trigonaminima.github.io/2020/10/lang-id-1/"><![CDATA[<p>I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.</p>

<p><strong>What is Language Identification, you ask?</strong></p>

<p><a href="https://en.wikipedia.org/wiki/Code-switching">Code-switching</a> (and code-mixing) is the use of two or more languages in a conversation often employed by multilingual users in informal media like personal chats, Twitter, Facebook, Reddit. Language identification in the code-mixed text is the process of labeling each word with the language it belongs to. For example, in a <a href="/2018/06/hinglish-and-transliteration/">Hinglish text</a> (code-switching between Hindi and English), a sentence like <code class="language-plaintext highlighter-rouge">Hindi ke liye it makes no sense</code> should be labeled as <code class="language-plaintext highlighter-rouge">Hindi\Hi ke\Hi liye\Hi it\En makes\En no\En sense\En</code>.</p>

<p><br /></p>
<hr />

<p><br /></p>

<ol>
  <li>2013 - Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods (<em>this post</em>)</li>
  <li>2013 - <a href="/2020/10/lang-id-2/">Query word labeling and Back Transliteration for Indian Languages: Shared task system description</a></li>
  <li>2014 - <a href="/2020/11/lang-id-3/">Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System</a></li>
</ol>

<p><br /></p>
<hr />

<p><br /></p>

<h3 id="publication">Publication</h3>

<p><a href="https://www.aclweb.org/anthology/N13-1131/">Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods</a></p>

<h3 id="summary">Summary</h3>

<ul>
  <li>It formulated the task of language identification in a multi-lingual document as a <strong>sequence labeling problem</strong>.</li>
  <li>It used <strong>weakly supervised or semi-supervised models</strong> for sequence labeling due to the scarcity of available data.</li>
  <li>They found <strong>CRF</strong> trained with Generalized Expectation (GE) criteria performing the best.</li>
  <li>The experiments showed that the best predictors were <strong>{1,2,3,4,5}-grams</strong> and the individual word.</li>
  <li>It is the <strong>first research work</strong> that discussed the identification of languages in multi-lingual documents.</li>
</ul>

<h3 id="key-challenges-addressed">Key Challenges Addressed</h3>

<ul>
  <li><strong>Minority languages</strong>: While building language resources for the minority (or low-resource) languages using webpages, the authors noticed that the majority of the webpages that contained text in a minority language also contained text in other languages. It was problematic because the data collection method they were using to build the resources would have also scraped multi-lingual documents into one corpus for a single language and would have led to incorrect and noisy downstream language resources.</li>
  <li><strong>Sparse or impoverished training data</strong>: Minority languages lack proper digital presence and thus, in turn, have less annotated training data for the language identification systems to train on. The authors have tackled this issue with the use of weakly supervised models trained on small datasets (a few thousand examples), which don’t require a lot of labeled data.</li>
  <li><strong>Multilingual documents</strong>: With the increase of multilingual users on the internet, a lot more digital content is generated where multiple languages are used, albeit with varying levels of language usage. Identification should work well even in documents with an imbalanced language distribution. For instance, a document with 3% French, 95% English, and 2% Italian.</li>
</ul>

<h3 id="constraintsassumptions">Constraints/Assumptions</h3>

<ul>
  <li>To <em>keep the manual annotations reliable</em>, they only considered the documents containing words only in two languages.</li>
  <li>Since the two languages present in the documents are known, models are also aware of the <em>two languages a priori.</em></li>
  <li>Generally, for sequence labeling tasks, we can assume each <em>sequence</em> (not individual tokens of a sequence) as <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">iid</a> on some distribution common to all documents, but we cannot assume this in language identification as one document might have 90% of its words in Hindi while another might only have 20%. Thus authors have made the simplified assumption that sequences <em>within a document</em> are iid. So, the authors <em>considered each token within a sentence as iid</em>.</li>
  <li>The problem becomes trivial when scripts of languages are different, and hence only the languages with a <em>Latin orthography</em> are chosen.</li>
  <li>The only training data is a small amount of <em>monolingual text</em> for each language. It is assumed that there are no annotated sequences available.</li>
</ul>

<h3 id="dataset">Dataset</h3>

<p>Following are the 30 languages covered by the researchers divided by continents:</p>

<ol>
  <li><strong>Africa</strong>: Lingala, Malagasy, Oromo, Pular, Fulfulde, Somali, Hausa, Sotho, Tswana, Igbo, Yoruba, Zulu</li>
  <li><strong>Europe</strong>: Azerbaijani, Lombard, Basque, Cornish, Croatian, Czech, Serbian, Faroese, Slovak, Hungarian</li>
  <li><strong>Asia</strong>: Azerbaijani, Banjar, Cebuano, Uzbek, Kurdish</li>
  <li><strong>North America</strong>: Nahuatl, Chippewa, Ojibwa</li>
  <li><strong>Australia</strong>: Kiribati</li>
</ol>

<h4 id="training-data">Training Data</h4>

<p>They collected the monolingual samples for all the 30 languages from the following four sources:</p>

<ol>
  <li><a href="https://www.unicode.org/udhr/">The Universal Declaration of Human Rights</a></li>
  <li><a href="https://meta.wikimedia.org/wiki/List_of_Wikipedias">Non-English Wikipedias</a></li>
  <li><a href="https://www.jw.org/">The Jehovah’s Witnesses website</a></li>
  <li><a href="https://rosettaproject.org/">The Rosetta project</a></li>
</ol>

<p>To mitigate the varying availability of tokens across the language samples while creating the training data, an equal number of words was sampled from English and from each second language.</p>

<h4 id="evaluation-data">Evaluation Data<a href="eval_data"></a></h4>

<ul>
  <li>
    <p><strong>Data Collection</strong>: The BootCat Tool collects webpages on a language by repeatedly searching for the language-specific seed words. Out of all the collected documents, they manually retained only those webpages which just had English and one of the above languages. The dataset is available at: <a href="http://www-personal.umich.edu/~benking/resources/mixed-language-annotations-release-v1.0.tgz">mixed-language-annotations-release-v1.0</a>.</p>
  </li>
  <li>
    <p><strong>Cleaning</strong></p>

    <ol>
      <li>Stripped of HTML</li>
      <li>Converted to utf-8</li>
      <li>Replaced HTML escape sequences with utf-8 characters</li>
      <li>Discarded the documents with encoding errors (mixed encodings)</li>
    </ol>
  </li>
  <li>
    <p><strong>Annotation</strong>: Following are the annotation rules mentioned in the paper:</p>

    <ul>
      <li>Well-digested English loanwords and borrowings -&gt; foreign language;</li>
      <li>Ordinary proper names (like “John Williams” or “Chicago”) -&gt; language in the context;</li>
      <li>Abbreviations (like “FIFA” or “BBC”) -&gt; language in the context;</li>
      <li>Names composed of common nouns (like “Stairway to Heaven” or “American Red Cross”) -&gt; language of the words they were in;</li>
      <li>Abbreviations that spelled out English words -&gt; language of the words they were in;</li>
      <li>If language use was ambiguous -&gt; annotator’s best guess;</li>
      <li>Numbers or punctuation -&gt; no label.</li>
    </ul>

    <p>The <strong>average inter-annotator agreement</strong> on a few hundred words from each of eight documents was 0.988, with 0.5 agreement expected by chance, for a kappa of 0.975.</p>
  </li>
</ul>
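
<p>The kappa figure quoted above can be checked against the usual definition of Cohen’s kappa, where \(p_o\) is the observed agreement and \(p_e\) is the agreement expected by chance. Plugging in the rounded values:</p>

\[\begin{equation}
      \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.988 - 0.5}{1 - 0.5} = 0.976
  \end{equation}\]

<p>The small gap to the reported 0.975 presumably comes from rounding in the reported agreement figure. That is my reading, not something the paper states.</p>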

<h3 id="methodology">Methodology</h3>

<h4 id="baseline-setup">Baseline Setup</h4>

<ul>
  <li><strong>Objective</strong>: Language identification (through classification) of a word, ignoring the fact that it is the part of a sequence. Using this method, the authors want to evaluate various features and classifiers.</li>
  <li><strong>Training Data</strong>: Uniformly sampled 1000 words with replacement for appropriate language.</li>
  <li><strong>Eval Data</strong>: It is not mentioned what data was used to evaluate these models: either the annotated data itself or other sets of uniformly sampled words from selected languages. Most likely, it’s the annotated data created in the evaluation data section above.</li>
  <li><strong>Classes</strong>: Again, it is not clear from the text if they trained a single classifier to classify into all the languages or created 30 different binary classifiers. Although, based on the <em>a priori knowledge of two languages</em> assumption, I think it’s the latter with two classes being <code class="language-plaintext highlighter-rouge">english</code> and <code class="language-plaintext highlighter-rouge">foreign lang</code>.</li>
  <li><strong>Feature Engineering</strong>:
    <ol>
      <li>character unigrams</li>
      <li>character bigrams</li>
      <li>character trigrams</li>
      <li>character 4-grams</li>
      <li>character 5-grams</li>
      <li>full word</li>
    </ol>
  </li>
  <li><strong>Feature Selection</strong>:
    <ul>
      <li>They trained a logistic regression classifier to find the best set of features.</li>
      <li>Using all available features - {1,2,3,4,5}-grams, and the word - gave the best accuracy score of 0.88 on the data.</li>
    </ul>
  </li>
  <li><strong>Model Evaluation</strong>:
    <ul>
      <li>The evaluation metric is the <strong>accuracy</strong>.</li>
      <li>Models were evaluated for varying training sizes from 10 to 1000.</li>
    </ul>
  </li>
  <li><strong>Model Selection</strong>:
    <ul>
      <li>Models considered: Logistic Regression, Naive Bayes, Decision tree, and Winnow2.</li>
      <li><a href="http://mallet.cs.umass.edu/index.php">MALLET</a> was used to train all the above four classifiers on the training data using the best set of features determined in the feature selection step.</li>
      <li>
        <p>The best model was <strong>naive Bayes</strong> with the performance shown in the below plot of learning curves for all the four models as the training size changes from 10 to 1000.</p>

        <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-10/lang_id_model_accuracy_1.png" alt="Model Accuracy for independent word-level language classification" style="display:block;text-align:center" width="400" />
  </figure>
      </li>
    </ul>
  </li>
</ul>
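
<p>The sampling step for the baseline can be sketched as below. It is a minimal sketch under assumptions: I sample uniformly over a list of word tokens, the function names and the fixed seed are hypothetical, and the label strings simply reuse the two class names guessed above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def sample_training_words(words, k=1000, seed=0):
    """Uniformly sample k words with replacement from a monolingual word list."""
    rng = random.Random(seed)
    return rng.choices(words, k=k)


def balanced_training_set(english_words, foreign_words, k=1000, seed=0):
    """Return (word, label) pairs with k sampled words per language."""
    eng = [(w, "english") for w in sample_training_words(english_words, k, seed)]
    other = [(w, "foreign lang")
             for w in sample_training_words(foreign_words, k, seed + 1)]
    return eng + other


data = balanced_training_set(["the", "of", "and"], ["ke", "liye"], k=5)
print(data)
</code></pre></div></div>

<p>Sampling with replacement means even a tiny monolingual sample can fill the requested training size, at the cost of repeated words.</p>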

<h4 id="main-task-solution">Main Task Solution</h4>

<ul>
  <li><strong>Objective</strong>: Language identification of each word in a sequence of words from a document.</li>
  <li><strong>Training Data</strong>: Monolingual labeled data for English and the selected foreign language.</li>
  <li><strong>Eval Data</strong>: Hand annotated sequences from the multilingual documents (discussed <a href="#evaluation-data">here</a>).</li>
  <li><strong>Classes</strong>: Not mentioned, but I guess, they trained 30 different models for each foreign language with binary class classification into two categories: <code class="language-plaintext highlighter-rouge">english</code> and <code class="language-plaintext highlighter-rouge">foreign lang</code>.</li>
  <li><strong>Preprocessing Steps</strong>:
    <ol>
      <li>Word boundaries were defined by <em>punctuation</em> or <em>whitespace</em></li>
      <li>Excluded tokens containing a digit</li>
    </ol>
  </li>
  <li><strong>Feature Engineering</strong>: In addition to the {1,2,3,4,5}-grams and the whole word as features, to provide some sequence-relevant information they added the following two features:
    <ol>
      <li>Feature for each possible punctuation/digit between previous and current words.</li>
      <li>Feature for each possible punctuation/digit between current and next words.</li>
    </ol>
  </li>
  <li><strong>Model Selection</strong>:
    <ol>
      <li>Due to limited training data and different nature of training and evaluation data (more labeled monolingual vs. less labeled multilingual), the following weakly and semi-supervised models were implemented:
        <ul>
          <li>Linear Chain CRF trained with Generalized Expectation criteria (<strong>best performer</strong>)</li>
          <li>HMM trained with Expectation-Maximization (EM)</li>
          <li>Logistic Regression trained with Generalized Expectation criteria</li>
        </ul>
      </li>
      <li><a href="http://mallet.cs.umass.edu/index.php">MALLET</a> was used for training.</li>
    </ol>
  </li>
  <li><strong>Model Evaluation</strong>:
    <ul>
      <li>The evaluation metric is the <em>accuracy</em>.</li>
      <li>Models were evaluated for varying training sizes from 10 to 1000.</li>
      <li>Naive Bayes classifier (best performing) from the baseline setup was used to compare all the other models.</li>
      <li>The best model was the CRF trained with GE, which gave a consistent performance. The performance of the other models is shown in the plot of learning curves below as the training size changes from 10 to 1000:
        <figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-10/lang_id_model_accuracy_2.png" alt="Model Accuracy for sequence labeling task in multilingual documents" style="display:block;text-align:center" width="400" />
  </figure>
      </li>
    </ul>
  </li>
  <li><strong>Error Analysis</strong>:
    <ul>
      <li>Named Entity errors, possibly because of the arbitrary rules decided during the annotation</li>
      <li>Shared word errors, possibly because of the arbitrary rules decided during the annotation</li>
      <li>Other remaining errors</li>
    </ul>
  </li>
</ul>
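
<p>The preprocessing and the two sequence-relevant features can be sketched roughly as follows. Everything here is an assumption on my part: the function names, the feature-string format, and the restriction to ASCII punctuation. The character n-gram features are left out for brevity.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import string

# A token is a run of characters that are neither whitespace nor ASCII punctuation.
PUNCT = re.escape(string.punctuation)
TOKEN_RE = re.compile(rf"[^\s{PUNCT}]+")


def is_mark(ch):
    """True for the characters that become gap features: punctuation or digits."""
    return ch.isdigit() or ch in string.punctuation


def sequence_features(text):
    """Return (word, features) pairs, skipping tokens that contain a digit."""
    words = [m for m in TOKEN_RE.finditer(text)
             if not any(ch.isdigit() for ch in m.group())]
    rows = []
    for i, m in enumerate(words):
        prev_end = words[i - 1].end() if i else 0
        next_start = words[i + 1].start() if i + 1 != len(words) else len(text)
        feats = [f"word={m.group()}"]
        # Punctuation or digits between the previous word and this one.
        feats += [f"prev_gap={ch}" for ch in text[prev_end:m.start()] if is_mark(ch)]
        # Punctuation or digits between this word and the next one.
        feats += [f"next_gap={ch}" for ch in text[m.end():next_start] if is_mark(ch)]
        rows.append((m.group(), feats))
    return rows


for word, feats in sequence_features("Hindi ke liye, it makes no sense 2 me!"):
    print(word, feats)
</code></pre></div></div>

<p>The excluded digit token still shows up indirectly: it becomes a gap feature on the words around it.</p>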

<h3 id="insightsthoughts-on-the-paper">Insights/Thoughts on the paper</h3>

<ol>
  <li>The problem was formulated as a <strong>sequence labeling problem</strong>. Can it be defined in some other way?</li>
  <li>The problem becomes <strong>trivial</strong> if languages present in the document do not share the character set. Can we somehow convert our complex problem into a trivial one? There are two reasons for two different languages in a document having the same script:
    <ul>
      <li>Both languages share the same native script.</li>
      <li>One language is transliterated into the script of the other language (usually, the dominant language in the document)</li>
    </ul>

    <p>If it’s the first case, then we have no way of making the problem trivial, but if it’s the second case, then we can back-transliterate the language into its original script, thus making the labeling task trivial. However, a new challenge now arises: identifying which words should be back-transliterated. It requires identifying the language of each word, and we are back to the original problem we started with, unless we have a magic back-transliteration model which just takes a word and gives a legitimate back-transliteration in the correct language. To my knowledge, there is no such model yet.</p>
  </li>
  <li>The authors conducted some amount of <strong>feature engineering</strong>, and also did a thorough evaluation to select the optimal feature set.</li>
  <li>Since decent accuracy was achieved with little training data, this process should be <strong>easily scalable</strong> in a production environment for multiple languages.</li>
  <li>The system should have <strong>high throughput</strong>, as the classification is built with simple features and compute-efficient models.</li>
  <li>How is it <strong>weakly supervised</strong>?
    <ul>
      <li>Models are trained on two <strong>monolingual</strong> example texts thus only learning to classify a word into one of the two languages.</li>
      <li>Any sequential dependencies between words must be learned by the model on its own because of the lack of any particular features which might inform the model about the dependencies.</li>
      <li>The inference is done on the sequences from a multilingual document.</li>
    </ul>
  </li>
  <li>How to properly <strong>handle Named Entities (NEs)</strong>?
    <ol>
      <li>We can create a separate classification category for NEs within our language identification system.</li>
      <li>We don’t evaluate on NEs, but for this, we need to know which tokens represent an NE.</li>
      <li>We use a language-independent Named Entity recognition system in our language identification system.</li>
    </ol>
  </li>
  <li>What is the <strong>model performance for each language</strong> independently? It wasn’t clear from the paper if they trained a single classifier or created 30 different models. If it was just one model, then we also need to study its performance on individual languages. If it’s the latter, then the results presented in the publication are averaged over 30 models, and we should explore each model’s performance.</li>
  <li>
    <p>The authors wanted to cover as many languages as possible for the language identification task, so they specifically talked about the dependence on a priori knowledge of the languages. And because of the dearth of data, only 30 languages were evaluated. <strong>How do we extend the system for thousands of languages where each document only contains 2-3 languages?</strong> One approach could be to first identify the languages present in the document, and then do the sequence labeling with the proposed system trained on only the languages determined in that first step. Here’s the schematic of this two-level language identification system for any set of languages. Document Language Identification is the identification of languages present in the document, and Token Language Identification is the actual task of identifying each token’s language in that document.</p>

    <figure class="image">
 <img src="https://trigonaminima.github.io/assets/2020-10/lid3.png" alt="Two-level language identification system for any set of languages" style="display:block;text-align:center;margin-left: auto;margin-right: auto;" />
 </figure>
  </li>
  <li>The publication mainly presented work on webpages or documents and not on <strong>short-text like content</strong> generated on social media. The overall concept will remain the same there, but the distribution of languages will differ, and challenges like word normalization will also need to be taken care of.</li>
  <li>The corpus used for training and inference only contained documents in two different languages sharing the same native script, Latin. The authors remarked that they saw <strong>rare usage of code-mixing</strong> in these documents. It might be because of the data collection method they used or the languages they chose to work with. I don’t know what proportion of the actual content on the web is code-mixed, but if they had used a code-mixed corpus, there would have been more challenges to solve. Different languages with the same native script might have plenty of unique words; there might not be much sub-word sharing between the vocabularies of the two languages. In code-mixed text, on the other hand, there might be more sub-word sharing, which might lead to poorer performance. The reverse might also happen: because of some other patterns present in code-mixed text, models might perform better.</li>
  <li>Any discussion of <strong>transliteration</strong> was skipped entirely.</li>
</ol>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="Publication" /><summary type="html"><![CDATA[I am reviewing the literature available on language identification for multilingual documents, focusing on Indic languages. I’ll try to cover it in chronological order, but there might be a few misses here and there. After a decent coverage of the research, I expect to have enough understanding to discuss the challenges present in this task in a code-switched setting and its importance.]]></summary></entry><entry><title type="html">Design Thinking and Life</title><link href="https://trigonaminima.github.io/2020/08/design-viz-tool/" rel="alternate" type="text/html" title="Design Thinking and Life" /><published>2020-08-08T00:00:00+00:00</published><updated>2020-08-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/08/design-viz-tool</id><content type="html" xml:base="https://trigonaminima.github.io/2020/08/design-viz-tool/"><![CDATA[<p>At the start of July 2020, I enrolled in a Coursera course on Design Thinking called <a href="https://www.coursera.org/learn/uva-darden-design-thinking-innovation">Design Thinking for Innovation</a>. Today, I’ve finally finished it. This post is a brief description of what I learned and my final submission about the reflection on one of the four design tools covered in the course that I employed in a challenge/problem of my choice.</p>

<h2 id="what-is-design-thinking">What is Design Thinking</h2>

<p>There are two kinds of problems we generally face:</p>

<ol>
  <li>
    <p><strong>Tame problems (also called puzzles)</strong></p>

    <p>These are problems where we just need the right data to reach a solution. In problems like these, the actual objective is known, and all (or the majority) of the stakeholders agree on the objective as well. Usually, we’ll be aware of what the cause-effect relationship is, or it can be easily determined. And since we have a defined problem and know a lot about the details, these kinds of problems often tend to be solved by analytical methods. For example, if we want to see how successful our product is in terms of processing requests, we can easily find out success/failure rates and build a very comprehensive dashboard to judge how successful our offering is.</p>
  </li>
  <li>
    <p><strong>Wicked problems (also called mysteries)</strong></p>

    <p>These problems are the complete opposite of the tame problems. We don’t exactly know the objective we need to focus on. The challenge is so broad that not even the stakeholders agree on the key objectives. We often don’t have the right data, or we have so much data that going through it will be like trying to find a needle in a haystack without knowing what the needle looks like. Analytical methods are poorly equipped to solve these problems, as it’s really hard to determine the cause-effect relationship for them. For example, suppose your product is not doing well and your dashboard shows it through numbers. Now, to improve your product, you need to find out which factors are working for and against it. The stakeholders’ opinions will differ on what the underlying factors could be, and they are only opinions because even the stakeholders don’t know what the problem is. As you can see, because of the vagueness of the problem, it’s really hard to come up with a solution.</p>
  </li>
</ol>

<p>Design thinking is a structured way of tackling wicked problems. In design thinking, <strong>all the focus is on the customer</strong>. It asks us to empathize with the customers as they are the ones who will use our product. A customer could be anyone: public, other teams, executives of your org, executives from the client org, or anyone else who directly interacts with your product. The process of design thinking is based on the following four questions:</p>

<ol>
  <li><strong>What is?</strong>: Study the current reality about the challenge and the customers. You do your research around the idea. You interview customers and experts. From the data gathered, you build a mind map of the complete data. All of this research will guide you in the next stages.</li>
  <li><strong>What if?</strong>: Based on the data collected in the “What is?” stage, you ideate and come up with a large repository of ideas, no matter how wild or crazy they are. At this point, we are only ideating, not judging which idea is good or bad.</li>
  <li><strong>What wows?</strong>: You finalize a few most promising ideas from the “What if?” stage and test your assumptions through cardboard prototypes or digital designs.</li>
  <li><strong>What works?</strong>: Finally, you test your idea in the real world with low-fidelity prototypes. You engage with customers to get to know about what works and what doesn’t and then you pivot or go ahead according to the feedback.</li>
</ol>

<p>Here’s the diagram from the course depicting the above four stages:</p>

<figure class="image">
  <img src="https://trigonaminima.github.io/assets/2020-08/design_thinking_proc.png" alt="design thinking process" style="display:block;text-align:center" />
  <figcaption>&nbsp; Image taken from the <a href="https://www.coursera.org/learn/uva-darden-design-thinking-innovation">Design Thinking for Innovation</a> course slides.</figcaption>
</figure>

<p>There’s a meaning to the divergence and convergence of the lines. At the start, due to the vagueness of the challenge, you are all over the place. Your thoughts are not structured, and you don’t know what to do or how to solve it. The lines are all divergent at this point. As you begin each stage, you want to be broad and open to collect a lot of information, leading to divergent thinking, and once you go past the first half, you start to make sense of the data and become convergent. You get especially divergent in the “What if?” stage, as this is the brainstorming time. You and your team members have to be as broad as possible to think of a lot of possibilities. Then, once you move towards the “What works?” stage, you become focussed on testing your ideas, and thus the lines are much closer than at the start. <em>To put the divergence and convergence in other words, you have to think creatively at the start of each level and gradually move towards analytical thinking as you finish that level.</em></p>

<p>So in essence, Design Thinking is <strong>not</strong> just about creativity. It’s also about having an analytical mindset. Both your left and right minds are, and should be, engaged in problem-solving when you’re employing Design Thinking. Folks who think Design Thinking is just about applying your creativity are mistaken. That is also how one should go about one’s life, isn’t it? Be open and broad enough that you can see a problem from both angles: analytical and creative. What needs figuring out is how to inculcate the habit of approaching life that way.</p>

<h2 id="mindset">Mindset</h2>

<p>Design Thinking gives us a process and a toolset to work through our challenges, but being prepared enough to identify the right opportunities that the process gives us takes a different mindset. Having this mindset is not a sufficient condition, but it is a necessary one. Here are some qualities that might help you judge your mindset:</p>

<ul>
  <li><strong>Customer Empathy</strong>: If you are more empathetic towards your customer, you are more likely to find hidden opportunities. You might miss these opportunities if you don’t consider the customer’s point of view.</li>
  <li><strong>Learning and Growth Mindset</strong>: You think everything you work on is a learning opportunity. If you fail, then you take it as a lesson. If you are successful then you build confidence. If you have this mindset then you’ll not be afraid of failure and taking bets and testing your assumptions. This mindset also gives you the power to accept something if it questions your world view.</li>
  <li><strong>Broad repertoire</strong>: It’s obvious that having a lot of experience gives you an edge. You can apply something that worked in one industry to some other industry where no one has even thought of doing so. The experience is not just about working in that domain or industry; it’s also about reading about them. It’s about meeting new people and talking to them about their industry. It’s about being more open to your world.</li>
</ul>

<p>These are the major headers that the instructor asks you to work towards to build a proper mindset to be able to identify and implement innovative solutions.</p>

<p>Hmmmm. These are the points I have regarding my mindset:</p>

<ul>
  <li>Identify who my customers are and how to develop this “empathy” for them.</li>
  <li>Coding teaches you to have a learning and growth mindset and to learn from your mistakes. Want to see how something works? Tinker with the code and see what the result is. Found an error? Understand what the cause is, find a solution, and write a test for it so that you never have to deal with it in the future. I just have to extend the same teachings to my life. Of course, easier said than done.</li>
  <li>Getting a broad repertoire is a big one. I do a lot of <del>procrastination</del> reading. So, that’s one down. Next is, meeting new people. I do meet a lot of techies. How do I expand my horizon? How do introverts meet new people? Can somebody answer this for me? What? Did you say, I have to get out of my comfort zone? How do I do that? Jokes apart, I think, this has to be ingrained in you, even if you’re an introvert, to go out and just talk to people. I try to do that sometimes. At times, I’ve failed miserably, but sometimes, I’ve been successful too.</li>
</ul>

<p>So far, it seems like Design Thinking teaches you how you should live your life. The ability to solve your business problems is just an added advantage you get with it. That makes Design Thinking sound so intimidating. (But durr ke aage jeet hai, that is, victory lies beyond fear?)</p>

<h2 id="common-design-tools">Common Design Tools</h2>

<p>The following are the five design tools discussed in the course.</p>

<h3 id="1-visualization">1. Visualization</h3>

<p>It’s the action of converting your thoughts, ideas, or concepts into visual form. We perceive information from different channels in different ways. Even when we are creating these visuals, we employ different parts of our brains. These visual forms can be anything: drawing, flow chart, block diagram, photograph, collage, mind map, video, dashboard, graphs, or plots. It can be anything that you can use to represent your idea visually. Visualization is the kind of tool that applies to all the stages of the Design Thinking process.</p>

<h3 id="2-storytelling">2. Storytelling</h3>

<p>Storytelling, like visualization, is also about communication. With good storytelling, you get your points across with less information loss. Everything has a story. A good storyteller connects with the audience and conveys a complex idea in simple terms. I think this tool is more concerned with building your growth mindset. It helps you become a great manager or leader in general. It helps you become more empathetic to your followers. Getting good at storytelling also improves how you communicate your challenges to your audience. It aids the Design Thinking process at every point along the way. Here are three points one should focus on to become a good storyteller:</p>

<ol>
  <li>Know your audience. Your story is built for the audience after all. Know where they come from. Know what they do. Know where they are going after this. Know what they need. Identify a common point using which you can hook them in. It doesn’t have to relate to you, but it has to be relatable to the audience.</li>
  <li>Have a clear sequence of events from start to end. This one is obvious. You’ll lose their focus if you’re jumbling things up.</li>
  <li>Ask questions and then answer them as well. Ask the questions to build the suspense, pull them into your story, and then release the tension by answering the questions. If you don’t answer the questions then the audience is going to be confused at the end, not sure of what to do.</li>
</ol>

<h3 id="3-ethnography-tool">3. Ethnography Tool</h3>

<p>Ethnography Tool is used in the “What is?” stage, the stage where we are collecting all the data about our challenge. Ethnography means the study of people in a particular society by observing them in their natural setting. The Ethnography tool does that. It asks you to study your customers in their natural environment carrying on with their lives. You use some aids to get the observations: like journaling their lives and actions or using projective tools like making a collage to get an insight about things your customers can’t clearly express using words. You need to have a diversity in the customers you are observing so that you have more chances of getting different points of view. Once you’re done with the observation period, you conduct interviews to get your customers’ assumptions and habits. The idea is to gain information to inform your ideas during the brainstorming stage.</p>

<h3 id="4-mind-mapping">4. Mind Mapping</h3>

<p>Mind Mapping is another tool used in the “What is?” stage. Here also, the idea is to get different perspectives that might inform your final ideas and solutions. Here, instead of customers, you involve stakeholders and other people familiar with the challenge. You call everyone into a room and ask them to go through the data you collected from the customers and other research and make note of the thoughts they get. Thoughts can be negative as well as positive, as long as they are capable of impacting your ability to generate new ideas. You ask them to cluster their thoughts into categories and then collaboratively merge the clusters from everyone. Then, within the group, you all discuss and connect the clusters. That final connected graph of insights and observations is the mind map. This mind map is going to be a big help in generating your ideas in the “What if?” stage.</p>

<p>The instructors mentioned getting all the participants in one room, breaking into small groups, and working with whiteboards and sticky notes. I was wondering if this process can be done online, and also with just one person at a time: discuss the thoughts and get the clusters from one person, show the data and the first person’s clusters to the second, and so on till you’ve gone through enough people. I don’t know how effective this would be, as you lose the collaboration between the participants, but at the same time, the next person does have the context of the previous person’s thoughts and clusters.</p>

<h3 id="5-learning-launch">5. Learning Launch</h3>

<p>This design tool is used in the final stage, “What works?”. The objective is to build quick prototypes to be tested in the real world and see how they perform. It is the stage where your assumptions about the customers are tested. You build ideas and the hypotheses associated with each of them, and after launching them, you learn if they were right or wrong. If they were right, you build version 2.0; otherwise, you either pivot or table the idea. Don’t ever trash the idea, as it can be valuable later.</p>

<p><br /></p>

<p>This was all I learned from the course. I have quite a wicked problem to solve where, I think, Design Thinking should work. Let’s see how it goes.</p>

<p>In the end, as an assignment, they asked us to create a document with a reflection on any one of the four design tools other than the Ethnography tool. The reflection needs to be divided into five sections:</p>

<ol>
  <li>Challenge: Describe your challenge, including all relevant information.</li>
  <li>Selection: In your own words, briefly describe the tool you selected (e.g., what it is and why you selected it for your challenge – including any appropriate video lecture references).</li>
  <li>Application: Describe how you applied the tool you selected to your challenge (e.g., what you did and how the tool was applied effectively or ineffectively).</li>
  <li>Insight: Describe the insight you gained from applying the tool you selected to your challenge (e.g., how an insight affected your thinking about the challenge and design thinking more broadly).</li>
  <li>Approach: Describe what you might do differently next time – applying the same tool you selected or a different one – and the reason(s) why.</li>
</ol>

<p>And, here’s my submission on the Visualization tool.</p>

<h2 id="better-communication-of-technical-topics-through-visualization">Better Communication of Technical Topics through Visualization</h2>

<h3 id="challenge">Challenge</h3>

<p>I am a Computer Science engineer (specifically, a Data Scientist), and my work involves drawing insights from data, improving business processes based on those insights, and providing intelligent automation tools built using the data underlying those insights. Was that vague? Well, the job description of a Data Scientist <em>is</em> vague. The industries I’m currently associated with are banking and retail. The biggest hurdle I’ve seen people like me facing is the lack of adoption of the intelligent technologies that we build. After a lot of introspection and talking to a lot of people at different levels within and outside my organization, I concluded that the leadership, even after being shown the efficacy of our solutions, is not open to them because they don’t understand them well. So, the challenge to be solved was how to improve the communication of such highly tech-oriented material to executives and non-techies.</p>

<h3 id="selection">Selection</h3>

<p>The tool I chose to solve this problem was Visualization. Visualization involves describing an idea in a visual form, making it intuitive and easy to understand for anyone. Visualization doesn’t just mean drawing, although you can draw if you can: it means breaking your idea into small components and then describing each of them using any pictorial representation available to us: block diagrams, graphs, plots, mind maps, stick figures. The <a href="https://www.coursera.org/learn/uva-darden-design-thinking-innovation/lecture/xju53/visualization-tool">Visualization Tool video</a> by Angela Myer describes it pretty well: “The visualization is a really core component of the way we communicate, whether we are aware of it or not.” Since my challenge was more about communication and less about the solution itself, I selected visualization as my toolset to move a step closer to the mass adoption of our solutions within the organization.</p>

<h3 id="application">Application</h3>

<p>The solution the team had worked on for the past month was a set of generic forecasting models that can be applied to any business data to get future estimates. We had done extensive testing on a multitude of real organizational data, so we knew it worked. We had also shown it to some business leads and received a positive response. Now the challenge was to communicate this to the org heads for better adoption. I first divided the solution into four parts: why we need forecasting models, what data challenges there are in building the forecasting models, how our solution solves those challenges, and how it can be easily used for many of our existing processes, illustrated with a specific use-case.</p>

<p>For each of the four parts, I used a lot of visual cues to explain my points. To explain the challenges in the 2nd part, I showed different kinds of data across different time ranges that the model has to consider. To explain our solution, I had created a process flow diagram showing how all the components come into play to build it and how our model performs on the real data from two different organizational use-cases. Then to show the ease of use of the solution, I showed a dashboard where our forecasting models were doing live forecasting on real data with quite accurate results.</p>

<h3 id="insight">Insight</h3>

<p>Our demo and presentation with visualizations drove the point home. During the presentation, the leaders discussed at length how this could really impact business activities, help us get more clients for our platform, and put us in a position where we can negotiate on our terms. My manager was also happy with the reception the presentation received. As a result, we have gotten the opportunity to integrate the solution with other processes where we hadn’t thought of using it. So, I’d say, visualization is quite an effective tool for communication and getting your point across. Seeing the results of using the visualization tool this way has given me an incentive to practice it more. Now, whatever I write, be it a blog post, presentation, or documentation, I try to include visualizations to make my material more intuitive.</p>

<h3 id="approach">Approach</h3>

<p>As Professor Jeanne says, you have to practice these tools to get better at them. I am going to practice creating intuitive and simple visualizations that can capture my idea completely without any loss of understanding. Another thing I’d like to try along with the visualization is the Storytelling tool. I think, together, they have great potential for enabling amazing communication. So, next time, I am going to focus on the storytelling aspect of it as well.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[At the start of July 2020, I enrolled in a Coursera course on Design Thinking called Design Thinking for Innovation. Today, I’ve finally finished it. This post is a brief description of what I learned and my final submission about the reflection on one of the four design tools covered in the course that I employed in a challenge/problem of my choice.]]></summary></entry><entry><title type="html">Configuration Management in Python</title><link href="https://trigonaminima.github.io/2020/06/py-config/" rel="alternate" type="text/html" title="Configuration Management in Python" /><published>2020-06-24T00:00:00+00:00</published><updated>2020-06-24T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/06/py-config</id><content type="html" xml:base="https://trigonaminima.github.io/2020/06/py-config/"><![CDATA[<p>There are a lot of ways to maintain configuration in a Python project. I’ve recently learnt a way of doing it that isn’t new, but it was new to me. I’ll discuss my progression from hard-coding constants in a project to this new method.</p>

<p>Let’s first consider an example application with <em>two steps</em>, each reading an <em>input file</em> and writing to an <em>output file</em>. This application will also use an API for which we need an <em>API secret key</em>. We’ll first create the API object using the API key. In step 1, we’ll load the data, pass it to the API to get the results, and save those results. In step 2, we’ll load the saved API results, process them, and save the new results. Here’s the Python code implementing this logic in <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">myapi</span>
<span class="kn">import</span> <span class="nn">utils</span>

<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">API_SECRET_KEY</span> <span class="o">=</span> <span class="s">"SECRET_IS_THERE_IS_NO_SECRET"</span>
<span class="n">myapi_obj</span> <span class="o">=</span> <span class="n">myapi</span><span class="p">(</span><span class="n">API_SECRET_KEY</span><span class="p">)</span>

<span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"./data/"</span><span class="p">)</span>
<span class="n">STEP1_IN_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="s">"1_infile.csv"</span>
<span class="n">STEP1_OUT_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="s">"1_outfile.csv"</span>
<span class="n">STEP2_OUT_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="s">"2_outfile.csv"</span>

<span class="c1"># step 1
</span><span class="n">step1_data</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">load_data</span><span class="p">(</span><span class="n">STEP1_IN_F</span><span class="p">)</span>
<span class="n">step1_out</span> <span class="o">=</span> <span class="n">myapi_obj</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">step1_data</span><span class="p">)</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step1_out</span><span class="p">,</span> <span class="n">STEP1_OUT_F</span><span class="p">)</span>

<span class="c1"># step 2
</span><span class="n">step2_out</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">utils</span><span class="p">.</span><span class="n">load_data</span><span class="p">(</span><span class="n">STEP1_OUT_F</span><span class="p">))</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step2_out</span><span class="p">,</span> <span class="n">STEP2_OUT_F</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>The above code will work fine, and in fact, if the actual codebase is this small, keeping the hard-coded values is fine. That aside, here are some reasons why hard-coded values should be avoided:</p>

<ul>
  <li>Projects are rarely this small, so they will have many more such hard-coded constants.</li>
  <li>By keeping hard-coded secrets (e.g., <code class="language-plaintext highlighter-rouge">API_SECRET_KEY</code>) inside the main file, we risk making them public by accidentally pushing them to a hosted version control repository (e.g., on GitHub).</li>
  <li>If steps 1 and 2 were more modular (defined in different modules) and this main file was just gluing them together, how would you handle the config?
    <ul>
      <li>If you define these constants in each individual module, you’ll introduce duplicate code. You’ll be more prone to errors if you miss even a single change. You’ll be less efficient. Imagine editing the <code class="language-plaintext highlighter-rouge">STEP1_OUT_F</code> path inside the step 1 file and then changing it again in the step 2 file, since it’s used as an input in step 2.</li>
      <li>If you define the constants in the main file and send them as function arguments to each step, you will have eliminated the redundancy, but in doing so, you’ve made the code less readable if you have a lot of such variables to pass. One way to solve this would be to add all these constants to a <code class="language-plaintext highlighter-rouge">dict</code> and then pass that <code class="language-plaintext highlighter-rouge">dict</code> around. But this will still require us to define the <code class="language-plaintext highlighter-rouge">dict</code> inside the main file, which adds boilerplate code.</li>
    </ul>
  </li>
</ul>

<p>To solve the hard-coding, redundancy and boilerplate code issues, we can save all the constants inside a separate text file (e.g., <code class="language-plaintext highlighter-rouge">config.ini</code>), load this file inside the main file and then just pass that object around. The remaining problem is parsing the text file to get all the values in the correct data types. This is where Python’s <code class="language-plaintext highlighter-rouge">configparser</code> module helps.</p>

<h2 id="configuration-file-parser">Configuration file parser</h2>

<p>The Python documentation <a href="https://docs.python.org/3/library/configparser.html">page</a> has a pretty comprehensive reference on the usage of <code class="language-plaintext highlighter-rouge">configparser</code>. I’ll just show what changes to make in our toy application.</p>

<p>Here’s our new config - <code class="language-plaintext highlighter-rouge">config.ini</code>.</p>

<figure class="highlight"><pre><code class="language-ini" data-lang="ini"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nn">[API]</span>
<span class="py">SECRET</span> <span class="p">=</span> <span class="s">SECRET_IS_THERE_IS_NO_SECRET</span>

<span class="nn">[DIR]</span>
<span class="py">DATA</span> <span class="p">=</span> <span class="s">./data/</span>

<span class="nn">[FILES]</span>
<span class="py">STEP1_IN_F</span> <span class="p">=</span> <span class="s">1_infile.csv</span>
<span class="py">STEP1_OUT_F</span> <span class="p">=</span> <span class="s">1_outfile.csv</span>
<span class="py">STEP2_OUT_F</span> <span class="p">=</span> <span class="s">2_outfile.csv</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>And here’s our changed code in <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">myapi</span>
<span class="kn">import</span> <span class="nn">utils</span>
<span class="kn">import</span> <span class="nn">configparser</span>

<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="c1"># loading and parsing the config
</span><span class="n">config</span> <span class="o">=</span> <span class="n">configparser</span><span class="p">.</span><span class="n">ConfigParser</span><span class="p">()</span>
<span class="n">config</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="s">'config.ini'</span><span class="p">)</span>

<span class="n">API_SECRET_KEY</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s">"API"</span><span class="p">][</span><span class="s">"SECRET"</span><span class="p">]</span>
<span class="n">myapi_obj</span> <span class="o">=</span> <span class="n">myapi</span><span class="p">(</span><span class="n">API_SECRET_KEY</span><span class="p">)</span>

<span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">config</span><span class="p">[</span><span class="s">"DIR"</span><span class="p">][</span><span class="s">"DATA"</span><span class="p">])</span>
<span class="n">STEP1_IN_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="n">config</span><span class="p">[</span><span class="s">"FILES"</span><span class="p">][</span><span class="s">"STEP1_IN_F"</span><span class="p">]</span>
<span class="n">STEP1_OUT_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="n">config</span><span class="p">[</span><span class="s">"FILES"</span><span class="p">][</span><span class="s">"STEP1_OUT_F"</span><span class="p">]</span>
<span class="n">STEP2_OUT_F</span> <span class="o">=</span> <span class="n">data_dir</span> <span class="o">/</span> <span class="n">config</span><span class="p">[</span><span class="s">"FILES"</span><span class="p">][</span><span class="s">"STEP2_OUT_F"</span><span class="p">]</span>

<span class="c1"># step 1
</span><span class="n">step1_data</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">load_data</span><span class="p">(</span><span class="n">STEP1_IN_F</span><span class="p">)</span>
<span class="n">step1_out</span> <span class="o">=</span> <span class="n">myapi_obj</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">step1_data</span><span class="p">)</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step1_out</span><span class="p">,</span> <span class="n">STEP1_OUT_F</span><span class="p">)</span>

<span class="c1"># step 2
</span><span class="n">step2_out</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">step1_out</span><span class="p">)</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step2_out</span><span class="p">,</span> <span class="n">STEP2_OUT_F</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>With <code class="language-plaintext highlighter-rouge">configparser</code> we have the following advantages:</p>

<ul>
  <li>We have eliminated the need for hard-coding the constants inside the main code - <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>.</li>
  <li>Any changes in the config (<code class="language-plaintext highlighter-rouge">config.ini</code>) will not impact <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code> code.</li>
  <li>You can reasonably manage your config inside <code class="language-plaintext highlighter-rouge">config.ini</code> under different headers. For example, the <code class="language-plaintext highlighter-rouge">FILES</code> section can be divided into <code class="language-plaintext highlighter-rouge">STEP1</code> and <code class="language-plaintext highlighter-rouge">STEP2</code> headers.</li>
</ul>

<p>Some of the things which I found difficult or cumbersome to do using <code class="language-plaintext highlighter-rouge">configparser</code>:</p>

<ul>
  <li>Lack of automatic datatype inference. You have to use specific getter functions (<code class="language-plaintext highlighter-rouge">getint</code>, <code class="language-plaintext highlighter-rouge">getfloat</code>, <code class="language-plaintext highlighter-rouge">getboolean</code>) or create your own. For example, I don’t want to convert my string paths (<code class="language-plaintext highlighter-rouge">data_dir</code>, <code class="language-plaintext highlighter-rouge">STEP1_IN_F</code>, <code class="language-plaintext highlighter-rouge">STEP1_OUT_F</code> and <code class="language-plaintext highlighter-rouge">STEP2_OUT_F</code>) into the <code class="language-plaintext highlighter-rouge">pathlib.Path</code> objects inside <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>. I want my config loader to handle that.</li>
  <li>The config becomes difficult to manage if it is very large and involves a lot of constants or steps. You’ll have to split it into multiple files and load them individually and do all the parsing stuff.</li>
</ul>
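
<p>As a minimal sketch of the first limitation, the snippet below reads typed values through the getters and registers a custom converter so that the parser itself returns <code class="language-plaintext highlighter-rouge">pathlib.Path</code> objects. The <code class="language-plaintext highlighter-rouge">[RUN]</code> section and its keys are made up for illustration. The <code class="language-plaintext highlighter-rouge">converters</code> argument is part of the standard <code class="language-plaintext highlighter-rouge">ConfigParser</code> and adds a matching <code class="language-plaintext highlighter-rouge">get*</code> method (here <code class="language-plaintext highlighter-rouge">getpath</code>).</p>

```python
import configparser
from pathlib import Path

# Inline stand-in for config.ini; the [RUN] section is hypothetical.
INI_TEXT = """
[DIR]
DATA = ./data/

[RUN]
RETRIES = 3
TIMEOUT = 2.5
VERBOSE = yes
"""

# Registering a "path" converter makes config.getpath(...) available.
config = configparser.ConfigParser(converters={"path": Path})
config.read_string(INI_TEXT)

raw = config["RUN"]["RETRIES"]                  # plain lookup: always a str
retries = config.getint("RUN", "RETRIES")       # int
timeout = config.getfloat("RUN", "TIMEOUT")     # float
verbose = config.getboolean("RUN", "VERBOSE")   # bool ("yes" counts as True)
data_dir = config.getpath("DIR", "DATA")        # pathlib.Path via the converter

print(type(raw), type(retries), type(timeout), type(verbose), type(data_dir))
```

<p>This removes some of the conversion code from the main file, but you still have to pick the right getter for every value by hand.</p>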

<p>We can do better! We should be able to handle these issues in an elegant way. And we do that by creating a <em>module</em> for the config itself.</p>

<h2 id="config-module">Config module</h2>

<p>For each header from the <code class="language-plaintext highlighter-rouge">config.ini</code>, I’ll create a new class and import them inside the <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>. Here’s how the new <code class="language-plaintext highlighter-rouge">config.py</code> looks.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>


<span class="k">class</span> <span class="nc">APIConf</span><span class="p">:</span>
    <span class="n">SECRET</span><span class="p">:</span> <span class="nb">str</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">SECRET</span> <span class="o">=</span> <span class="s">"SECRET_IS_THERE_IS_NO_SECRET"</span>


<span class="k">class</span> <span class="nc">DirConf</span><span class="p">:</span>
    <span class="n">data_dir</span><span class="p">:</span> <span class="n">Path</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"./data/"</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Step1Conf</span><span class="p">(</span><span class="n">DirConf</span><span class="p">):</span>
    <span class="n">in_f</span><span class="p">:</span> <span class="n">Path</span>
    <span class="n">out_f</span><span class="p">:</span> <span class="n">Path</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">in_f</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"1_infile.csv"</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">out_f</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"1_outfile.csv"</span>


<span class="k">class</span> <span class="nc">Step2Conf</span><span class="p">(</span><span class="n">DirConf</span><span class="p">):</span>
    <span class="n">in_f</span><span class="p">:</span> <span class="n">Path</span>
    <span class="n">out_f</span><span class="p">:</span> <span class="n">Path</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="n">step1_fs</span> <span class="o">=</span> <span class="n">Step1Conf</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">in_f</span> <span class="o">=</span> <span class="n">step1_fs</span><span class="p">.</span><span class="n">out_f</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">out_f</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"2_outfile.csv"</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Here’s how we use the above config inside <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code>:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kn">import</span> <span class="nn">myapi</span>
<span class="kn">import</span> <span class="nn">utils</span>
<span class="kn">import</span> <span class="nn">config</span>


<span class="c1"># loading and parsing the config
</span><span class="n">api_conf</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">APIConf</span><span class="p">()</span>
<span class="n">API_SECRET_KEY</span> <span class="o">=</span> <span class="n">api_conf</span><span class="p">.</span><span class="n">SECRET</span>
<span class="n">myapi_obj</span> <span class="o">=</span> <span class="n">myapi</span><span class="p">(</span><span class="n">API_SECRET_KEY</span><span class="p">)</span>

<span class="c1"># step 1
</span><span class="n">step1_f</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">Step1Conf</span><span class="p">()</span>
<span class="n">step1_data</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">load_data</span><span class="p">(</span><span class="n">step1_f</span><span class="p">.</span><span class="n">in_f</span><span class="p">)</span>
<span class="n">step1_out</span> <span class="o">=</span> <span class="n">myapi_obj</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">step1_data</span><span class="p">)</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step1_out</span><span class="p">,</span> <span class="n">step1_f</span><span class="p">.</span><span class="n">out_f</span><span class="p">)</span>

<span class="c1"># step 2
</span><span class="n">step2_f</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">Step2Conf</span><span class="p">()</span>
<span class="n">step2_out</span> <span class="o">=</span> <span class="n">utils</span><span class="p">.</span><span class="n">process</span><span class="p">(</span><span class="n">step2_f</span><span class="p">.</span><span class="n">in_f</span><span class="p">)</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_data</span><span class="p">(</span><span class="n">step2_out</span><span class="p">,</span> <span class="n">step2_f</span><span class="p">.</span><span class="n">out_f</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Look at how clean our <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code> is now! Here’s why I think this method is better than the above two methods:</p>

<ul>
  <li>No unnecessary code to define, load and parse the config inside the main file. If you’ve noticed, I am not using <code class="language-plaintext highlighter-rouge">pathlib</code> module inside the <code class="language-plaintext highlighter-rouge">cool_config_demo.py</code> now, which, as a consequence, has made the py file much cleaner.</li>
  <li>We are defining the datatypes of the variables inside the config now. If you look at the config class definitions (<code class="language-plaintext highlighter-rouge">Step1Conf</code> and <code class="language-plaintext highlighter-rouge">Step2Conf</code>), all of my paths are <code class="language-plaintext highlighter-rouge">pathlib.Path</code> objects now.</li>
  <li>If needed you can even validate your config inside the <code class="language-plaintext highlighter-rouge">config.py</code>. You can check if a directory/file exists or not.</li>
  <li>Config can be managed and organized in a better way without any redundant code. Because of the class inheritance, <code class="language-plaintext highlighter-rouge">data_dir</code> is available inside each individual step class. And, since the output of step 1 is the input of step 2, I’ve only defined the out file path inside the step 1 class and reused that variable inside the step 2 class. This way the redundancy is gone and there is no chance of forgetting to update the second after changing the first.</li>
  <li>
    <p>There is another advantage which you’ll appreciate when your code has to run in different environments (e.g., SIT, Pre-prod or Prod). You’ll get things running locally, but many constants, like database URLs and HDFS paths, will change in the other environments. To handle that, we can load shell environment variables instead of changing the defaults in <code class="language-plaintext highlighter-rouge">config.py</code>. Adding the following snippet inside <code class="language-plaintext highlighter-rouge">__init__</code> (with <code class="language-plaintext highlighter-rouge">import os</code> at the top of <code class="language-plaintext highlighter-rouge">config.py</code>) will achieve that.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="bp">self</span><span class="p">.</span><span class="n">IGNITE_HOST</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"IGNITE_HOST"</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s">"localhost"</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>You can also implement <code class="language-plaintext highlighter-rouge">__str__</code> method in each class to print the whole config in whatever format you want.</li>
</ul>
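
<p>Putting the last few points together, here is a rough sketch of a config class that reads an environment variable, validates itself and prints itself. The <code class="language-plaintext highlighter-rouge">DATA_DIR</code> variable name and the <code class="language-plaintext highlighter-rouge">validate</code> method are assumptions made for illustration; they are not part of the <code class="language-plaintext highlighter-rouge">config.py</code> shown above.</p>

```python
import os
from pathlib import Path


class DirConf:
    data_dir: Path

    def __init__(self):
        # DATA_DIR is a hypothetical environment variable; fall back to the
        # local default when it is not set.
        self.data_dir = Path(os.environ.get("DATA_DIR", "./data/"))

    def validate(self):
        # Fail early with a clear message instead of deep inside a step.
        if not self.data_dir.is_dir():
            raise FileNotFoundError(f"data_dir does not exist: {self.data_dir}")

    def __str__(self):
        # Print every attribute as name=value, sorted by name.
        fields = ", ".join(f"{k}={v}" for k, v in sorted(vars(self).items()))
        return f"{type(self).__name__}({fields})"


conf = DirConf()
print(conf)
```

<p>Subclasses such as the step classes inherit both <code class="language-plaintext highlighter-rouge">validate</code> and <code class="language-plaintext highlighter-rouge">__str__</code>, so the extra code lives in one place.</p>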

<p>The only disadvantage of this method is that it can feel like over-engineering for small use cases. That said, using it in all my new projects has taught me many usage patterns.</p>]]></content><author><name>Shivam Rana</name></author><category term="Python" /><summary type="html"><![CDATA[There are a lot of ways to maintain configuration in a Python project. I’ve recently learnt a way that is not new, but it was new to me. I’ll discuss my progression from hard-coding constants in a project to this new method.]]></summary></entry><entry><title type="html">Python Module Versioning and Releases</title><link href="https://trigonaminima.github.io/2020/06/py-versioning/" rel="alternate" type="text/html" title="Python Module Versioning and Releases" /><published>2020-06-15T00:00:00+00:00</published><updated>2020-06-15T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/06/py-versioning</id><content type="html" xml:base="https://trigonaminima.github.io/2020/06/py-versioning/"><![CDATA[<p>At work, these days, I am building some new Python ML modules to be used within other projects. When I am making modules for myself, I don’t care about versioning or releases. But when others are using your libraries, versioning is required. This is what I know now and wish I had known before starting.</p>

<h2 id="developing-the-module">Developing the module</h2>

<p>This section covers how to test the features of your module before even committing them to git.</p>

<h3 id="simple-py-script">Simple py script</h3>

<p>Create a <code class="language-plaintext highlighter-rouge">py</code> script in the root directory of your library and then test all the functionalities of the module. Or in every module file use the <code class="language-plaintext highlighter-rouge">if __name__ == "__main__"</code> block to test that file individually. This is the most basic thing that one does.</p>

<p>Now, let’s say you want to test this module inside another code base. With this method, your only choice is to move this module directory inside that other project and then import it. But there’s a better way.</p>

<h3 id="using-syspathinsert">Using <code class="language-plaintext highlighter-rouge">sys.path.insert</code></h3>

<p>The method is to insert the following code at the top of your file in the other code base.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sys
sys.path.insert(0,'/path/to/mod_directory')
</code></pre></div></div>

<p>Here you have added the directory where your module resides to the system path, which is where Python searches whenever you import something. This way you can import the module directly, check that it works properly inside that code base, and change the module accordingly. Remember, though, that this should not be how you import code in the actual code base. This method is only for testing the module inside another project. There are other <a href="https://stackoverflow.com/q/1893598/2650427">similar</a> methods.</p>
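
<p>To see the effect end to end, here is a small self-contained sketch. It writes a throwaway module into a temporary directory (the module name <code class="language-plaintext highlighter-rouge">mymod</code> and its <code class="language-plaintext highlighter-rouge">greet</code> function are hypothetical), puts that directory at the front of <code class="language-plaintext highlighter-rouge">sys.path</code> and then imports the module as if it were installed.</p>

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for '/path/to/mod_directory': a temp dir holding a tiny module.
mod_dir = Path(tempfile.mkdtemp())
(mod_dir / "mymod.py").write_text("def greet():\n    return 'hello from mymod'\n")

# Position 0 means this directory is searched before everything else.
sys.path.insert(0, str(mod_dir))
importlib.invalidate_caches()  # make sure the freshly written file is seen

mymod = importlib.import_module("mymod")  # same as `import mymod`
message = mymod.greet()
print(message)
```

<p>Because the directory sits at index 0, it also shadows any installed package with the same name, which is one more reason to keep this trick for testing only.</p>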

<p>But there’s a more efficient method.</p>

<h3 id="development-mode">Development Mode</h3>

<ul>
  <li>Reading: <a href="https://packaging.python.org/guides/distributing-packages-using-setuptools/#working-in-development-mode">Working in “development mode”</a></li>
  <li>Reading: <a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode"><code class="language-plaintext highlighter-rouge">setuptools</code> Development Mode</a></li>
</ul>

<p><strong>TL;DR</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pip install -e .</code></li>
  <li>In the above command, <code class="language-plaintext highlighter-rouge">-e</code> is short for <code class="language-plaintext highlighter-rouge">--editable</code> and <code class="language-plaintext highlighter-rouge">.</code> is the current directory (run the command from your module directory).</li>
  <li>The command adds a link to the current directory in Python’s installed packages so that the module is available on <code class="language-plaintext highlighter-rouge">sys.path</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">pip uninstall &lt;package_name&gt;</code> (to remove it; uninstalling takes the package name rather than a path)</li>
</ul>

<p>or</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">python setup.py develop</code></li>
  <li>Installs the module in development mode, which means it can be updated and reinstalled without doing an uninstall.</li>
  <li><code class="language-plaintext highlighter-rouge">python setup.py develop --uninstall</code></li>
</ul>

<h2 id="module-versioning">Module Versioning</h2>

<p>Next is, how do you version your module? There are many versioning techniques. Which ones are accepted by pip? When should we bump the version numbers?</p>

<ul>
  <li>Reading: <a href="https://packaging.python.org/guides/distributing-packages-using-setuptools/#semantic-versioning-preferred">Preferred way - Semantic Versioning</a></li>
  <li>Reading: <a href="https://semver.org/">Semantic Versioning</a></li>
</ul>

<p><strong>TL;DR</strong></p>

<ul>
  <li>3-part <code class="language-plaintext highlighter-rouge">MAJOR.MINOR.MAINTENANCE</code> numbering scheme, where the project author increments the:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">MAJOR</code> version when they make incompatible API changes,</li>
      <li><code class="language-plaintext highlighter-rouge">MINOR</code> version when they add functionality in a backwards-compatible manner, and</li>
      <li><code class="language-plaintext highlighter-rouge">MAINTENANCE</code> version when they make backwards-compatible bug fixes.</li>
    </ul>
  </li>
  <li>Major version zero (<code class="language-plaintext highlighter-rouge">0.y.z</code>) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.</li>
  <li>Version <code class="language-plaintext highlighter-rouge">1.0.0</code> defines the public API. The way in which the version number is incremented after this release is dependent on this public API and how it changes.</li>
  <li>Patch version <code class="language-plaintext highlighter-rouge">Z</code> (<code class="language-plaintext highlighter-rouge">x.y.Z | x &gt; 0</code>) MUST be incremented if only backwards compatible bug fixes are introduced. A bug fix is defined as an internal change that fixes incorrect behavior.</li>
  <li>Minor version <code class="language-plaintext highlighter-rouge">Y</code> (<code class="language-plaintext highlighter-rouge">x.Y.z | x &gt; 0</code>) MUST be incremented if new, backwards compatible functionality is introduced to the public API. It MUST be incremented if any public API functionality is marked as deprecated. It MAY be incremented if substantial new functionality or improvements are introduced within the private code. It MAY include patch level changes. Patch version MUST be reset to 0 when minor version is incremented.</li>
  <li>Major version <code class="language-plaintext highlighter-rouge">X</code> (<code class="language-plaintext highlighter-rouge">X.y.z | X &gt; 0</code>) MUST be incremented if any backwards incompatible changes are introduced to the public API. It MAY also include minor and patch level changes. Patch and minor version MUST be reset to 0 when major version is incremented.</li>
</ul>
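
<p>The reset rules above are easy to get wrong by hand, so here is a small illustrative helper. The <code class="language-plaintext highlighter-rouge">bump</code> function is my own sketch, not something provided by pip or setuptools, and it only handles plain <code class="language-plaintext highlighter-rouge">X.Y.Z</code> versions (no pre-release or build suffixes).</p>

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping 'major', 'minor' or 'patch'."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Incompatible API change: minor and patch reset to 0.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible feature: patch resets to 0.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.4.2", "patch"))
print(bump("1.4.2", "minor"))
print(bump("1.4.2", "major"))
```

<p>Starting from <code class="language-plaintext highlighter-rouge">1.4.2</code>, the three calls give <code class="language-plaintext highlighter-rouge">1.4.3</code>, <code class="language-plaintext highlighter-rouge">1.5.0</code> and <code class="language-plaintext highlighter-rouge">2.0.0</code> respectively.</p>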

<h2 id="git-release-tagging">Git Release Tagging</h2>

<p>Git tagging is an effective way to bookmark different stages of your code base. When you reach a version, tag it in git with the relevant meta information to keep a history of the release. A tag points at one specific commit, so it effectively freezes a snapshot of your code base as of the commit you tagged.</p>

<ul>
  <li>Reading: <a href="https://www.atlassian.com/git/tutorials/inspecting-a-repository/git-tag">Git tag</a></li>
</ul>

<p>Gist:</p>

<ul>
  <li>Tagging
    <ul>
      <li>Essentially bookmarking in the code base.</li>
    </ul>
  </li>
  <li>Two tag types:
    <ul>
      <li><strong>annotated</strong>: stores extra metadata in git database.</li>
      <li><strong>lightweight</strong>: stores only the hash of the commit it refers to.</li>
    </ul>
  </li>
  <li>Creating Annotated Tags
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git tag -a v1.0.0 &lt;commit_hash&gt;</code></li>
      <li><code class="language-plaintext highlighter-rouge">git tag -a v1.0.0</code> (uses current commit)</li>
      <li><code class="language-plaintext highlighter-rouge">git tag -a v1.0.0 -m "Releasing version v1.0.0"</code></li>
    </ul>
  </li>
  <li>Creating Lightweight Tags
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git tag v1.0.0</code> (uses current commit)</li>
      <li><code class="language-plaintext highlighter-rouge">git tag v1.0.0 &lt;commit_hash&gt;</code></li>
    </ul>
  </li>
  <li>Listing Tags
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git tag</code></li>
      <li><code class="language-plaintext highlighter-rouge">git tag -n3</code> (shows tag messages as well)</li>
      <li><code class="language-plaintext highlighter-rouge">git show &lt;tag_identifier&gt;</code> (specific tag details)</li>
    </ul>
  </li>
  <li>Update previous tag
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git tag -a -f v1.0.0 &lt;commit_hash&gt;</code></li>
    </ul>
  </li>
  <li>Viewing state of repo at given tag
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git checkout v1.0.0</code></li>
    </ul>
  </li>
  <li>Deleting tag
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git tag -d v1.0.0</code></li>
    </ul>
  </li>
  <li>Publishing tags to github/gitlab
    <ul>
      <li><code class="language-plaintext highlighter-rouge">git push &lt;location&gt; &lt;tag_identifier&gt;</code> (e.g., <code class="language-plaintext highlighter-rouge">git push origin v1.0.0</code>)</li>
      <li><code class="language-plaintext highlighter-rouge">git push origin --follow-tags</code> (pushes commits together with the annotated tags pointing at them)</li>
      <li><code class="language-plaintext highlighter-rouge">git push --tags</code></li>
    </ul>
  </li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="Python" /><summary type="html"><![CDATA[At work, these days, I am building some new Python ML modules to be used within other projects. When I am making modules for myself, I don’t care about versioning or releases. But when others are using your libraries, versioning is required. This is what I know now and wish I had known before starting.]]></summary></entry><entry><title type="html">Flask in Production</title><link href="https://trigonaminima.github.io/2020/05/flask-prod/" rel="alternate" type="text/html" title="Flask in Production" /><published>2020-05-29T00:00:00+00:00</published><updated>2020-05-29T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/05/flask-prod</id><content type="html" xml:base="https://trigonaminima.github.io/2020/05/flask-prod/"><![CDATA[<p>After coding your Flask app, when you do <code class="language-plaintext highlighter-rouge">flask run</code>, you get the following warning:</p>

<blockquote>
  <p>WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.</p>
</blockquote>

<p>Here I want to discuss why it warns against using the development server in production and what to do instead. Along the way, I’ll also look into the whole Python application production setup and why it is the way it is.</p>

<h2 id="bundled-server">Bundled Server</h2>

<p>When you do <code class="language-plaintext highlighter-rouge">flask run</code> (or <code class="language-plaintext highlighter-rouge">python myapp.py</code>), Flask uses <a href="https://palletsprojects.com/p/werkzeug/">Werkzeug’s</a> development server. Flask documentation has a section on <a href="https://flask.palletsprojects.com/en/1.1.x/deploying/">Deployment Options</a> which at the top asks not to use the built-in server:</p>

<blockquote>
  <p>While lightweight and easy to use, <strong>Flask’s built-in server is not suitable for production</strong> as it doesn’t scale well. Some of the options available for properly running Flask in production are documented here.</p>
</blockquote>

<p>The reasons not to use the development server (relevant <a href="https://stackoverflow.com/a/12269934/2650427">SO answer</a>):</p>

<ul>
  <li>It will not handle more than one request at a time by default.</li>
  <li>If you leave debug mode on and an error pops up, it opens up a shell that allows for arbitrary code to be executed on your server (think <code class="language-plaintext highlighter-rouge">os.system('rm -rf /')</code>).</li>
  <li>The development server doesn’t scale well.</li>
</ul>

<p>So what’s the solution? Solution is WSGI. Let’s understand, why we need WSGI. The next few sections are largely derived (at many places, reproduced as is) from this <a href="https://www.reddit.com/r/Python/comments/8bb102/why_shouldnt_one_use_flask_bottle_django_etc/dx5qklz/">reddit post</a>.</p>

<h2 id="problem-with-normal-web-servers">Problem with normal web servers</h2>

<p>There are normal web servers, like Apache or nginx, which are built to handle web requests. They run on port 80 and handle static files efficiently. So, if you are serving static assets like images or videos, need low-level caching, or have higher concurrency demands, it’s recommended to use a web server like <a href="http://nginx.org/">nginx</a> and have it handle all of your requests.</p>

<p>The problem is that these servers don’t know what to do with Python applications (beyond simple <a href="https://en.wikipedia.org/wiki/Common_Gateway_Interface">CGI</a>).</p>

<p>The problem of normal web servers is solved by application servers.</p>

<h2 id="application-servers">Application Servers</h2>

<p>Application servers can run Python applications. At an absolute minimum, they know how to keep the Python interpreter in memory so that it does not need to be restarted on each request. Usually, application servers can also start multiple processes, handle multithreading, and so on. They typically do not listen on port 80 directly (binding to it requires elevated privileges on most systems), and by default they are not good at serving static files.</p>

<h2 id="general-request-process">General Request Process</h2>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-05/flask_prod_01.png" alt="output-agreement-games" />
</span></p>

<p>When you request a URL using a browser or any other client, the request first reaches the web server (Apache, in this example), which proxies it to an application server. The application server invokes your Python application, and the response is proxied back to Apache and then returned to you.</p>

<p>At first, application servers were not standardized. At some point, people settled on the <a href="https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface">WSGI</a> specification, which defines how a (WSGI) application server should interact with a (WSGI) Python application. Since then, Python web frameworks have focused on creating WSGI applications that can be published by different WSGI servers.</p>

<p>Gunicorn, uWSGI, Waitress, and <code class="language-plaintext highlighter-rouge">wsgiref.simple_server.WSGIServer</code> are all examples of application servers that implement the WSGI specification.</p>

<p>When you do <code class="language-plaintext highlighter-rouge">flask run</code>, you are actually starting a development WSGI server that comes with Flask by default. And it publishes your WSGI application. This development WSGI server is very limited. That’s why you need to use a real one for production.</p>
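<p>To make the WSGI contract concrete, here is a minimal sketch of a WSGI application written without any framework, using only the standard library. The names <code class="language-plaintext highlighter-rouge">application</code> and <code class="language-plaintext highlighter-rouge">serve</code>, and the host and port, are my own choices for illustration; the only thing the specification fixes is the callable’s signature and return type.</p>

```python
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """A WSGI application: a callable taking (environ, start_response)."""
    # environ is a dict describing the request; PATH_INFO is the URL path.
    body = ("Hello from " + environ.get("PATH_INFO", "/")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # The server passes in start_response; we call it with status and headers.
    start_response("200 OK", headers)
    # The return value is an iterable of byte strings.
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Publish the application with the reference WSGI server (not for production)."""
    with make_server(host, port, application) as httpd:
        httpd.serve_forever()
```

<p>Calling <code class="language-plaintext highlighter-rouge">serve()</code> starts the reference server from the standard library. Any other WSGI server can publish the same <code class="language-plaintext highlighter-rouge">application</code> callable unchanged, which is the whole point of the specification. A Flask app object is such a callable too.</p>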

<h2 id="request-process-with-a-wsgi-server">Request Process with a WSGI Server</h2>

<p>Using a WSGI server, the process looks something like this:</p>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-05/flask_prod_02.png" alt="output-agreement-games" />
</span></p>

<p>There are standalone WSGI servers, like gunicorn. And as I said, normally you proxy your requests from a web server to an application server.</p>

<p>There are also WSGI servers that are coupled with web servers. They have a special way to interact with the web server’s machinery and hook directly into request handling, so there is a direct bridge and external proxying is not necessary. In this sense:</p>

<ul>
  <li><a href="https://modwsgi.readthedocs.io/en/develop/">mod_wsgi</a> is coupled with Apache, and</li>
  <li>uWSGI with nginx</li>
</ul>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-05/flask_prod_03.png" alt="output-agreement-games" />
</span></p>

<h3 id="deployment-option-mod_wsgi-apache">Deployment option: mod_wsgi (Apache)</h3>

<p><a href="https://flask.palletsprojects.com/en/1.1.x/deploying/mod_wsgi/">Flask documentation</a> suggests using mod_wsgi if we are using <a href="https://httpd.apache.org/">Apache</a> webserver. The <a href="https://modwsgi.readthedocs.io/en/develop/">mod_wsgi homepage</a> says this:</p>

<blockquote>
  <p>The mod_wsgi package implements a simple to use Apache module which can host any Python web application which supports the Python WSGI specification.</p>
</blockquote>

<p>This <a href="https://modwsgi.readthedocs.io/en/develop/user-guides/quick-configuration-guide.html">Quick Configuration Guide</a> explains how to configure the Apache web server with mod_wsgi to host a simple Python application.</p>

<p>If the application is hosted using Apache and fails for some reason, Apache handles the automatic application restarts.</p>

<h3 id="deployment-option-gunicorn-nginx">Deployment option: Gunicorn (nginx)</h3>

<blockquote>
  <p>Gunicorn ‘Green Unicorn’ is a Python WSGI HTTP Server for UNIX. It’s a pre-fork worker model. The Gunicorn server is broadly compatible with various web frameworks, simply implemented, light on server resources, and fairly speedy.</p>
</blockquote>

<p>Although <a href="https://gunicorn.org/">Gunicorn</a> can handle HTTP requests on its own, its documentation strongly suggests running it behind nginx. Since neither nginx nor Gunicorn brings the application back up if the Gunicorn process itself dies for some reason, you’ll need to add <a href="http://supervisord.org">supervisor</a> into the mix.</p>
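<p>Gunicorn can read its settings from a Python file, so a setup like the one described above can be sketched as follows. The file name and every value here are assumptions for illustration, not something taken from my own deployment; <code class="language-plaintext highlighter-rouge">bind</code>, <code class="language-plaintext highlighter-rouge">workers</code> and <code class="language-plaintext highlighter-rouge">timeout</code> are regular Gunicorn settings.</p>

```python
# gunicorn.conf.py (hypothetical file name and values)
import multiprocessing

# Listen only on localhost; nginx proxies public traffic to this address.
bind = "127.0.0.1:8000"

# Number of pre-forked worker processes. (2 x cores) + 1 is the starting
# point suggested in the Gunicorn documentation; tune it for your workload.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before the master kills and restarts it.
timeout = 30
```

<p>You would then start the server with something like <code class="language-plaintext highlighter-rouge">gunicorn -c gunicorn.conf.py myapp:app</code>, where <code class="language-plaintext highlighter-rouge">myapp:app</code> is a placeholder for your own module and Flask app object.</p>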

<h3 id="deployment-option-uwsgi-nginx">Deployment option: uWSGI (nginx)</h3>

<p>According to the <a href="https://flask.palletsprojects.com/en/1.1.x/deploying/uwsgi/">Flask documentation</a>, uWSGI is a deployment option on servers like <a href="https://nginx.org/">nginx</a>, <a href="https://www.lighttpd.net/">lighttpd</a> and <a href="http://cherokee-project.com/">cherokee</a>.</p>

<p>uWSGI is a protocol as well as an application server. The application server can serve the uWSGI, FastCGI, and HTTP protocols. The most popular uWSGI server is <a href="https://uwsgi-docs.readthedocs.io/en/latest/">uwsgi</a>. Although uwsgi can serve HTTP requests directly, a proper web server should sit in front of it, as uwsgi is not as good as a web server at serving static files. Since neither nginx nor uwsgi brings the application back up if the uwsgi process itself dies for some reason, you’ll need to add <a href="http://supervisord.org">supervisor</a> into the mix.</p>

<h3 id="deployment-option-fastcgi-nginx">Deployment option: FastCGI (nginx)</h3>

<p>According to the <a href="https://flask.palletsprojects.com/en/1.1.x/deploying/fastcgi/">Flask documentation</a>, FastCGI is a deployment option on servers like <a href="https://nginx.org/">nginx</a>, <a href="https://www.lighttpd.net/">lighttpd</a> and <a href="http://cherokee-project.com/">cherokee</a>. It can also work with Apache, but if you’re using the Apache web server then it’s recommended to go with mod_wsgi.</p>

<p>FastCGI is a protocol as well as an application server. The application server serves the FastCGI protocol. The most popular FastCGI application server is <a href="https://pypi.org/project/flup/">flup</a>.</p>

<p>Since neither nginx nor the FastCGI server handles automatic application restarts if the application fails for some reason, you’ll need to add <a href="http://supervisord.org">supervisor</a> into the mix.</p>

<p><br /></p>

<p>General patterns I observed being in use:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">apache</code> + <code class="language-plaintext highlighter-rouge">mod_wsgi</code></li>
  <li><code class="language-plaintext highlighter-rouge">nginx</code> + <code class="language-plaintext highlighter-rouge">uwsgi</code></li>
  <li><code class="language-plaintext highlighter-rouge">nginx</code> + <code class="language-plaintext highlighter-rouge">uwsgi</code> + <code class="language-plaintext highlighter-rouge">supervisor</code></li>
  <li><code class="language-plaintext highlighter-rouge">nginx</code> + <code class="language-plaintext highlighter-rouge">gunicorn</code></li>
  <li><code class="language-plaintext highlighter-rouge">nginx</code> + <code class="language-plaintext highlighter-rouge">gunicorn</code> + <code class="language-plaintext highlighter-rouge">supervisor</code></li>
</ul>

<h2 id="resources-which-helped-me-with-the-above-basics">Resources which helped me with the above basics</h2>

<ul>
  <li><a href="https://stackoverflow.com/q/12269537/2650427">Is the server bundled with Flask safe to use in production?</a></li>
  <li><a href="https://www.reddit.com/r/Python/comments/8bb102/why_shouldnt_one_use_flask_bottle_django_etc/dx5qklz/">Why shouldn’t one use Flask, Bottle, Django, etc. directly, and always use WSGI?</a></li>
  <li><a href="https://flask.palletsprojects.com/en/1.1.x/">Flask Documentation</a></li>
  <li><a href="https://uwsgi-docs.readthedocs.io/en/latest/WSGIquickstart.html">Quickstart for Python/WSGI applications</a></li>
  <li><a href="https://www.toptal.com/flask/flask-production-recipes">Zero to Hero: Flask Production Recipes</a></li>
</ul>

<h2 id="deploying-flask-app-through-docker">Deploying Flask app through Docker</h2>

<p>I think building a Docker container for a Flask app is the right approach, as you have to set things up only once and after that it just works. You just have to do <code class="language-plaintext highlighter-rouge">docker run</code> and voila!</p>

<p>The following articles helped me set up the Docker container for my Flask app with a production server.</p>

<ul>
  <li><a href="https://smirnov-am.github.io/running-flask-in-production-with-docker/">Running Flask in production with Docker</a></li>
  <li><a href="https://medium.com/@gabimelo/developing-a-flask-api-in-a-docker-container-with-uwsgi-and-nginx-e089e43ed90e">Developing a Flask API in a Docker container with uWSGI and NGINX</a></li>
  <li><a href="https://www.toptal.com/flask/flask-production-recipes">Zero to Hero: Flask Production Recipes</a></li>
</ul>

<p>These links are to help myself if I get stuck setting up things next time. I’ll document my process once I am more comfortable with the whole ecosystem and have explored the major options available.</p>]]></content><author><name>Shivam Rana</name></author><category term="Python" /><summary type="html"><![CDATA[After coding your flask app, when you do flask run, you get the following warning:]]></summary></entry><entry><title type="html">Packing a Backpack For Cycling Trips</title><link href="https://trigonaminima.github.io/2020/03/cycle-list-100/" rel="alternate" type="text/html" title="Packing a Backpack For Cycling Trips" /><published>2020-03-21T00:00:00+00:00</published><updated>2020-03-21T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/03/cycle-list-100</id><content type="html" xml:base="https://trigonaminima.github.io/2020/03/cycle-list-100/"><![CDATA[<p>This will be my first post about cycling. I started cycling almost a year back. I have done about a dozen of half-a-day trips. They usually range between 50-130 KMs, in and around Bangalore. This post is about what I have learned to carry with me on these trips. For some of the stuff there’s a back story as well. The final list is at the bottom.</p>

<h2 id="bike-helmet">Bike Helmet</h2>

<p>A helmet is the most important thing you have to carry. No compromises on this. You never know when an accident might happen. It doesn’t matter how carefully you ride; you have to account for the other dimwits on the road.</p>

<h2 id="water">Water</h2>

<p>Second most important thing there is. You will <em>definitely</em> get dehydrated after 15-20 KMs and you’ll be in a rough situation if you don’t have access to water. So taking a water bottle with you is a given. But should you just carry one?</p>

<p>On my very first long trip, when I didn’t know anything about long-distance cycling, I took one bottle of water with me. It was a 60 KM ride through rural Bangalore, the fringes of a forest and a hill climb passing a waterfall. Due to a few detours, though, the ride turned into an 80 KM one. On top of that, I started around 9 AM, so there was a lot of sun after a few minutes. Yeah, I was a total amateur. Anyway, after the first 30-40 KMs, I was in a remote area with no shops around and I found my only water bottle about to be finished. Thank god India has a lot of temples; in temples you are sure to get drinking water. After going about 5-10 KMs without water, I found a temple. The temple was a godsend (or placed?). After about the next 20-25 KMs, I finished this refill as well. My next saviour was a fire station. After that refill, I safely reached home. I was very lucky that day.</p>

<p>The lesson I learned: always take <strong>two water bottles</strong> with you on these long trips. And keep filling them whenever you get the chance.</p>

<h2 id="food">Food</h2>

<p>Next major thing on the list. The following things worked for me -</p>

<ul>
  <li>Bananas (half a dozen at least)</li>
  <li>Nuts (1 small box)</li>
  <li>Energy bars</li>
  <li>Cucumber (2 should do)</li>
  <li>Lemonade (1 bottle)</li>
  <li>Electral (2 sachets)</li>
</ul>

<p>Bananas are the instant energy food. Eat one and you’ll feel full, and within minutes you’ll feel the energy as well. Nuts are stored energy food and take some time to give you the boost. I don’t usually buy energy bars, but I have seen a lot of people carrying those. Cucumber gives you that much-needed hydration. Lemonade and Electral replenish the salts that you lose while sweating. If you’re going light on food to carry, then have something to eat before you leave.</p>

<p>On that same first trip, before starting, I had a bowl of oats and some nuts, and just took a Red Bull can with me. Naive me thought: “Red Bull will give me wings when needed.” With the same reasoning, I didn’t buy anything on the way either. By the time I felt hungry, it was too late. I saved the Red Bull till the temple where I filled up my water bottle. Obviously, there were no wings, but it did its job. Later that day, a cyclist friend told me that you should not drink Red Bull or any other energy drinks on such trips.</p>

<h2 id="money">Money</h2>

<p>I keep some loose cash with me. No wallet. What if I get mugged somewhere? For the same reason, no card and no other valuable shit.</p>

<h2 id="id">ID</h2>

<p>I always carry a government issued ID card. It’ll help if you are unfortunate and get in some accident or other untoward circumstances.</p>

<h2 id="earphones">Earphones</h2>

<p>This is a personal requirement. I like to listen to music while cycling. They are non-noise-cancelling ones, so I am aware of what is happening on the road: honking, shouts, etc.</p>

<p>Earphones also serve another purpose: navigation. If I want to follow some directions, I put on voice navigation on Google Maps and just go wherever the Maps lady asks me to. It saves my phone’s battery and saves me from looking at the phone screen again and again. That’s why I haven’t bought a phone carry case for my bike.</p>

<h2 id="spare-tube--tyre-levers--patch-kit">Spare Tube + Tyre Levers + Patch Kit</h2>

<p>When I first bought my bike, the over-confident me thought that I would ride carefully and not let it get punctured very soon. I was successful for one or two months. On the third or fourth long trip, the inevitable happened when I went through a dried-up lake bed (now a grassland). I did some exploring in and around the lake, safely reached home, washed the bike and slept. Later in the evening, while lubing the chain, I saw the deflated tyre. It was disheartening to see. I hadn’t realised the grassland had patches of thorny grass as well. Surprisingly, even after being pierced by two thorns, the tyre held up throughout my ride. The bike shop guy gave a few suggestions at the time of repair -</p>

<ol>
  <li>Always carry a spare tube;</li>
  <li>Learn how to change the tube;</li>
  <li>Carry tyre levers to help with the tube change;</li>
  <li>Carry a patch kit.</li>
</ol>

<p>Changing a tube is quite easy. But even if you don’t know how to change the tube, you should carry a spare. That way you can at least find a repair shop and get your tyre fixed. If you’re thinking that you’ll get a new tube from the repair guy himself, it’s quite likely that the right size tube will not be available. And if you get your tyre fixed with a generic tube, there’s a higher chance of getting another flat and the possibility of damaging the rim as well. And getting a new rim <em>will</em> put a dent in your purse.</p>

<p>I learned to carry a patch kit when one day, on my morning office commute, I got a flat midway with a loud sound (the most likely reason was a <a href="https://www.youtube.com/watch?v=9kumOPKsHC8">pinch flat</a>). I usually don’t carry a spare tube while biking to work, so I had to carry my bike for 2-3 KMs to find a repair guy for a patch before going on to the office. The patch gave up by the evening, just after I got home from work. The bike shop guy reasoned that a patched tube doesn’t sit well with the tyre, which can lead to another pinch flat, or that the patch itself might not be able to handle the tyre pressure as the tube and patch materials are different.</p>

<h2 id="other-cycle-accessories">Other Cycle accessories</h2>

<h3 id="cycle-lock">Cycle Lock</h3>

<p>Your cycle is precious. You’ve spent your hard-earned money on it. You should protect it from theft. Invest in a good bike lock.</p>

<h3 id="front-and-tail-lights">Front and Tail Lights</h3>

<p>For road safety reasons. If you value your life, learn to carry these lights. They especially help at night. The rear light lets other vehicles know that someone small is on the road on a bike, so they can take care not to collide with you. The front light lets you see the road and makes you noticeable in the rear-view mirrors of the traffic ahead.</p>

<h2 id="other-accessories">Other Accessories</h2>

<p>These are the items that protect you from tanning. After going without them for a few months, I noticed how tanned my skin had become, so I had to take some measures. You should also avoid getting too much sun exposure in general.</p>

<ul>
  <li>Arm Sleeves</li>
  <li>Balaclava/Face Mask/Neck Tube</li>
  <li>(Long) Cycling Shorts</li>
  <li>Sun Lotion</li>
</ul>

<h2 id="final-list">Final List</h2>

<p>Here’s the final list.</p>

<ul>
  <li>Bike Helmet</li>
  <li>Water Bottle (2x)</li>
  <li>Food
    <ul>
      <li>Bananas (half a dozen at least)</li>
      <li>Nuts (1 small box)</li>
      <li>Energy bars</li>
      <li>Cucumber (2 should do)</li>
      <li>Lemonade (1 bottle)</li>
      <li>Electral (2 sachets)</li>
    </ul>
  </li>
  <li>Money (Loose cash)</li>
  <li>Government ID</li>
  <li>Earphones</li>
  <li>Spare Tube</li>
  <li>Tyre Levers</li>
  <li>Patch Kit</li>
  <li>Cycle Lock</li>
  <li>Front and Tail Lights</li>
  <li>Arm Sleeves</li>
  <li>Balaclava/Face Mask/Neck Tube</li>
  <li>(Long) Cycling Shorts</li>
  <li>Sun Lotion</li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="Cycling" /><summary type="html"><![CDATA[This will be my first post about cycling. I started cycling almost a year back. I have done about a dozen of half-a-day trips. They usually range between 50-130 KMs, in and around Bangalore. This post is about what I have learned to carry with me on these trips. For some of the stuff there’s a back story as well. The final list is at the bottom.]]></summary></entry><entry><title type="html">Gradient Descent</title><link href="https://trigonaminima.github.io/2020/03/grad-desc-basic/" rel="alternate" type="text/html" title="Gradient Descent" /><published>2020-03-15T00:00:00+00:00</published><updated>2020-03-15T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/03/grad-desc-basic</id><content type="html" xml:base="https://trigonaminima.github.io/2020/03/grad-desc-basic/"><![CDATA[<p>Recently, I realised that I have used majority of classical machine learning algorithms using scikit, but I haven’t actually implemented them myself. I know the basics of the algorithms, but were I to implement them, I’d most likely fail. There are always a lot <em>gotchas</em>. So, I am going to start with it now. The first stop is Gradient Descent.</p>

<p>Gradient Descent is a method to find a minimum (global or local) of a given function. For convex functions, it converges to the global minimum, provided the learning rate is suitably small. The general algorithm is: given a function to minimize, you take a step in the direction of the negative of the gradient (or slope) of the function. If you keep doing this, you’ll eventually reach a minimum. It works because the negative gradient is the direction in which the function decreases fastest at the current point. Watch this <a href="https://youtu.be/rIVLE3condE">Gradient Descent</a> video by Andrew Ng to get a more intuitive understanding of the process.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Until convergence, repeat
2. Take gradient descent step for all variables
3. Update the variables with the new variables
</code></pre></div></div>

<p>For example, if we have a function, f(x,y) then the Gradient Descent will operate as follows</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Until f(x,y) is minimum, repeat
2. Gradient descent step
    1. x_new obtained after gradient descent step for x
    2. y_new obtained after gradient descent step for y
3. Update x and y with x_new and y_new
</code></pre></div></div>
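<p>As a rough sketch of the loop above, take a concrete function. The function \(f(x, y) = (x - 1)^2 + (y + 2)^2\), the starting point and the step size <code class="language-plaintext highlighter-rouge">alpha</code> (the learning rate, covered in the next paragraph) are all my own picks for illustration. Its minimum is at \(x = 1, y = -2\), so we can check whether the loop ends up there.</p>

```python
def f(x, y):
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


def grad_f(x, y):
    # Partial derivatives of f with respect to x and y.
    return 2.0 * (x - 1.0), 2.0 * (y + 2.0)


def minimise(x, y, alpha=0.1, num_iters=200):
    for _ in range(num_iters):
        dx, dy = grad_f(x, y)
        # Compute both new values from the old (x, y), then update together.
        x_new = x - alpha * dx
        y_new = y - alpha * dy
        x, y = x_new, y_new
    return x, y


x_min, y_min = minimise(x=5.0, y=5.0)
print(x_min, y_min, f(x_min, y_min))
```

<p>The detail worth noticing is that <code class="language-plaintext highlighter-rouge">x_new</code> and <code class="language-plaintext highlighter-rouge">y_new</code> are both computed from the old values before either variable is overwritten. Updating <code class="language-plaintext highlighter-rouge">x</code> first and then using the new <code class="language-plaintext highlighter-rouge">x</code> to compute the step for <code class="language-plaintext highlighter-rouge">y</code> would be a different algorithm.</p>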

<p>The number of steps taken to reach the minimum depends on the learning rate (\(\alpha\)). The larger the value of \(\alpha\), the bigger the steps and the sooner it reaches the minimum. The opposite is also true: the smaller the \(\alpha\), the smaller the steps and the longer it takes. You have to find a balance, though: with too large an \(\alpha\), gradient descent can overshoot and fail to converge, while a smaller \(\alpha\) settles closer to the minimum but takes more time to converge. The following gradient descent step pseudocode shows how \(\alpha\) is used.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Take variable and alpha as input
2. new_variable = variable - alpha * gradient
3. return new_variable
</code></pre></div></div>
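<p>To see the effect of \(\alpha\) in practice, here is a small experiment on a one-variable function. The function \(f(x) = (x - 3)^2\), the tolerance and the three learning rates are assumptions I picked for the demo; the step itself is the same as the pseudocode above.</p>

```python
def grad_desc_step(variable, alpha):
    gradient = 2.0 * (variable - 3.0)  # derivative of (x - 3)^2
    return variable - alpha * gradient


def steps_to_converge(alpha, start=0.0, tol=1e-6, max_iters=10_000):
    """Return the number of steps needed to get within tol of x = 3, or None."""
    x = start
    for step in range(1, max_iters + 1):
        x = grad_desc_step(x, alpha)
        if abs(x - 3.0) < tol:
            return step
    # Never got close enough within max_iters (for example, it diverged).
    return None


for alpha in (0.01, 0.1, 1.1):
    print(alpha, steps_to_converge(alpha))
```

<p>For this function, each step multiplies the distance to the minimum by \(1 - 2\alpha\). So \(\alpha = 0.1\) gets there in far fewer steps than \(\alpha = 0.01\), while \(\alpha = 1.1\) makes the distance grow on every step and the function returns <code class="language-plaintext highlighter-rouge">None</code>.</p>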

<p>Usually, instead of checking only for convergence, we cap the number of iterations to end the gradient descent loop. The number-of-iterations parameter tells gradient descent how many steps to take at most. There might be cases where it would otherwise be caught in an infinite loop: either it’s diverging or the minimum is at infinity. To prevent that, gradient descent stops after a set number of iterations.</p>

<h2 id="gradient-descent-parameters">Gradient Descent Parameters</h2>

<ol>
  <li>Learning Rate (\(\alpha\))</li>
  <li>Number of Iterations</li>
</ol>

<h2 id="implementation">Implementation</h2>

<p>Generally, the scenario is to decide on a hypothesis function. This hypothesis function will have some parameters for which we want the optimal values according to our training data. To find the optimal values, we’ll have a cost function which evaluates the hypothesis for a given set of parameters and gives us a score. We have to minimize this score to reach the optimal values of the parameters to be used in the hypothesis.</p>

<p>Let’s first implement the gradient descent loop. We take the following function parameters:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">x</code>: input data</li>
  <li><code class="language-plaintext highlighter-rouge">y</code>: target column</li>
  <li><code class="language-plaintext highlighter-rouge">thetas</code>: the parameters to be optimized</li>
  <li><code class="language-plaintext highlighter-rouge">alpha</code>: learning rate</li>
  <li><code class="language-plaintext highlighter-rouge">num_iters</code>: number of iterations to end the gradient descent loop after</li>
</ul>

<p>We first add a column of ones to <code class="language-plaintext highlighter-rouge">x</code> as a bias unit. Then we create arrays to hold the history of the weights and the cost. Now the loop starts; we run it until we reach convergence or finish all the <code class="language-plaintext highlighter-rouge">num_iters</code> iterations. Within the loop, we call the <code class="language-plaintext highlighter-rouge">grad_desc_step</code> function to update the <code class="language-plaintext highlighter-rouge">thetas</code>. Once we get the updates, we save them and then move on to the next iteration if not already converged.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>


<span class="k">def</span> <span class="nf">grad_desc</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">num_iters</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="s">"""
    Gradient descent loop
    """</span>
    <span class="c1"># adds bias (=1) column to the input data
</span>    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">))</span>

    <span class="c1"># arrays to store thetas and costs: row 0 holds the initial values,
</span>    <span class="c1"># rows 1..num_iters hold the values after each step
</span>    <span class="n">theta_updates</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">num_iters</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">thetas</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="n">costs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">num_iters</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>

    <span class="c1"># thetas is a column vector, so store it as a flat row
</span>    <span class="n">theta_updates</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">thetas</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
    <span class="n">costs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">cost_func</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">)</span>

    <span class="c1"># gradient descent loop
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">num_iters</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="n">thetas</span> <span class="o">=</span> <span class="n">grad_desc_step</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">,</span> <span class="n">alpha</span><span class="p">)</span>
        <span class="n">cur_cost</span> <span class="o">=</span> <span class="n">cost_func</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">)</span>

        <span class="n">theta_updates</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">thetas</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
        <span class="n">costs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">cur_cost</span>

        <span class="c1"># stop early once the cost has stopped changing
</span>        <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">costs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">costs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">&lt;</span> <span class="mf">1e-3</span><span class="p">:</span>
            <span class="k">break</span>

    <span class="c1"># keep only the rows that were actually filled
</span>    <span class="k">return</span> <span class="n">theta_updates</span><span class="p">[:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="p">:],</span> <span class="n">costs</span><span class="p">[:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
</code></pre></figure>

<p>Now we’ll discuss the gradient descent step. We assume we have a function which calculates the gradient of the cost function - <code class="language-plaintext highlighter-rouge">cost_func_grad</code>. This function takes <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code> and <code class="language-plaintext highlighter-rouge">thetas</code> as parameters to calculate the gradient values for the given <code class="language-plaintext highlighter-rouge">thetas</code>. We use this function to get the gradient and then update the <code class="language-plaintext highlighter-rouge">thetas</code> by moving towards the negative of the gradient. The step is governed by the <code class="language-plaintext highlighter-rouge">alpha</code> or learning rate. We return the updated <code class="language-plaintext highlighter-rouge">thetas</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">grad_desc_step</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">,</span> <span class="n">alpha</span><span class="p">):</span>
    <span class="s">"""
    Updates the parameters once.
    """</span>
    <span class="c1"># Calculate the gradient
</span>    <span class="n">grad</span> <span class="o">=</span> <span class="n">cost_func_grad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">thetas</span><span class="p">)</span>

    <span class="c1"># updating the parameters
</span>    <span class="n">thetas</span> <span class="o">=</span> <span class="n">thetas</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">grad</span>
    <span class="k">return</span> <span class="n">thetas</span>
</code></pre></figure>
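<p>The code above leaves <code class="language-plaintext highlighter-rouge">cost_func</code> and <code class="language-plaintext highlighter-rouge">cost_func_grad</code> undefined. As one possible pair, here is a sketch that assumes a linear hypothesis with a (half) mean squared error cost. That choice, the toy data and the helper <code class="language-plaintext highlighter-rouge">numerical_grad</code> are all my assumptions for illustration. The finite-difference check is a cheap way to catch a wrong gradient before plugging it into the loop.</p>

```python
import numpy as np


def cost_func(x, y, thetas):
    """Half mean squared error of the linear hypothesis x @ thetas."""
    m = x.shape[0]
    residuals = x @ thetas - y  # shape (m, 1)
    return float((residuals ** 2).sum() / (2 * m))


def cost_func_grad(x, y, thetas):
    """Gradient of cost_func with respect to thetas, shape (n, 1)."""
    m = x.shape[0]
    return x.T @ (x @ thetas - y) / m


def numerical_grad(x, y, thetas, eps=1e-6):
    """Central finite differences, used only to sanity-check the gradient."""
    grad = np.zeros_like(thetas)
    for j in range(thetas.shape[0]):
        bump = np.zeros_like(thetas)
        bump[j, 0] = eps
        grad[j, 0] = (cost_func(x, y, thetas + bump)
                      - cost_func(x, y, thetas - bump)) / (2 * eps)
    return grad


# Toy data: y = 1 + 2 * feature, with the bias column already added to x.
rng = np.random.default_rng(0)
feature = rng.uniform(-1, 1, size=(50, 1))
x = np.hstack((np.ones((50, 1)), feature))
y = 1.0 + 2.0 * feature
thetas = np.zeros((2, 1))

print(np.allclose(cost_func_grad(x, y, thetas),
                  numerical_grad(x, y, thetas), atol=1e-5))
```

<p>Both functions expect <code class="language-plaintext highlighter-rouge">x</code> to already contain the bias column and <code class="language-plaintext highlighter-rouge">thetas</code> to be a column vector, which matches how the loop above calls them.</p>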

<p>So this is the generic implementation of the gradient descent. In the future posts, I’ll use this same function to implement linear and logistic regression.</p>]]></content><author><name>Shivam Rana</name></author><category term="ML" /><summary type="html"><![CDATA[Recently, I realised that I have used majority of classical machine learning algorithms using scikit, but I haven’t actually implemented them myself. I know the basics of the algorithms, but were I to implement them, I’d most likely fail. There are always a lot gotchas. So, I am going to start with it now. The first stop is Gradient Descent.]]></summary></entry><entry><title type="html">How to Design a Game With a Purpose</title><link href="https://trigonaminima.github.io/2020/02/gwap-2/" rel="alternate" type="text/html" title="How to Design a Game With a Purpose" /><published>2020-02-09T00:00:00+00:00</published><updated>2020-02-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/02/gwap-2</id><content type="html" xml:base="https://trigonaminima.github.io/2020/02/gwap-2/"><![CDATA[<p>This post is the next in series of my literature study of the <a href="https://en.wikipedia.org/wiki/Human-based_computation">Human Computation</a> and <a href="https://en.wikipedia.org/wiki/Human-based_computation_game">Games With a Purpose</a> field. Following is the list of previous posts I wrote on this:</p>

<ul>
  <li><a href="/2020/01/gwap-1/">Human Computation and Games with a Purpose</a></li>
</ul>

<p>In this post, we’ll discuss general guidelines for building a GWAP for some problems. As the authors point out in the paper, these guidelines are in no way complete, but they are a good starting point for tackling at least some computational problems by utilizing Human Computation through these interfaces.</p>

<h2 id="recap-what-is-a-game-with-a-purpose-gwap">Recap: What is a Game with a Purpose (GWAP)?</h2>

<p>You should read the <a href="/2020/01/gwap-1/">previous post</a> where I have described the related terms. For the sake of completeness I’ll give a quick definition of a GWAP. It is a game (hopefully enjoyable) designed to generate useful data for the machines as a side effect of playing it. GWAPs are designed for the purpose of solving tasks that are (quite) difficult for computers, but really easy for humans to solve. This computation done by humans is also called Human-based Computation or Human Computation.</p>

<p>All of the below listed factors give a sense of scale of problems we can solve if we are able to build effective and enjoyable GWAPs.</p>

<ul>
  <li>Increasing proportion of world’s population has access to the internet;</li>
  <li>Certain tasks are impossible for computers, but easy for humans;</li>
  <li>People spend lots of time playing games on their devices;</li>
  <li>Widespread use of smartphones has increased the surface area furthermore.</li>
</ul>

<h2 id="featured-literature">Featured Literature</h2>

<ol>
  <li>Luis von Ahn and Laura Dabbish. <strong>Designing Games With A Purpose</strong>. Communications of the ACM, August 2008. [<a href="https://dl.acm.org/doi/pdf/10.1145/1378704.1378719?download=true">pdf</a>]</li>
</ol>

<h2 id="matter">Matter</h2>

<h3 id="previous-efforts-at-employing-human-computation">Previous Efforts at Employing Human Computation</h3>

<p>The use of Human Computation to get something done is not new. Let’s first discuss the previous efforts in this direction and how successful they were.</p>

<h4 id="distributed-collaboration-by-individuals">Distributed Collaboration by Individuals</h4>

<p>When we have tasks/problems which are <strong>difficult</strong>, <strong>time consuming</strong> and <strong>nearly impossible</strong> for a <em>single person</em> (or a small group) to solve, <strong>collaboration</strong> is one way to tackle them. Some quite successful examples of such collaboration:</p>

<ul>
  <li>
    <p>Open-Source Software Development</p>

    <p>Yeah, this is a big one. We are all quite dependent on open-source software, even those of us who don’t exactly work in tech. The <em>motivation</em> for individuals to take part in this distributed collaboration can be attributed mainly to <strong>altruism</strong>. These days <strong>personal brand</strong> and <strong>monetary gains</strong> are also motivating factors.</p>
  </li>
  <li>
    <p>Wikipedia (and other sister projects)</p>

    <p>Another big one. This is the project where individual contributors are creating an encyclopedia of the world: a lot of articles on a variety of things in a variety of languages. What’s the motivation here? Again, <strong>altruism</strong>. I am putting all other reasons like saving the culture, researching something, knowledge archival, etc., under altruism.</p>
  </li>
  <li>
    <p>Stack Exchange Network</p>

    <p>Whoa, that’s another big one. The website where the community has created a <em>large</em> repository of issues faced by people and a possible set of solutions for those issues. Motivations here are two-fold: the <em>asker</em> is getting her <strong>questions answered</strong> and the <em>giver</em> is driven by <strong>altruism</strong> while at the same time increasing her <strong>personal brand</strong> through the profile score. SE has also deployed a lot of gamification features which encourage you to contribute and do more <em>Human Computation</em>.</p>
  </li>
  <li>
    <p>Amazon Mechanical Turk</p>

    <p>This is a big one for <em>rich</em> researchers. They can put their tasks out there to be annotated or solved by a host of workers registered on the platform. Workers get paid for each task solved. The motivation here is clearly the <strong>pecuniary gains</strong>.</p>
  </li>
</ul>

<h4 id="open-mind-initiative">Open Mind initiative</h4>

<p>A collaborative framework to build intelligent software by using human skills to train computers. Volunteers participate by providing answers to questions computers can’t answer (e.g., what is in this image?). From the looks of the <a href="https://wiki.p2pfoundation.net/Open_Mind_Initiative">website</a>, the project seems defunct now.</p>

<p>Drawbacks when compared to the GWAP approach:</p>

<ul>
  <li><strong>Unpaid</strong> volunteers <strong>donating</strong> their time;</li>
  <li><strong>No guarantee</strong> that the information given by volunteers is correct.</li>
</ul>

<h4 id="making-work-fun">Making Work Fun</h4>

<p>This work focusses on the point maintained by Human Computer Interaction (HCI) researchers: importance of enjoyment and fun in user interfaces. This also includes the gamification of learning activities for children. Some research efforts in this direction:</p>

<ul>
  <li>StyleCam: Game like interaction with the software. [<a href="https://www.researchgate.net/publication/220423455_Game-like_navigation_and_responsiveness_in_non-game_applications">paper</a>]</li>
  <li>psDooM: Turning user interface into a game. A first-person-shooter style interface for system-administrator-related tasks. [<a href="http://psdoom.sourceforge.net/">paper</a>]</li>
</ul>

<p>The research shows that this concept works efficiently when there’s a <strong>tight interplay between the game interaction and the task to be finished</strong>.</p>

<h2 id="gwap-design-considerations">GWAP Design Considerations</h2>

<h3 id="enjoyable-game-play">Enjoyable Game Play</h3>

<p>GWAPs do not rely on altruism, personal branding or financial incentives. They instead rely on the <strong>human desire to be entertained</strong>. People play not because they are personally interested in solving an instance of a computational problem, but because they wish to be entertained. Keeping aside the philosophy of a game being <em>fun</em> or <em>enjoyable</em>, we should be able to measure if the game is <em>successful</em>.</p>

<h3 id="useful-and-correct-computation">Useful and Correct Computation</h3>

<p>The primary purpose of a GWAP is to get reliable data for the computational problem it’s based on. A GWAP should <strong>encourage players</strong> to correctly perform the necessary steps to solve the computational problem. It should also involve a <strong>probabilistic guarantee</strong> that the game’s output is correct, even if players do not want it to be correct.</p>

<h2 id="gwap-design-templates">GWAP Design Templates</h2>

<h3 id="output-agreement-games">Output Agreement Games</h3>

<ol>
  <li><strong>Initial Setup</strong>
    <ul>
      <li>Two strangers randomly chosen from all potential players;</li>
      <li>In each round, both are given the <strong>same input</strong> and must produce outputs based on the input.</li>
    </ul>
  </li>
  <li><strong>Rules</strong>
    <ul>
      <li>game instructions indicate that players should try to <strong>produce the same output as their partners</strong>;</li>
      <li>players cannot see one another’s outputs;</li>
      <li>players can’t communicate with each other.</li>
    </ul>
  </li>
  <li><strong>Winning Condition</strong>
    <ul>
      <li>Both players have to produce the same output;</li>
      <li>They need not produce it at the same time, but must produce it at some point while the input is displayed on screen.</li>
    </ul>
  </li>
</ol>

<p>Here’s the visual display of the template reproduced from the paper linked at the top.</p>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-02/gwap1.png" alt="output-agreement-games" />
</span></p>

<p><strong>How/Why it works?</strong></p>
<ul>
  <li>Since players <em>can’t communicate</em> and know nothing about each other, the <em>easiest and most intuitive way</em> for both to produce the same output is by entering something about the only thing common between them, that is, the input.</li>
</ul>

<p><strong>How/Why is it enjoyable?</strong></p>
<ul>
  <li>Trying to agree on the same output with a partner is an enjoyable experience;</li>
  <li>The game doesn’t ask the players to enter the correct output for a given input; players are encouraged to <em>think like each other</em>, which encourages a <em>feeling of connection</em> with your partner during a game session.</li>
</ul>

<p><strong>How/Why is the computation correct?</strong></p>
<ul>
  <li>When players provide the same output, it partially verifies that the output is correct as it comes from two <strong>largely independent sources</strong></li>
</ul>
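
<p>To make the winning condition concrete, here is a minimal sketch of how one round of an output-agreement game could be checked. The function names, the normalisation step and the sample labels are my own assumptions for illustration; the paper does not prescribe an implementation.</p>

```python
from typing import Iterable, Optional


def normalize(label: str) -> str:
    """Lower-case and trim a label so trivial differences do not block a match."""
    return label.strip().lower()


def first_agreement(outputs_a: Iterable[str], outputs_b: Iterable[str]) -> Optional[str]:
    """Return the first output of player A that player B also entered, else None.

    The outputs need not be entered at the same time; any overlap within
    the round counts as agreement.
    """
    seen_b = {normalize(label) for label in outputs_b}
    for label in outputs_a:
        if normalize(label) in seen_b:
            return normalize(label)
    return None


# Hypothetical round: both players were shown the same image.
player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "park", "ball "]
match = first_agreement(player_a, player_b)
```

<p>Because neither player sees the other’s list, the only sensible way to land in the overlap is to describe the shared input.</p>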

<h3 id="inversion-problem-games">Inversion Problem Games</h3>

<ol>
  <li><strong>Initial Setup</strong>
    <ul>
      <li>Two strangers randomly chosen from all potential players;</li>
      <li>In each round, one player is <strong>describer</strong> and the other player is <strong>guesser</strong>.</li>
    </ul>
  </li>
  <li><strong>Rules</strong>
    <ul>
      <li>Describer gets an input and based on that describer produces outputs that are sent to the guesser;</li>
      <li>The output from the describer should help the guesser produce the original input.</li>
    </ul>
  </li>
  <li><strong>Winning Condition</strong>
    <ul>
      <li>The guesser produces the input that was originally given to the describer.</li>
    </ul>
  </li>
</ol>

<p>Here’s the visual display of the template reproduced from the paper linked at the top.</p>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-02/gwap2.png" alt="inversion-problem-games" />
</span></p>

<p><strong>How/Why it works?</strong></p>
<ul>
  <li>Partners are successful only when the describer provides enough outputs for the guesser to guess the original input.</li>
</ul>

<p><strong>How/Why is it enjoyable?</strong></p>
<ul>
  <li>Having one player guess the input while the other describes it is an enjoyable experience (something similar to popular children’s game “20 questions”).</li>
  <li>Adding transparency to make it more enjoyable;
    <ul>
      <li>Displaying partner guesses to the describers and allowing them to indicate whether each guess is <em>hot</em> or <em>cold</em>;</li>
      <li>Increases social connection between the players;</li>
      <li>Doesn’t compromise the output correctness.</li>
    </ul>
  </li>
  <li>Alternation to handle asymmetric nature of game;
    <ul>
      <li>Each player in the pair performs a different task;</li>
      <li>One role might be more enjoyable (faster-paced or involves more interaction);</li>
      <li>To maintain the balance, switch player roles after each round: guesser becomes the describer and describer becomes the guesser.</li>
    </ul>
  </li>
</ul>

<p><strong>How/Why is the computation correct?</strong></p>
<ul>
  <li>Game structure encourages players to enter correct information;</li>
  <li>If outputs are incorrect or incomplete, the guesser will fail to make the right guess.</li>
</ul>

<h3 id="input-agreement-games">Input Agreement Games</h3>

<ol>
  <li><strong>Initial Setup</strong>
    <ul>
      <li>Two strangers randomly chosen from all potential players;</li>
      <li>In each round, both players are given inputs that are known by the game (but not by the players) to be the same or different.</li>
    </ul>
  </li>
  <li><strong>Rules</strong>
    <ul>
      <li>Players are instructed to produce outputs describing their input, so that their partners are able to assess whether their inputs are the same or different;</li>
      <li>Players see only each other’s outputs.</li>
    </ul>
  </li>
  <li><strong>Winning Condition</strong>
    <ul>
      <li>Both players correctly determine whether they have been given the same or different inputs.</li>
    </ul>
  </li>
</ol>

<p>Here’s the visual display of the template reproduced from the paper linked at the top.</p>

<p><span style="display:block;text-align:center">
<img src="https://trigonaminima.github.io/assets/2020-02/gwap3.png" alt="input-agreement-games" />
</span></p>

<p><strong>How/Why it works?</strong></p>
<ul>
  <li>Because players want to achieve winning condition, they each want their partner to be able to determine if their inputs are the same;</li>
  <li>This means that it’s in their own best interest to enter accurate outputs that appropriately describe their individual inputs.</li>
</ul>

<p><strong>How/Why is the computation correct?</strong></p>
<ul>
  <li>Scoring strongly penalizes incorrect guesses to discourage players from randomly guessing whether inputs are the same.</li>
  <li>A way of implementing this while maintaining positive scoring system: scores or combos of streaks of correct answers</li>
</ul>
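
<p>The streak idea above can be sketched as follows. This is only one possible scoring rule that I am assuming for illustration (the <code class="language-plaintext highlighter-rouge">base_points</code> value and the linear growth are hypothetical): a wrong guess never subtracts points, but it resets the streak, so random guessing rarely reaches the high-paying part of a streak.</p>

```python
from typing import Iterable


def streak_score(results: Iterable[bool], base_points: int = 10) -> int:
    """Total score where the n-th consecutive correct guess earns n * base_points.

    A wrong guess earns nothing and resets the streak, so the score never
    goes negative while careless guessing is still costly.
    """
    total = 0
    streak = 0
    for correct in results:
        if correct:
            streak += 1
            total += streak * base_points
        else:
            streak = 0
    return total


# Three correct guesses in a row versus the same count with a miss in between.
steady = streak_score([True, True, True])
broken = streak_score([True, False, True, True])
```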

<h2 id="gwap-design-increasing-player-enjoyment">GWAP Design: Increasing Player Enjoyment</h2>

<p>From the literature on motivation in psychology and organizational behavior, goals that are both <strong>well-specified and challenging</strong> lead to higher levels of <strong>effort</strong> and task <strong>performance</strong> than goals that are too easy or vague. Research on game-design principles also tells us to introduce <strong>challenge</strong> in the games and introducing challenge will translate into different <strong>game features</strong>.</p>

<h3 id="timed-response">Timed Response</h3>

<ul>
  <li>Complete a set number of problem instances within an assigned time limit.</li>
  <li>The time limit establishes an explicit goal that is not trivial for players to achieve if the game is calibrated properly (number of tasks and time limit).</li>
  <li>Time limit and time remaining should be displayed throughout the game.</li>
</ul>

<h3 id="score-keeping">Score Keeping</h3>

<ul>
  <li>Use of points increases motivation and provides a clear connection among effort in the game, performance (achieving the winning condition), and outcomes.</li>
</ul>

<h3 id="player-skill-levels">Player Skill Levels</h3>

<ul>
  <li>Players are given skill levels or ranks.</li>
  <li>Following each game session, players are shown their current skill level and the number of points needed to reach the next level.</li>
  <li>From anecdotal experience, skill-level information strongly influenced player motivation and behavior;</li>
  <li>Also from anecdotal experience, many players continue playing just to reach a new rank.</li>
</ul>

<h3 id="high-score-lists-or-leaderboards">High score lists or Leaderboards</h3>

<ul>
  <li>Players with the highest number of points over a certain period of time;</li>
  <li>Hourly high-score list;</li>
  <li>Daily high-score list;</li>
  <li>All-time high-score list;</li>
  <li>These multi-level goals, varying in difficulty, provide strong, positive motivation for extended game play.</li>
</ul>

<h3 id="randomness">Randomness</h3>

<ul>
  <li>Random input selection brings,
    <ul>
      <li><strong>Varying difficulty</strong>, keeping the game interesting and engaging for expert and novice players alike.</li>
      <li><strong>Uncertainty</strong> about whether all inputs will be completed within the time limit</li>
    </ul>
  </li>
  <li>Random partner assignment
    <ul>
      <li>Ensures the uniqueness of each game session</li>
      <li>From anecdotal experience, during each game session, players develop a sense of their partners’ skill and way of playing (sense of connection), which is great motivation for repeated play.</li>
    </ul>
  </li>
</ul>

<h2 id="gwap-design-correctness-mechanisms">GWAP Design: Correctness Mechanisms</h2>

<ol>
  <li>Ensure output correctness (#correctness)</li>
  <li>Counter player collusion (#collusion)</li>
</ol>

<h3 id="random-matching">Random Matching</h3>
<p>Randomly matching players helps maintain correctness in two ways:</p>

<ol>
  <li>Two players who know each other can’t agree ahead of time on a cheating strategy, as the probability of them being matched with each other is very low. (#collusion)</li>
  <li>Probability of two or more cheaters using the same strategy being paired together will also be low. (#collusion)</li>
</ol>

<h3 id="player-testing">Player Testing</h3>
<ul>
  <li>Randomly present players with inputs for which all possible correct outputs are already known (test inputs); if the output produced by a particular player does not match the known values, then something is fishy. (#correctness)</li>
  <li>With enough of these <em>test inputs</em> presented to the players, we can <strong>guarantee with high probability</strong> that the output is correct. To illustrate:
    <ul>
      <li>Assume half of the inputs given to a player are test inputs.</li>
      <li>The probability that a new output by the player is correct, given that the player is correct on all the test inputs is at least 50%;</li>
      <li>This probability can be increased through repetition.</li>
    </ul>
  </li>
  <li>Similar technique is also used by Stack Exchange in their review queues. Read: <a href="https://meta.stackexchange.com/q/157121/352297">What are review tests (audits) and how do they work?</a></li>
</ul>
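
<p>A rough sketch of the player-testing check, assuming the game stores each player’s answers keyed by an input id and keeps a set of known correct outputs per test input. The data layout and names here are hypothetical, not taken from the paper.</p>

```python
from typing import Dict, Set


def passes_player_testing(answers: Dict[str, str], known_outputs: Dict[str, Set[str]]) -> bool:
    """True if every answer given for a test input is among its known correct outputs.

    Inputs that are not test inputs are ignored, because the game has no
    ground truth for them yet.
    """
    for input_id, correct in known_outputs.items():
        if input_id in answers and answers[input_id] not in correct:
            return False
    return True


# Hypothetical session: img1 and img2 are test inputs, img9 is a fresh input.
known = {"img1": {"dog", "puppy"}, "img2": {"tree"}}
honest = {"img1": "dog", "img2": "tree", "img9": "car"}
careless = {"img1": "dog", "img2": "cat", "img9": "car"}
```

<p>The output for <code class="language-plaintext highlighter-rouge">img9</code> is only trusted when the player got all the test inputs right.</p>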

<h3 id="repetition">Repetition</h3>
<ul>
  <li>We can ensure the correctness of the output with <strong>high probability</strong> if we consider the output to be correct only when a certain number of players have said so. (#correctness)</li>
  <li>Example: Consider an output-agreement game;
    <ul>
      <li>for a given input, the game considers an output to be correct after <code class="language-plaintext highlighter-rouge">n</code> pairs have entered it</li>
      <li>The game knows that each of these <code class="language-plaintext highlighter-rouge">n</code> pairs will enter a correct output with a probability of at least 50% (from player testing)</li>
      <li>The output is correct with probability of at least \((1-\frac{1}{2^n})\)</li>
    </ul>
  </li>
  <li>Stack Overflow uses this as well. For each review, it asks some <code class="language-plaintext highlighter-rouge">n</code> number of reviewers to review and then takes the next action.</li>
</ul>
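
<p>The bound from the example is easy to play with numerically. Below is a small sketch that assumes the pairs behave independently; the generalisation from 50% to an arbitrary per-pair probability <code class="language-plaintext highlighter-rouge">p_correct</code> and the helper <code class="language-plaintext highlighter-rouge">pairs_needed</code> are my own additions for illustration.</p>

```python
def agreement_confidence(n_pairs: int, p_correct: float = 0.5) -> float:
    """Lower bound 1 - (1 - p)^n on an output being correct after n pairs entered it."""
    if n_pairs < 0:
        raise ValueError("n_pairs must be non-negative")
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    return 1.0 - (1.0 - p_correct) ** n_pairs


def pairs_needed(target: float, p_correct: float = 0.5) -> int:
    """Smallest number of pairs whose bound reaches the target confidence."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 0
    while agreement_confidence(n, p_correct) < target:
        n += 1
    return n


# With the 50% guarantee from player testing, three pairs give 1 - 1/8.
three_pairs = agreement_confidence(3)
```

<p>With the default of 50%, reaching 99% confidence takes 7 pairs, since 1 - 1/128 is the first bound above 0.99.</p>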

<h3 id="taboo-outputs">Taboo Outputs</h3>
<ul>
  <li>For those inputs which can have multiple outputs, we have to ensure that the output space is covered sufficiently;</li>
  <li>Use of <em>taboo words</em> or <em>off-limits outputs</em> provides some guarantee that a larger proportion of all possible outputs will be entered by all players.</li>
  <li>Players are not allowed to enter the outputs present in the taboo words.</li>
  <li>Taboo outputs are presented in order to account for potential <strong>output-priming effects</strong> (in which the particular taboo outputs shown to the players influence the guesses they enter).</li>
</ul>
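
<p>A minimal sketch of enforcing taboo outputs, assuming outputs are plain words compared case-insensitively. The function name and the sample words are hypothetical.</p>

```python
from typing import Iterable, List, Set


def accept_outputs(entered: Iterable[str], taboo: Set[str]) -> List[str]:
    """Keep only the entered outputs that are not taboo, comparing case-insensitively."""
    blocked = {word.strip().lower() for word in taboo}
    accepted = []
    for word in entered:
        cleaned = word.strip().lower()
        # Empty entries and taboo words are silently dropped.
        if cleaned and cleaned not in blocked:
            accepted.append(cleaned)
    return accepted


# "dog" is already well covered for this image, so players must find other labels.
accepted = accept_outputs(["Dog", "leash", " park "], {"dog"})
```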

<h2 id="gwap-design-one-player">GWAP Design: One Player</h2>

<ul>
  <li>Paired game play makes GWAPs social in nature;</li>
  <li>Players are able to validate each other’s computation;</li>
  <li>Although, how to handle logistical issues in dyadic gameplay?
    <ul>
      <li>What if odd number of people are currently online?</li>
      <li>What if a partner is facing network issues and their game is not working properly?</li>
    </ul>
  </li>
</ul>

<h3 id="solution-pre-recorded-games">Solution: Pre-recorded Games</h3>

<ul>
  <li>When two people are playing, the game should simply record every action they make, along with the relative timing of each action.</li>
  <li>In a single-player game, pair the single player with a pre-recorded set of actions.</li>
  <li>This technique is easy to implement for Input and Output Agreement games;</li>
  <li>For inversion-problem games customised techniques are required because one of the players (the guesser) must dynamically respond to the other player’s (the describer) actions.</li>
</ul>
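
<p>For an output-agreement game, replaying a recording could look roughly like the sketch below. It assumes each recorded action is stored with its time offset from the start of the round, and it finds the earliest moment at which the live player and the recorded partner have both entered the same output. The class and function names are hypothetical.</p>

```python
from typing import Dict, List, NamedTuple, Optional


class TimedAction(NamedTuple):
    seconds_from_start: float  # timing relative to the start of the round
    output: str


def earliest_agreement(recorded: List[TimedAction], live: List[TimedAction]) -> Optional[float]:
    """Earliest time at which both sides have entered a common output, else None."""
    first_recorded: Dict[str, float] = {}
    for action in recorded:
        key = action.output.strip().lower()
        if key not in first_recorded or action.seconds_from_start < first_recorded[key]:
            first_recorded[key] = action.seconds_from_start

    best: Optional[float] = None
    for action in live:
        key = action.output.strip().lower()
        if key in first_recorded:
            # Agreement happens once the later of the two entries has been made.
            moment = max(action.seconds_from_start, first_recorded[key])
            if best is None or moment < best:
                best = moment
    return best


# Hypothetical recording replayed against a live player.
recording = [TimedAction(2.0, "dog"), TimedAction(9.5, "ball")]
live_player = [TimedAction(3.0, "ball"), TimedAction(12.0, "dog")]
agreed_at = earliest_agreement(recording, live_player)
```

<p>Keeping the relative timing matters because, to the live player, the recorded partner should appear to respond at a human pace.</p>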

<h2 id="gwap-design-2-players-gwaps">GWAP Design: 2+ Players GWAPs</h2>

<ul>
  <li>Multiplayer versions are competitive.</li>
  <li><strong>Output-Agreement games</strong>: Modifying the winning condition such that the first two players who agree on the output are the winners of the round (and granted a higher number of points than the non-winners).</li>
  <li><strong>Inversion-Problem games</strong>: Substituting an individual guesser with an arbitrary number of players in the role of guesser, all racing to be first to correctly guess the input (winning condition).</li>
  <li><strong>Drawback</strong>: More players are working on the same computation which is a <strong>waste of computation cycles</strong>;
    <ul>
      <li>Games can, however, be designed in a way that also utilizes the repetition technique (discussed in the correctness mechanisms) within the same round. This way less computation is wasted.</li>
    </ul>
  </li>
</ul>

<h2 id="gwap-design-evaluation">GWAP Design: Evaluation</h2>

<p><strong>Given that two different GWAPs solve the same problem, which one is best?</strong> Every GWAP associated with a computational problem can be thought of as an <strong>algorithm</strong>: give it an input and get an output. We can’t measure our <em>GWAP algorithm</em> using a big-O type metric, as it’s not clear what an atomic step in any GWAP is. On top of that, an output or running-time metric alone is not enough. We also need to quantify the enjoyability factor.</p>

<h3 id="game-efficiency-throughput">Game Efficiency: Throughput</h3>
<ul>
  <li>Average number of problem instances solved, or input-output mappings performed, per human-hour</li>
  <li>A reasonably lengthy time period should be considered when taking the average, to account for learning curves and variations in player skill (people get faster at game play over time).</li>
  <li>Higher throughput should be preferred</li>
</ul>

<h3 id="quantifying-enjoyability-average-lifetime-play-alp">Quantifying Enjoyability: Average Lifetime Play (ALP)</h3>
<ul>
  <li>Since it is difficult to quantify enjoyability of any game we’ll take a proxy</li>
  <li>Overall amount of time the game is played by each player averaged across all people who have played it.</li>
</ul>

<h3 id="one-metric-expected-contribution">One Metric: Expected Contribution</h3>

<ul>
  <li>provides a more accurate direct assessment of how much people play the game and, in turn, how useful the game is for computational purposes; more effective than self-report questionnaire measures.</li>
  <li>Expected contribution indicates the average number of problem instances a single human player can be expected to solve by playing a particular game;</li>
  <li><code class="language-plaintext highlighter-rouge">Expected Contribution = Throughput * ALP</code></li>
</ul>
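
<p>The three metrics fit in a few lines. The sketch below uses made-up numbers purely for illustration; the helper names are my own.</p>

```python
from typing import List


def throughput(instances_solved: int, human_hours: float) -> float:
    """Average number of problem instances solved per human-hour."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return instances_solved / human_hours


def average_lifetime_play(hours_per_player: List[float]) -> float:
    """Total play time of each player, averaged over everyone who has played."""
    if not hours_per_player:
        raise ValueError("need at least one player")
    return sum(hours_per_player) / len(hours_per_player)


def expected_contribution(throughput_per_hour: float, alp_hours: float) -> float:
    """Expected Contribution = Throughput * ALP."""
    return throughput_per_hour * alp_hours


# Hypothetical game: 4600 labels collected over 20 human-hours, three players.
rate = throughput(4600, 20.0)                   # instances per human-hour
alp = average_lifetime_play([0.5, 1.5, 1.0])    # hours per player
contribution = expected_contribution(rate, alp)
```

<p>Two games solving the same problem can then be compared on a single number: the one with the higher expected contribution gets more useful work out of an average player.</p>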

<h2 id="conclusion">Conclusion</h2>

<ul>
  <li><em>First</em> general method of integrating a computational problem with a game.</li>
  <li>Authors recognize that this might not be an exhaustive list of templates out there.</li>
  <li><strong>In the current templates players are rewarded for thinking like other players which uses similarity as a way to ensure output correctness.</strong></li>
  <li>These approaches may not be proper/optimal/useful for tasks that require creativity, diverse viewpoints and perspectives</li>
  <li>There might be other problems which fall outside GWAP space.</li>
  <li>Current templates solve problems which can be divided into subtasks, which makes the games appealing because of their bite-sized nature.</li>
</ul>

<p>Some questions I got after reading this paper:</p>

<ul>
  <li><strong>What kind of problems need to be (or can be) solved this way?</strong> This may help us in determining if there is even a need to look for other templates. Maybe these templates are enough?</li>
  <li><strong>How to break problems into small chunks to be created into games?</strong> This is a crucial step while setting up the narrative of the games.</li>
  <li><strong>Can we get more ideas by studying the games played by the children?</strong> I think we all play some modified versions of those games added with the adult elements.</li>
  <li><strong>What do current game-designers think about this concept?</strong> Since they deal with so many players and so many game elements, they would have some new perspective to bring to GWAPs. Sadly, I don’t know any game developers.</li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="Data" /><category term="GWAP" /><summary type="html"><![CDATA[This post is the next in series of my literature study of the Human Computation and Games With a Purpose field. Following is the list of previous posts I wrote on this:]]></summary></entry><entry><title type="html">PyData Bangalore Meetup #7</title><link href="https://trigonaminima.github.io/2020/01/pydata7/" rel="alternate" type="text/html" title="PyData Bangalore Meetup #7" /><published>2020-01-25T00:00:00+00:00</published><updated>2020-01-25T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/01/pydata7</id><content type="html" xml:base="https://trigonaminima.github.io/2020/01/pydata7/"><![CDATA[<p><a href="https://www.meetup.com/pydata-bangalore/">PyData (Bangalore)</a> is a community of users and developers of all things data - Data Science, Data Engineering, Machine Learning, Deep Learning, Data Ethics, Visualization, etc. We gather to discuss how best to apply Python tools, as well as those using R and Julia, to meet the evolving challenges in our use cases. We get together every month for 4 talks and a few lightning talks to share ideas and learn from each other.</p>

<p>We organized the <a href="https://www.meetup.com/pydata-bangalore/events/267686564/">first meetup of 2020</a> (7th overall) on Saturday, 18th Jan. The wonderful people at <a href="https://www.ab-inbev.com/">ABInBev</a>, Bangalore offered us their space to host this month’s meetup. I got a bit late in writing the proceedings, but here’s what happened throughout the session.</p>

<p>With the audience gathered, Shashi, from AbInBev, opened the session at 10:30 AM. He invited the first Speaker, <a href="https://github.com/hsakas">Akash Swamy</a>, to the stage. Akash is a Sr. Machine Learning Engineer working towards building scalable solutions for various clients. Check out his <a href="https://github.com/hsakas">github</a> for more details. The topic of Akash’s 30 minute talk was - <strong>Defensive Programming for Deep Learning with Pytorch</strong>. Slides are <a href="https://docs.google.com/presentation/d/1DPWiSbqlR-IVwn7-cx44ZXtr_Rkp0S5v00J_UsadGEg/edit?usp=sharing">here</a>. His talk was around how to have proper coding habits when you are putting ML/DL code in production.</p>

<ul>
  <li>The first point he covered was <em>Code Modularity</em>. After doing your research on the model you’re going to use, divide it into different modules. For example, instead of creating one script or a single class for everything related to an AutoEncoder, create one class for the Encoder, another for the Decoder and a last one for the AutoEncoder. This way, if you want to create a Seq2Seq model, you can reuse the Encoder and Decoder within a Seq2Seq class. Modularity thus leads to fewer bugs and more robust production code.</li>
  <li>Next he mentioned the use of <a href="https://docs.python.org/3/library/typing.html"><em>Type Annotations</em></a>, supported by the <code class="language-plaintext highlighter-rouge">typing</code> module introduced in Python 3.5. You should define your own annotations first, built from the basic annotation types. For example, the type annotation for a bounding box will be a list of tuples of floats. He also mentioned the use of <a href="https://github.com/python/mypy">mypy</a> to do static type checking.</li>
  <li>Another important part he mentioned was <em>Error Handling</em>. Instead of putting a lot of conditions in a single <code class="language-plaintext highlighter-rouge">assert</code> statement, break it into multiple <code class="language-plaintext highlighter-rouge">assert</code> statements. This way you get separate and better error handling, with a proper error message for each type of error, which helps while debugging.</li>
  <li>Last but not least, <em>Unit Testing</em>. He discussed how to go about unit testing your modules and why it’s important to have such tests in Deep Learning.</li>
</ul>
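
<p>To illustrate the type annotation and <code class="language-plaintext highlighter-rouge">assert</code> points, here is a small sketch of my own (it is not taken from Akash’s slides). The alias names and the <code class="language-plaintext highlighter-rouge">(x_min, y_min, x_max, y_max)</code> box layout are assumptions made for the example.</p>

```python
from typing import List, Tuple

# Custom annotations built from the basic annotation types.
# Assumed layout of one box: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
BoundingBoxes = List[Box]


def box_area(box: Box) -> float:
    """Area of one box, with one assert per condition so each failure is specific."""
    assert len(box) == 4, f"expected 4 coordinates, got {len(box)}"
    x_min, y_min, x_max, y_max = box
    assert x_max >= x_min, f"x_max ({x_max}) is smaller than x_min ({x_min})"
    assert y_max >= y_min, f"y_max ({y_max}) is smaller than y_min ({y_min})"
    return (x_max - x_min) * (y_max - y_min)


def total_area(boxes: BoundingBoxes) -> float:
    """Sum of the areas of all boxes (overlaps are counted twice)."""
    return sum(box_area(box) for box in boxes)


area = total_area([(0.0, 0.0, 2.0, 3.0), (1.0, 1.0, 2.0, 2.0)])
```

<p>Running mypy over such a module flags a call like <code class="language-plaintext highlighter-rouge">total_area("boxes")</code> before the code ever runs, while the separate asserts tell you exactly which condition failed at runtime.</p>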

<p>Following the presentation, Akash answered a few questions from the audience. After that, the next speaker, <a href="https://www.linkedin.com/in/tamojit-maiti-635691157/">Tamojit Maiti</a>, was invited to the stage. Tamojit is a Masters student interning at AbInBev Bangalore this semester. Check out his <a href="https://www.linkedin.com/in/tamojit-maiti-635691157/">LinkedIn</a> for more details. Tamojit spoke for about 30 minutes on <strong>State-of-the-Art Dimensionality Reduction Techniques</strong>. Slides and associated notebooks are <a href="https://github.com/tamojit-maiti/PyData-Dimension_Reduction-Jan-2020">here</a>. Following is what he covered in his talk.</p>

<ul>
  <li>What is dimensionality reduction and why do we need it.</li>
  <li>He discussed matrix factorization for dimensionality reduction; its pros and cons</li>
  <li>Next was Manifold Learning. What is it? Pros and cons. A unified approach to Manifold learning.</li>
  <li>Multi-dimensional Scaling. How it works. What are the parameters. How to use the <code class="language-plaintext highlighter-rouge">scikit-learn</code> implementation of MDS. Its drawbacks.</li>
  <li>Locally Linear Embedding. How it works. What are the parameters. How to use the <code class="language-plaintext highlighter-rouge">scikit-learn</code> implementation of LLE. Its drawbacks.</li>
  <li>t-Distributed Stochastic Neighbor Embedding. How it works.</li>
  <li>t-SNE. How it works. What are the hyperparameters. What are its drawbacks.</li>
  <li>How to select the best method? There’s no universal metric that works for everything, as it’s very hard to quantify information loss after dimensionality reduction. Different loss functions measure different things, and depending on the use-case we decide what to pick.</li>
</ul>

<p>Following the question-and-answer session, we had a 15 minute break (which is also a networking session). During this break, Shashi, the venue host, took the attendees on a tour of the cool and funky AbInBev Bangalore office. After having all the tea and coffee, everyone returned to their places, and we started the second round of talks.</p>

<p>Our next speaker was <a href="https://www.linkedin.com/in/manujosephv/">Manu Joseph</a>. Manu is a self-taught Data Scientist with 8+ years of professional experience, currently a researcher at Thoucentric Analytics. Check out more about him on <a href="https://www.linkedin.com/in/manujosephv/">LinkedIn</a>. The topic of Manu’s 45 minute talk was <strong>Interpretability: Cracking open the Black Box</strong>. Slides are <a href="https://drive.google.com/file/d/1GnpWyHXNNx-wkRgNFFvzz-iVE69ML-bV/view">here</a>. He has also prepared a 3-part blog series which you can find <a href="https://deep-and-shallow.com/2019/11/13/interpretability-cracking-open-the-black-box-part-i/">here</a>, <a href="https://deep-and-shallow.com/2019/11/16/interpretability-cracking-open-the-black-box-part-ii/">here</a> and <a href="https://deep-and-shallow.com/2019/11/24/interpretability-cracking-open-the-black-box-part-iii/">here</a>. Following are the talk highlights.</p>

<ul>
  <li>Why do we need explainable AI? Models are interacting with humans and humans might not understand some decisions made by machines. The interpretation of an ML algo is needed. And this is especially difficult in DL/NN based models.</li>
  <li>Transparent models: Linear Regression. Coefficients are interpretable. Feature importance.</li>
  <li>Transparent models: Decision Trees. Tree visualization using <a href="https://github.com/parrt/dtreeviz">dtreeviz</a>.</li>
  <li>Post-hoc Interpretation: Mean Decrease in Impurity. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret mean decrease in impurity.</li>
  <li>Post-hoc Interpretation: Drop Column Importance. What’s the algorithm. How to implement it. What are the advantages and disadvantages.</li>
  <li>Post-hoc Interpretation: Permutation Importance. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret the permutation importance.</li>
  <li>Post-hoc Interpretation: Partial Dependence Plots. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret the partial dependence plots.</li>
  <li>Post-hoc Interpretation: Local Interpretable Model-agnostic Explanations (LIME). What is it. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret it.</li>
  <li>Post-hoc Interpretation: Shapley Values and SHapley Additive exPlanations (SHAP). What it is and how the concept is borrowed from Game Theory. How it works. Mathematical guarantees of Shapley values. Different versions of SHAP. Implementation. And how to interpret the plots and values.</li>
  <li>The ending slide was about Ethics in AI. Manu exhorted everyone to think about ethics and proper interpretability while building their models. Our modeling decisions might really cause adverse effects for the consumers.</li>
</ul>
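
<p>To make the permutation importance idea from the list above a bit more concrete, here is a minimal NumPy sketch. It is not the code from the talk: the function name <code class="language-plaintext highlighter-rouge">permutation_importance</code>, the toy data and the least-squares “black box” are all assumptions made for illustration. The idea is to shuffle one column at a time and measure how much the model’s error grows.</p>

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[col] = np.mean(increases)
    return importances


# Toy data (assumed): y depends strongly on column 0, weakly on column 1,
# and not at all on column 2
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]

# A plain least-squares linear model stands in for the "black box"
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = permutation_importance(lambda data: data @ coef, X, y)
print(scores)
```

<p>On this toy data the score for column 0 should come out largest and the score for column 2 close to zero, since the target does not depend on it.</p>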

<p>The next 45-minute talk, and the very last one, was by <a href="https://twitter.com/ramya_ragupathy">Ramya Ragupathy</a>. She works with the Humanitarian OpenStreetMap Team. Find out more about her on <a href="https://twitter.com/ramya_ragupathy">Twitter</a>. The title of her talk was <strong>OpenStreetMap Data Processing with Python</strong>. Slides are <a href="https://docs.google.com/presentation/d/1rC7WFn_w_QZaoJjriwvk3ykRiWKny1ip4pZXex82amo/mobilepresent#slide=id.g6454da2949_1_119">here</a>. Following are the highlights.</p>

<ul>
  <li>What is OpenStreetMap (OSM)? Basically, the Wikipedia of maps. An open-source, global and editable geodata source.</li>
  <li>Rapidly growing community</li>
  <li>Editable by anyone; How editing works.</li>
  <li>Who uses OSM?</li>
  <li>What is GeoData? Basically, things that have a location: roads, buildings, census boundaries, temporal events (cloud cover, geo-located tweets).</li>
  <li>Storage formats of geo data: Raster. What is raster data? Data stored as pixels. What is a pixel? What info can it contain: color, height, slope, direction, etc. Commonly used in satellite imagery, weather data, etc. OSM provides raster data in the form of satellite imagery.</li>
  <li>Storage formats of geo data: Vector. What is vector data? It stores geometry, attribute and location info. No pixelation occurs with zooming. Dynamically rendered. Geometric classes of vector data: point, line, polygon. Attributes associated with vectors. Each vector and its associated attributes are like a row in a table.</li>
  <li>Accessing OSM data. File formats: GeoJSON, Shapefile, KML, PBF, GeoPackage. Python modules: <a href="https://github.com/Toblerity/Shapely">Shapely</a>, <a href="http://geopandas.org/">Geopandas</a>, <a href="https://github.com/gboeing/osmnx">OSMnx</a>.</li>
  <li>Interactive maps with <a href="https://github.com/jwass/mplleaflet">mplleaflet</a> and <a href="https://python-visualization.github.io/folium/">folium</a>.</li>
</ul>
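
<p>As a rough illustration of the vector data idea from the list above (geometry plus attributes, one row per feature), here is a small sketch using plain Python dictionaries in GeoJSON layout. The coordinates, names and the <code class="language-plaintext highlighter-rouge">kind</code> attribute are made up for the example and are not from the talk.</p>

```python
import json

# A GeoJSON FeatureCollection with one feature per geometric class.
# Coordinates are [longitude, latitude]; all values are illustrative.
features = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [77.59, 12.97]},
            "properties": {"name": "bus stop", "kind": "amenity"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[77.59, 12.97], [77.60, 12.98]],
            },
            "properties": {"name": "some road", "kind": "highway"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # One outer ring; the first and last positions are the same
                "coordinates": [[[77.59, 12.97], [77.60, 12.97],
                                 [77.60, 12.98], [77.59, 12.97]]],
            },
            "properties": {"name": "some building", "kind": "building"},
        },
    ],
}

# Each feature reads like a table row: geometry type plus attribute columns
rows = [
    (f["geometry"]["type"], f["properties"]["name"], f["properties"]["kind"])
    for f in features["features"]
]
for row in rows:
    print(row)

# The structure survives a round trip through JSON text
round_tripped = json.loads(json.dumps(features))
print(round_tripped == features)
```

<p>Libraries like Geopandas wrap this same idea in a DataFrame, with a geometry column next to the attribute columns.</p>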

<p>After the talk, Ramya fielded various questions from the audience. Once they were over, the lightning talks happened. The concept of lightning talks is that anyone can come up on stage and talk about anything: anything they have recently learned, any project they are working on, anything they are doing at work. These are informal talks and you are not required to have any slides or anything. Just come on stage and speak. We had 3 such lightning talks.</p>

<ul>
  <li>One was by Shashi, the venue host. He is a data science manager at AbInBev Bangalore. He spoke about what they do, what kind of problems they solve and what they will be doing next. He also mentioned a few openings in the team.</li>
  <li>The second was a talk about code quality in Python. I don’t remember his name (I am sorry buddy if you are reading this). He talked about context managers and debuggers. He was inspired by Akash’s talk and decided to give this lightning talk.</li>
  <li>The third was about Azure logging in your code and how to work through the dashboard Azure provides. This speaker (sorry friend, can’t remember your name either) works at AbInBev Bangalore, and they use Azure logging and the dashboard in their projects.</li>
</ul>

<p>This was the end of the lightning talks. Two attendees also announced job openings in their companies. After this, everyone dispersed for the pizzas provided by the incredible folks at AbInBev. This is also a networking session where you get to meet and talk with the other attendees and the speakers.</p>

<p>Thus we had another successful session of PyData Bangalore. If you would like to speak at a future PyData meetup, post your proposal as an issue at this <a href="https://github.com/pydatabangalore/talks/issues?q=is%3Aopen+is%3Aissue">link</a>. Yeah, we use open-source even for this :p. Once you press <code class="language-plaintext highlighter-rouge">New Issue</code>, you’ll be given a format; just fill it in and one of the PyData team members will take it forward. As easy as that. And here are the attendees and the speakers in the AbInBev office.</p>

<p><img src="https://trigonaminima.github.io/assets/2020-01/pydata.jpg" alt="attendees" /></p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[PyData (Bangalore) is a community of users and developers of all things data - Data Science, Data Engineering, Machine Learning, Deep Learning, Data Ethics, Visualization, etc. We gather to discuss how best to apply Python tools, as well as those using R and Julia, to meet the evolving challenges in our use cases. We all get together every month with 4 talks, a few lightening talks to share ideas and learn from each other.]]></summary></entry><entry><title type="html">Human Computation and Games with a Purpose</title><link href="https://trigonaminima.github.io/2020/01/gwap-1/" rel="alternate" type="text/html" title="Human Computation and Games with a Purpose" /><published>2020-01-24T00:00:00+00:00</published><updated>2020-01-24T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2020/01/gwap-1</id><content type="html" xml:base="https://trigonaminima.github.io/2020/01/gwap-1/"><![CDATA[<p>I first read the terms <em>Human Computation</em> and <em>Games with a Purpose</em> while reading about crowd sourcing. Further reading on these terms and a discussion with my boss led me to <a href="https://www.cs.cmu.edu/~biglou/">Luis von Ahn</a>, the inventor of reCAPTCHA and the founder of <a href="http://duolingo.com/">Duolingo</a>. He is a pioneer in this field. I wanted to study more about this field of research and hence I begin my systematic study of it. Picking up Luis’s papers was obviously a good starting point. In these posts, I’ll read a bunch of papers/articles and summarize them here along with my cerebrations, for my reference.</p>

<h2 id="featured-literature">Featured Literature</h2>

<ol>
  <li>Luis von Ahn. <strong>Games With A Purpose</strong>. IEEE Computer Magazine, June 2006. pp 96-98. [<a href="https://www.cs.cmu.edu/~biglou/ieee-gwap.pdf">pdf</a>]</li>
</ol>

<h2 id="matter">Matter</h2>

<h3 id="what-is-human-computation">What is Human Computation?</h3>

<ul>
  <li>Humans like to play games. By some estimates we are spending billions of hours playing computer games.</li>
  <li><strong>What if people playing computer games could, without consciously doing so, simultaneously solve large-scale problems?</strong></li>
  <li>We can treat human brains as processors in a distributed system, where each one performs a small part of a massive computation.</li>
  <li><strong>But what kind of problems will we solve?</strong> Believe it or not, AGI is still not here. We still don’t have intuitive, intelligent machines. Current “intelligent” machines still don’t have skills that humans take for granted. But they are getting better and they’ll require clean and labelled data to do so.</li>
  <li>This is what the <strong>Human Computation Paradigm</strong> is. Use humans to solve problems computers can’t solve and help computers learn to solve them.</li>
</ul>

<h3 id="what-are-games-with-a-purpose">What are Games With a Purpose?</h3>

<p>Human computation is a great solution to the current scenario, but why isn’t it being used more? Because there’s a challenge that we have to attempt to solve first.</p>

<ul>
  <li>Humans require <strong>incentive</strong> to do any sort of work.</li>
  <li>Incentive can be anything: money, enjoyment, learning, altruism, being a part of a social network or anything else.</li>
  <li>Since paying people to solve these problems is not scalable, games enter the picture. They are a seductive method and encourage players to use their brain power to solve problems without you paying them any money. <strong>Enjoyment</strong> is the incentive being employed here.</li>
  <li>Such games which help us to collect information about tasks that are currently only easily and accurately solvable by humans are called <strong>Games With a Purpose</strong>.</li>
</ul>

<p>Great, right? To solve the data problem, you build games to make humans solve some tasks for you. Unsurprisingly, to build such a game <em>you</em> have to solve some tasks first.</p>

<ul>
  <li>Designing such a game is like designing an algorithm (input-output model)</li>
  <li>As results are the part you are building the GWAP for, ensure that game play results in the <strong>correct results</strong></li>
  <li>At the same time, it should be <strong>enjoyable</strong>.</li>
  <li>Finally, the system should be such that its efficiency can be measured.</li>
</ul>

<h3 id="proof-of-concept-games">Proof of Concept Games</h3>

<p>Below are the two games presented in the paper as case studies.</p>

<ol>
  <li>
    <p>ESP Game (Labeling Images)</p>

    <ul>
      <li>2 player game which gets labels for an image.</li>
      <li>2 random users are matched - no communication between them, no identity is revealed.</li>
      <li>An image is given to both players and they have to come up with <em>same</em> labels, as many as possible.</li>
      <li>Score is dependent on the number of same labels they got; this ensures that they <strong>both agree</strong> on the image for that label.</li>
      <li>Image also has taboo words associated with it which users can’t use; so game also ensures that <strong>exhaustive labels</strong> come for the image.</li>
      <li>Users can skip an image if they want; ensures boring images or images for which <strong>no more labels</strong> can be obtained are skipped.</li>
    </ul>
  </li>
  <li>
    <p>Peekaboom (Locating objects in images)</p>

    <ul>
      <li>2 player game which gets the location of objects within an image.</li>
      <li>2 random users (called, Peek and Boom) are matched - no communication between them, no identity is revealed.</li>
      <li>Peek gets a blank screen and Boom gets an image and an associated word with it; these image and word pairs came from ESP game.</li>
      <li>Boom has to reveal some part of the image to Peek, and Peek has to guess the keyword. Since the keyword is something that is present in the image, Boom will obviously <strong>reveal those parts of the image where the object is located</strong>.</li>
      <li>Revealing part is done by clicking which creates a 20px radius circle.</li>
      <li>There is also a hinting system to enable Boom in helping Peek guess the word.</li>
      <li>The score depends on how little of the image was revealed to Peek.</li>
      <li>Once the correct guess is made, the roles of Peek and Boom switch.</li>
      <li>Users can skip the image if they want.</li>
      <li>When enough guesses are made for a single image, combining all the sessions gives the complete location of the object in image by pixel.</li>
    </ul>
  </li>
</ol>
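
<p>The agreement and taboo-word mechanics of the ESP Game described above can be sketched in a few lines of Python. This is only my reading of the rules as summarized here, not the game’s actual implementation: the function name <code class="language-plaintext highlighter-rouge">esp_round</code>, the example guesses and the one-point-per-label scoring rule are assumptions.</p>

```python
def esp_round(guesses_a, guesses_b, taboo_words):
    """Return the labels both players typed, ignoring taboo words."""
    taboo = {word.lower() for word in taboo_words}
    labels_a = {g.lower() for g in guesses_a} - taboo
    labels_b = {g.lower() for g in guesses_b} - taboo
    # Only labels that both players produced independently are accepted
    return sorted(labels_a.intersection(labels_b))


agreed = esp_round(
    guesses_a=["dog", "grass", "ball", "puppy"],
    guesses_b=["Puppy", "ball", "park", "dog"],
    taboo_words=["dog"],  # assumed to be already collected for this image
)
score = len(agreed)  # hypothetical scoring rule: one point per agreed label
print(agreed, score)  # ['ball', 'puppy'] 2
```

<p>Because the two players cannot communicate, a label that survives the intersection is very likely a genuine description of the image, and the taboo list pushes later pairs towards labels that have not been collected yet.</p>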

<p>There’s an interesting lecture by Luis on the parts discussed till now - <a href="https://youtu.be/tx082gDwGcM">human computation</a>.</p>

<h3 id="potential-games-suggested">Potential Games Suggested</h3>

<ul>
  <li>Language Translation
    <ul>
      <li>2 player game where both players speak different languages and the challenge is to translate text from one language to other</li>
      <li>He made something based on this - <a href="http://duolingo.com/">Duolingo</a>.</li>
    </ul>
  </li>
  <li>Monitoring of security cameras
    <ul>
      <li>Players could monitor security cameras and alert authorities about suspected illegal activity.</li>
      <li>This idea is very questionable; who knows, how it’ll really work without privacy breaches.</li>
    </ul>
  </li>
  <li>Improving web search
    <ul>
      <li>A game where players perform searches for other people.</li>
      <li>Today’s search engine technology is way too advanced for this to be helpful</li>
      <li>I guess people using search engines like Google, Bing and DDG are already giving them helpful information for making their search technology better; so the concept is being used in some way.</li>
    </ul>
  </li>
  <li>Text summarization
    <ul>
      <li>A game in which people summarize important documents for the rest of the world</li>
      <li>Solving this would require proper game design and dividing the text into small chunks.</li>
    </ul>
  </li>
</ul>

<p>I might have a lot more thoughts about the kind of applications that we can solve through GWAPs, but I need to crystallize them before dumping them out here.</p>

<p>Human computation (and GWAPs) and crowd-sourcing overlap quite a lot. Taking Wikipedia as an example, human computation is definitely involved in creating Wiki pages, editing them, reviewing them. But the incentive here is pure altruism. Volunteers are not getting paid to do any work on Wikipedia. But probably that’s why the “editor-base” of Wikipedia is so small. Only those who have motivation to contribute, work on it.</p>

<p>Whereas, GWAPs provide the incentive of enjoyment. Players (users) solve these tasks not because they want to solve them. It’s because they want to get the enjoyment. So the task they are solving needs to be ingrained with the game play.</p>

<p>So the only one and important question I’ll end this post with is, <strong>How can we design boring, data labeling tasks as games for the users to enjoy them?</strong></p>]]></content><author><name>Shivam Rana</name></author><category term="Data" /><category term="GWAP" /><summary type="html"><![CDATA[I first read the terms Human Computation and Games with a Purpose while reading about crowd sourcing. Further reading on these terms and a discussion with my boss led me to Luis von Ahn, the inventor of reCAPTCHA and the founder of Duolingo. He is a pioneer in this field. I wanted to study more about this field of research and hence I begin my systematic study of it. Picking up Luis’s papers was obviously a good starting point. In these posts, I’ll read a bunch of papers/articles and summarize them here along with my cerebrations, for my reference.]]></summary></entry><entry><title type="html">Fixed Perimeter + Maximum Area = Square</title><link href="https://trigonaminima.github.io/2019/12/grad-desc-area/" rel="alternate" type="text/html" title="Fixed Perimeter + Maximum Area = Square" /><published>2019-12-25T00:00:00+00:00</published><updated>2019-12-25T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2019/12/grad-desc-area</id><content type="html" xml:base="https://trigonaminima.github.io/2019/12/grad-desc-area/"><![CDATA[<p>I’d like to verify using <a href="https://en.wikipedia.org/wiki/Gradient_descent">Gradient Descent</a>, that given a perimeter value of a quadrilateral, <em>square</em> is the one with the maximum area. This can be verified/proved using various analytical methods, but my objective here was to verify it using Gradient Descent. You ask why? Because I wanted to do it. In the process, I got more than what I had hoped for. Here are some intuitive explanations of the problem I am trying to verify - <a href="https://math.stackexchange.com/q/1082474/467063">More room in a Square Room</a></p>

<p>Instead of directly jumping to the most general form, I decided to divide the verification into multiple levels. First, I verified the simplest case, where I assumed a lot of things, and then moved on towards the most generic case, where all assumptions are eliminated. The simplest case was to verify that, out of all the possible rectangles for a given perimeter, the square is the one with the maximum area. Here, the quadrilateral being a rectangle ensured that all angles are 90 degrees and opposite sides are equal. Then I tried the verification for all parallelograms. In this case, each parallelogram has opposite sides equal (which, in turn, makes opposite angles equal). At the end, the most generic case should have been picked, that is, out of all the possible quadrilaterals of a given perimeter, the square is the one with the maximum area, but I couldn’t even clear the parallelogram level. So, in a way, this blog post is a log of my failure to complete the task. Notwithstanding this failure, I did learn/solidify a few things while doing this activity and I’d like them to be documented. There are a lot of basic calculus equations in the explanations, but they are easy enough to understand. Don’t let the various calculus symbols fool you.</p>

<p>Let’s first understand how Gradient Descent (GD) works. In simple terms, GD is used to minimize a function. Minimization means that it finds the inputs for which the function yields the lowest value. To find this value, it takes small steps in the direction of the <em>negative of the gradient</em> (or approximate gradient) of the function. A gradient tells us about the steepness of a function at a particular point. So if we take steps in the negative direction of the gradient, we follow the steepest route down towards the lowest value of the function. Following is the GD algorithm:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Until convergence, repeat
2. Take gradient descent step for all variables
3. Update the variables with the new variables
</code></pre></div></div>

<p>For example, if we have a function, f(x,y) then the Gradient Descent will operate as follows</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Until f(x,y) is minimum, repeat
2. Gradient descent step
    1. x_new obtained after gradient descent step for x
    2. y_new obtained after gradient descent step for y
3. Update x and y with x_new and y_new
</code></pre></div></div>

<p>In a gradient descent process, one thing to consider is the learning rate, \(\alpha\). This controls how big or small a step to take in the direction of the negative gradient. A big \(\alpha\) takes bigger steps and so can reach the lowest point quickly, but it’ll not be as accurate, and it might overshoot and fail to reach the minimum at all. A small \(\alpha\) takes small steps, so it’ll take longer to reach the minimum, but it’ll be much more accurate. So there’s a tradeoff between speed and accuracy. The gradient descent step where we can see \(\alpha\) in action is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Take variable and alpha as input
2. new_variable = variable - alpha * gradient
3. return new_variable
</code></pre></div></div>
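
<p>To see the pseudocode above run end to end, here is a minimal sketch on a toy function. Everything in it is an assumption chosen for illustration: the bowl-shaped function <code class="language-plaintext highlighter-rouge">f</code> with its minimum at (3, -1), the helper names, the fixed number of steps instead of a convergence check, and the three learning rates.</p>

```python
def f(x, y):
    """A bowl-shaped function whose minimum is at (3, -1)."""
    return (x - 3) ** 2 + (y + 1) ** 2


def gradient(x, y):
    """Partial derivatives of f with respect to x and y."""
    return 2 * (x - 3), 2 * (y + 1)


def gd_step(variable, grad, alpha):
    """new_variable = variable - alpha * gradient"""
    return variable - alpha * grad


def gradient_descent(x, y, alpha, n_steps):
    """Run a fixed number of gradient descent steps from (x, y)."""
    for _ in range(n_steps):
        grad_x, grad_y = gradient(x, y)
        # Compute both new values before updating either variable
        x_new = gd_step(x, grad_x, alpha)
        y_new = gd_step(y, grad_y, alpha)
        x, y = x_new, y_new
    return x, y


for alpha in (0.01, 0.1, 1.1):
    x, y = gradient_descent(0.0, 0.0, alpha, n_steps=50)
    print(alpha, x, y, f(x, y))
```

<p>With 50 steps from (0, 0), the smallest rate is still well short of (3, -1), the middle one lands very close to it, and the largest one overshoots further on every step and diverges. That is the learning rate tradeoff in miniature.</p>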

<p>Watch this <a href="https://youtu.be/rIVLE3condE">Gradient Descent</a> video by Andrew Ng to get a more intuitive understanding of the process.</p>

<h2 id="rectangle">Rectangle</h2>

<p>The simplest case to start with was a Rectangle. All angles are of 90 degrees. Pairs of opposite sides are equal. The area of the rectangle depends on <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code> so, we only had to optimize for these two variables. Our objective was to find the values of <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code> for which the <code class="language-plaintext highlighter-rouge">area</code> will be <em>maximum</em> for a given <code class="language-plaintext highlighter-rouge">perimeter</code>. I first calculated the optimal dimensions using a brute force method. Then I solved it using the gradient descent.</p>

<h3 id="rectangle---brute-force">Rectangle - Brute Force</h3>

<p>In the brute force solution, I started with the smallest value and incremented it by a small, constant step size. It tried out all the possible values and selected the one which gave the greatest area. One optimization I employed was to iterate only over <code class="language-plaintext highlighter-rouge">length</code> and calculate <code class="language-plaintext highlighter-rouge">breadth</code> by subtracting <code class="language-plaintext highlighter-rouge">length</code> from half of the <code class="language-plaintext highlighter-rouge">perimeter</code>, which we get as follows.</p>

<!-- https://tex.stackexchange.com/a/162542/118936 -->

\[\begin{alignat}{2}
    &amp;&amp;perimeter
    &amp;= 2 * (length + breadth)\\
    \Leftrightarrow\quad
    &amp;&amp;length + breadth
    &amp;= perimeter * 0.5
\end{alignat}\]

<p>Which gives,</p>

\[breadth = (perimeter * 0.5) - length\]

<p>This also ensured that we remained in the possible value range of <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code>, and didn’t search beyond that.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>


<span class="k">def</span> <span class="nf">expected_optimum</span><span class="p">(</span><span class="n">perimeter</span><span class="p">):</span>
    <span class="s">"""
    Returns the dimensions - area and sides - of the square
    for the given perimeter.
    """</span>
    <span class="n">side</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">4</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">),</span> <span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""l * b"""</span>
    <span class="k">return</span> <span class="n">length</span> <span class="o">*</span> <span class="n">breadth</span>


<span class="k">def</span> <span class="nf">perimeter</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""2 * l + 2 * b"""</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">length</span> <span class="o">+</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">find_optimum_brute_force</span><span class="p">(</span><span class="n">perimeter</span><span class="p">):</span>
    <span class="s">"""
    Finds the dimensions of the rectangle which have the
    maximum area for a given perimeter.

    It uses brute force to find the dimensions.
    """</span>
    <span class="n">half</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="n">step_size</span> <span class="o">=</span> <span class="mf">0.1</span>

    <span class="n">max_area</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">half</span><span class="p">,</span> <span class="n">step_size</span><span class="p">):</span>
        <span class="n">length</span> <span class="o">=</span> <span class="n">i</span>
        <span class="n">breadth</span> <span class="o">=</span> <span class="n">half</span> <span class="o">-</span> <span class="n">length</span>

        <span class="n">a</span> <span class="o">=</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">a</span> <span class="o">&gt;</span> <span class="n">max_area</span><span class="p">:</span>
            <span class="n">max_area</span> <span class="o">=</span> <span class="n">a</span>
            <span class="n">config</span> <span class="o">=</span> <span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">max_area</span><span class="p">,</span> <span class="n">config</span>

<span class="n">p</span> <span class="o">=</span> <span class="mi">36</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Expected</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">expected_optimum</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Brute Force</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_brute_force</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expected	:  (81.0, (9.0, 9.0))
Brute Force	:  (81.0, (9.0, 9.0))
</code></pre></div></div>

<p>Note that I’ve used the above numbers to check the correctness of all of the implementations. For each method, I give it a perimeter of <code class="language-plaintext highlighter-rouge">36 units</code> and expect both sides to come out as <code class="language-plaintext highlighter-rouge">9 units</code>.</p>

<p>The brute force solution worked perfectly. The visualizations below give some insight into how this solution worked.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_11_0.png" alt="png" /></p>

<p>Above three plots show all the values that the brute force method checked to find the maximum area.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_13_0.png" alt="png" /></p>

<p>These three plots are the same as before, but here the values were only captured when the current area was found to be greater than the previous one.</p>

<p>Astute readers will notice that even though the optimum solution was right in the middle, brute force tried all the solutions before returning the middle solution as final result.</p>

<h3 id="rectange---gradient-descent-with-hack">Rectangle - Gradient Descent (with Hack)</h3>

<p>Hack means that I used the same optimization as in the brute force solution: just iterating over <code class="language-plaintext highlighter-rouge">length</code> and calculating <code class="language-plaintext highlighter-rouge">breadth</code> by subtracting <code class="language-plaintext highlighter-rouge">length</code> from half of the <code class="language-plaintext highlighter-rouge">perimeter</code>. This way, Gradient Descent only had to optimize for a single variable. I’ve kept this one around for discussion, and later, I also discuss the one where optimization is done for both <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code>.</p>

<p>I have to maximize the area of the rectangle. The area is given by the following equation,</p>

\[area(length, breadth) = length * breadth ;\]

<p>The maximization equation can be written as-</p>

\[max_{length, breadth} area(length, breadth)\]

<p>It means that over all values of <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code>, find the maximum area. Gradient Descent can only find minima, which prompted me to change the above maximization problem into a minimization one. So I introduced a constant term, which is the optimum area, <code class="language-plaintext highlighter-rouge">area_opt</code>. The optimum area is the area we want, that is, the area of the square. If I subtract the product of length and breadth (which is the area of the rectangle) from the optimum area (assuming I know this value; I’ll show that it doesn’t matter if we don’t know it), I get a gap. If I minimize this gap, then I’ll eventually reach the optimum area using the gradient descent process. Thus, the new function to optimize is-</p>

\[area\_gap(length, breadth) = area\_opt  - area(length, breadth) ;\]

<p>Using the above equation, the following minimization equation is written-</p>

\[min_{length, breadth} area\_gap(length, breadth)\]

<p>This is the final equation to optimize for using Gradient Descent. I started by taking the gradient (partial derivatives) of the <code class="language-plaintext highlighter-rouge">area_gap(length, breadth)</code>. Since I was only optimizing for length, I only had to take the derivative wrt length.</p>

\[\frac{\partial area\_gap}{\partial length} = -breadth\]
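
<p>A quick way to sanity-check this derivative, and to see that the constant <code class="language-plaintext highlighter-rouge">area_opt</code> plays no role in it, is a finite-difference comparison. The sketch below is only an illustration: the helper <code class="language-plaintext highlighter-rouge">numeric_grad_wrt_length</code>, the step <code class="language-plaintext highlighter-rouge">eps</code> and the sample values are assumptions, and <code class="language-plaintext highlighter-rouge">breadth</code> is held fixed exactly as the partial derivative above does.</p>

```python
def area_gap(length, breadth, area_opt):
    """area_opt - length * breadth"""
    return area_opt - length * breadth


def numeric_grad_wrt_length(length, breadth, area_opt, eps=1e-6):
    """Central finite difference of area_gap with respect to length."""
    upper = area_gap(length + eps, breadth, area_opt)
    lower = area_gap(length - eps, breadth, area_opt)
    return (upper - lower) / (2 * eps)


length, breadth = 4.0, 14.0
for area_opt in (81.0, 0.0, 1000.0):
    # The constant area_opt cancels in the difference, so each line
    # should print a value close to -breadth
    print(area_opt, numeric_grad_wrt_length(length, breadth, area_opt))
```

<p>Whatever value is plugged in for <code class="language-plaintext highlighter-rouge">area_opt</code>, the numerical gradient stays near <code class="language-plaintext highlighter-rouge">-breadth</code>, which is why not knowing the optimum area in advance does not affect the descent steps.</p>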

<p>Next is just to write the Gradient Descent step to optimize for length, which takes in the length, breadth and the learning rate (\(\alpha\)) and returns the new value of the length after moving in the direction of the negative of the gradient. I used this GD step function in a Gradient Descent loop to keep doing it until it arrived at the solution. Here’s the code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">expected_optimum</span><span class="p">(</span><span class="n">perimeter</span><span class="p">):</span>
    <span class="s">"""
    Returns the dimensions - area and sides - of the square
    for the given perimeter.
    """</span>
    <span class="n">side</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">4</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">),</span> <span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""l * b"""</span>
    <span class="k">return</span> <span class="n">length</span> <span class="o">*</span> <span class="n">breadth</span>


<span class="k">def</span> <span class="nf">perimeter</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""2 * l + 2 * b"""</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">length</span> <span class="o">+</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""-b"""</span>
    <span class="k">return</span> <span class="o">-</span><span class="n">breadth</span>


<span class="k">def</span> <span class="nf">grad_desc_step</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
    <span class="s">"""l - lr * (-b)"""</span>
    <span class="k">return</span> <span class="n">length</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">grad_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">find_optimum_grad_desc</span><span class="p">(</span><span class="n">perimeter</span><span class="p">,</span> <span class="n">lr</span><span class="p">,</span> <span class="n">th</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="n">half</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">2</span>

    <span class="n">length</span> <span class="o">=</span> <span class="n">init_length</span><span class="p">(</span><span class="n">init_opt</span><span class="p">,</span> <span class="n">half</span><span class="p">)</span>
    <span class="n">breadth</span> <span class="o">=</span> <span class="n">half</span> <span class="o">-</span> <span class="n">length</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>

    <span class="k">while</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">length</span> <span class="o">=</span> <span class="n">grad_desc_step</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lr</span><span class="p">)</span>
        <span class="n">breadth</span> <span class="o">=</span> <span class="n">half</span> <span class="o">-</span> <span class="n">length</span>
        <span class="n">a_next</span> <span class="o">=</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>

        <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">a</span> <span class="o">-</span> <span class="n">a_next</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">th</span><span class="p">:</span>
            <span class="k">break</span>

        <span class="n">a</span> <span class="o">=</span> <span class="n">a_next</span>

    <span class="k">return</span> <span class="n">a</span><span class="p">,</span> <span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">init_length</span><span class="p">(</span><span class="n">opt</span><span class="p">,</span> <span class="n">half</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">opt</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="k">elif</span> <span class="n">opt</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">100</span>
    <span class="k">elif</span> <span class="n">opt</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">half</span><span class="o">/</span><span class="mi">3</span>
    <span class="k">elif</span> <span class="n">opt</span> <span class="o">==</span> <span class="mi">3</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">half</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">100</span>

</code></pre></div></div>

<p>I ran the above Gradient Descent based optimization func for different initial values of the length. The <code class="language-plaintext highlighter-rouge">init_length</code> function does that.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span> <span class="o">=</span> <span class="mi">36</span>
<span class="k">print</span><span class="p">(</span><span class="s">"When length is initialized to 0:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Expected</span><span class="se">\t\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">expected_optimum</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gradient Desc (l=0)</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gradient Desc (l=-100)</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gradient Desc (l=half/3): "</span><span class="p">,</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gradient Desc (l=half)</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">3</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Gradient Desc (l&gt;half)</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">,</span> <span class="n">init_opt</span><span class="o">=</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expected		:  (81.0, (9.0, 9.0))
Gradient Desc (l=0)	:  (80.99999998729646, (9.000787301320313, 8.999212698679687))
Gradient Desc (l=-100)	:  (80.99999992026922, (9.000617661847032, 8.999382338152968))
Gradient Desc (l=half/3):  (80.99999962882168, (9.00029081686697, 8.99970918313303))
Gradient Desc (l=half)	:  (0.0, (18.0, 0.0))
Gradient Desc (l&gt;half)	:  (-0.009998969262668187, (18.000555425602073, -0.0005554256020730008))
</code></pre></div></div>

<p>The first thing to notice is that the results are floats which, when rounded, give the optimum solution. Second, the function reached the optimum solution in every case except when length was initialized to a value <code class="language-plaintext highlighter-rouge">&gt;=half</code>.</p>

<p>The thing that got reinforced here was that <strong>what you initialize your variable to, matters a lot</strong>. For some values, it’ll converge; for some, it’ll converge but slowly; for some it’ll converge, but the solution will not be the desired one; and for some cases it might not even reach a solution and get stuck in an infinite loop. These differences show up more in the visualizations below.</p>
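<p>To compare initializations without risking an endless loop, here is a minimal sketch of the same loop with a cap on the number of steps. The names <code class="language-plaintext highlighter-rouge">run_gd</code> and <code class="language-plaintext highlighter-rouge">max_steps</code> are assumptions made for this illustration and are not part of the code above; the update and the stopping rule are meant to mirror <code class="language-plaintext highlighter-rouge">find_optimum_grad_desc</code>.</p>

```python
def run_gd(perimeter, init_len, lr=0.0001, th=0.000001, max_steps=200000):
    """Capped variant of the loop above (hypothetical helper).

    Returns (area, length, breadth, steps, converged).
    """
    half = perimeter / 2
    length = init_len
    breadth = half - length
    a = length * breadth

    for step in range(1, max_steps + 1):
        # the gradient wrt length is -breadth,
        # so the descent step adds lr * breadth
        length = length + lr * breadth
        breadth = half - length
        a_next = length * breadth

        # stop once the area barely changes between two steps
        if abs(a - a_next) < th:
            return a_next, length, breadth, step, True

        a = a_next

    # the cap was hit before the area stopped changing
    return a, length, breadth, max_steps, False


if __name__ == "__main__":
    for init_len in (0, -100, 6, 18):
        area, l, b, steps, ok = run_gd(36, init_len)
        print(init_len, round(area, 4), round(l, 4), round(b, 4), steps, ok)
```

<p>For a perimeter of 36, the run from 6 should need the fewest steps and the run from -100 the most, while the run from 18 stops on the very first step with an area of 0 because the gradient is zero there.</p>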

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_23_0.png" alt="png" /></p>

<p>These plots are from the data when <code class="language-plaintext highlighter-rouge">length</code> was initialized to <code class="language-plaintext highlighter-rouge">0</code> (making breadth <code class="language-plaintext highlighter-rouge">18</code>). According to the above graphs, <code class="language-plaintext highlighter-rouge">length</code> moved from <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">9</code> during which <code class="language-plaintext highlighter-rouge">area</code> went from <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">81</code>. Note that in the third graph, the actual flow of line was backward as <code class="language-plaintext highlighter-rouge">breadth</code> went from <code class="language-plaintext highlighter-rouge">18</code> to <code class="language-plaintext highlighter-rouge">9</code> making the area go from <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">81</code>.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_25_0.png" alt="png" /></p>

<p>The <code class="language-plaintext highlighter-rouge">length</code> was initialized to <code class="language-plaintext highlighter-rouge">-100</code> in this case, which made the <code class="language-plaintext highlighter-rouge">breadth</code> <code class="language-plaintext highlighter-rouge">118</code>. From the 2nd and 3rd plots it is clear that GD had to take a lot of steps before reaching the optimal solution.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_27_0.png" alt="png" /></p>

<p>In this case, I initialized the <code class="language-plaintext highlighter-rouge">length</code> to 1/6th of the perimeter, which was <code class="language-plaintext highlighter-rouge">6</code>. This put it closer to the optimal solution of 9, so GD needed fewer steps to reach the minimum.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_29_0.png" alt="png" /></p>

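<p>Here the length was initialized to exactly half of the perimeter, <code class="language-plaintext highlighter-rouge">18</code>, which made the <code class="language-plaintext highlighter-rouge">breadth</code> <code class="language-plaintext highlighter-rouge">0</code>. The gradient <code class="language-plaintext highlighter-rouge">-breadth</code> was therefore zero and, as can be seen in the plots, GD didn’t move at all. It stopped immediately at an area of <code class="language-plaintext highlighter-rouge">0</code>, which is not the optimal solution.</p>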

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_31_0.png" alt="png" /></p>

<p>I initialized <code class="language-plaintext highlighter-rouge">length</code> to <code class="language-plaintext highlighter-rouge">100</code> here, which made the <code class="language-plaintext highlighter-rouge">breadth</code> <code class="language-plaintext highlighter-rouge">-82</code>. The first two plots run backward, as length went from <code class="language-plaintext highlighter-rouge">100</code> to <code class="language-plaintext highlighter-rouge">18</code> and the area with it from <code class="language-plaintext highlighter-rouge">-8200</code> (that is, 100 * -82) to about <code class="language-plaintext highlighter-rouge">0</code>. As can be seen, GD converged to an unwanted, wrong result.</p>

<p>All of the above visualizations make it clear how important weight initialization is. It is one of the factors that governs the efficiency of the whole optimization process.</p>

<h3 id="rectange---gradient-descent-with-legrange-multiplier">Rectangle - Gradient descent with Lagrange Multiplier</h3>

<p>In the last section, I described the process where I optimized for only a single variable, which was possible only because of a trick. In this section, I discuss how I optimized for both variables - <code class="language-plaintext highlighter-rouge">length</code> and <code class="language-plaintext highlighter-rouge">breadth</code> - using Gradient Descent. In trying to do so, I had to learn about another trick - the <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multiplier</a>. Let me explain why I needed to learn this new technique to make GD work for the answer.</p>

<p>I started the same way as before:</p>

<ul>
  <li>Consider the area of a rectangle;</li>
  <li>Convert it into a minimization problem which can be solved by gradient descent by finding the minima;</li>
  <li>Find the partial derivatives wrt length and breadth;</li>
  <li>Implement the gradient descent step using the \(\alpha\);</li>
  <li>Arrive at the minima.</li>
</ul>

<p>Everything is correct here except I didn’t take into account that the values of length and breadth can only lie in a fixed range. This fixed range is governed by the perimeter of the rectangle. We didn’t face this problem in the last section because we used the perimeter to arrive at the value of the breadth as we changed the length. So the <em>constraint</em> of length and breadth being within permissible limits was inherently satisfied. Here, we didn’t set any constraints, because of which Gradient Descent just couldn’t reach the solution. And that is what I observed when I implemented the above algo. It always ran indefinitely, no matter what the initialization was.</p>
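<p>A minimal sketch of what such an unconstrained update does, assuming the objective being minimized is simply the negative of the area, <code class="language-plaintext highlighter-rouge">-length * breadth</code>. That objective, the function name <code class="language-plaintext highlighter-rouge">unconstrained_gd</code> and the fixed number of steps are assumptions for illustration only.</p>

```python
def unconstrained_gd(length, breadth, lr=0.01, steps=1000):
    """Gradient descent on -length * breadth with no perimeter constraint."""
    areas = []
    for _ in range(steps):
        # d(-l*b)/dl = -b and d(-l*b)/db = -l
        grad_l, grad_b = -breadth, -length
        length, breadth = length - lr * grad_l, breadth - lr * grad_b
        areas.append(length * breadth)
    return length, breadth, areas


if __name__ == "__main__":
    l, b, areas = unconstrained_gd(1.0, 1.0)
    print(areas[0], areas[len(areas) // 2], areas[-1])
```

<p>Starting from <code class="language-plaintext highlighter-rouge">(1, 1)</code>, both sides grow by a factor of <code class="language-plaintext highlighter-rouge">1 + lr</code> on every step, so the area keeps increasing and a stopping rule based on a small change in area is never met.</p>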

<p>Let’s understand why this was happening. For the perimeter of <code class="language-plaintext highlighter-rouge">36</code>, both length and breadth should be <code class="language-plaintext highlighter-rouge">9</code> for the area to be maximum. I made a 3D plot of \(x*y\) (area of a rectangle) when values of <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> range from <code class="language-plaintext highlighter-rouge">-18</code> to <code class="language-plaintext highlighter-rouge">18</code>.</p>

<video muted="" controls="">
    <source src="https://trigonaminima.github.io/assets/2019-12/area_rect.mp4" type="video/mp4" />
</video>

<p>The z-axis represents the area, and the x and y-axes represent the length and breadth, respectively. As you increase the length and breadth, the area increases and it’ll keep going up. Play the video to look at the surface from different angles to understand how that’s happening. So if there are no constraints on length and breadth, the surface keeps going up, making Gradient Descent look for the maxima indefinitely. Another interesting thing to note is the point <code class="language-plaintext highlighter-rouge">(0, 0, 0)</code>: it’s a relative maxima if you look at the surface from one direction and a relative minima when seen from another. Such points are called <a href="https://en.wikipedia.org/wiki/Saddle_point"><em>saddle points</em></a>. This will come up in a bit.</p>

<p>Let’s look at the same plot with the constraint that both length and breadth have to lie between <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">9</code>.</p>
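<p>The saddle at the origin can be checked numerically with a small sketch: along the line <code class="language-plaintext highlighter-rouge">y = x</code> the area \(x*y\) behaves like \(x^2\), which has a minimum at 0, while along <code class="language-plaintext highlighter-rouge">y = -x</code> it behaves like \(-x^2\), which has a maximum at 0. The helper name <code class="language-plaintext highlighter-rouge">slice_values</code> is made up for this illustration.</p>

```python
def area(x, y):
    """x * y"""
    return x * y


def slice_values(direction, ts=(-2, -1, 0, 1, 2)):
    """Area along the line (t * dx, t * dy) through the origin."""
    dx, dy = direction
    return [area(t * dx, t * dy) for t in ts]


if __name__ == "__main__":
    print("along y = x :", slice_values((1, 1)))    # [4, 1, 0, 1, 4]
    print("along y = -x:", slice_values((1, -1)))   # [-4, -1, 0, -1, -4]
```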

<video muted="" controls="">
    <source src="https://trigonaminima.github.io/assets/2019-12/area_rect2.mp4" type="video/mp4" />
</video>

<p>With this constrained view of the curve, it’s clear that the maxima of the area function (at <code class="language-plaintext highlighter-rouge">81</code> on z-axis) occurs when both length and breadth are equal to <code class="language-plaintext highlighter-rouge">9</code>. So Gradient Descent needs to take the constraint into account when optimizing.</p>

<p>Now to achieve this, I tried a lot of things. While doing the gradient descent step, I took the modulus of the update by 9. That didn’t work. I added one more condition in the loop which ensured that the sum of length and breadth was equal to half of the perimeter. This also produced wrong results. Then a friend suggested using <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multipliers</a>, as the technique is made to handle constrained optimization problems. And now comes a brief explanation of Lagrange multipliers. This <a href="https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-single-constraint">Khan Academy: Lagrange multipliers, introduction</a> is very well written. I’ll briefly go over it for my own gain.</p>

<p>The technique of Lagrange Multipliers is used to find the maxima/minima of a function when there is some constraint applied to it. It basically converts the constrained problem into an unconstrained one, which allows us to apply the derivative test to find the possible solutions (also called stationary points). The method of Lagrange Multipliers works as follows:</p>

<ol>
  <li>We have a function, \(f(x)\) subject to some equality constraint, \(g(x) = 0\);</li>
  <li>We define a new function called the Lagrangian function as, \(\mathcal{L}(x, \lambda) = f(x) - \lambda g(x)\);</li>
  <li>\(\lambda\) is another parameter that we are going to optimize for;</li>
  <li>Find the solutions (stationary points) of the \(\mathcal{L}\) function by using the derivative test;</li>
  <li>Plug in all the solutions obtained in 4th step in \(f(x)\) and you’ll get max and min values possible.</li>
</ol>
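<p>As a small worked example of these steps, take \(f(x, y) = x*y\) with the constraint \(g(x, y) = 2*(x + y) - P = 0\), where \(P\) is a given perimeter. The symbols \(x\), \(y\) and \(P\) are placeholders chosen for this sketch. Setting every partial derivative of \(\mathcal{L}\) to zero gives:</p>

\[\begin{alignat}{2}
      \mathcal{L}(x, y, \lambda) &amp;= x*y - \lambda (2*(x + y) - P)\\
      \frac{\partial \mathcal{L}}{\partial x} &amp;= y - 2\lambda = 0\\
      \frac{\partial \mathcal{L}}{\partial y} &amp;= x - 2\lambda = 0\\
      \frac{\partial \mathcal{L}}{\partial \lambda} &amp;= P - 2*(x + y) = 0
  \end{alignat}\]

<p>The first two equations give \(x = y = 2\lambda\), and substituting that into the third gives \(x = y = P/4\) with \(\lambda = P/8\). For \(P = 36\) that is \(x = y = 9\), \(\lambda = 4.5\) and \(f = 81\).</p>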

<p>This would all be fine if I were solving the equations analytically. But I cannot directly use Gradient Descent, which is a numerical optimization method, to solve this. The reason is that the maxima of \(f(x)\) becomes a saddle point of \(\mathcal{L}\) rather than a local extremum, and Gradient Descent only finds local minima. So I will have to turn it into a minimization problem. According to this <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier#Example_4:_Numerical_optimization">example</a> from the wiki page, if I minimize the square of the magnitude of the gradient of the Lagrangian, then I’ve turned it into a minimization problem. I don’t understand why this works, as this is where I lost my patience. Here is the set of equations I’ll need to find the solution through GD:</p>

<ul>
  <li>
    <p>Area of the rectangle is given by:</p>

\[area(length, breadth) = length * breadth ;\]
  </li>
  <li>
    <p>Area is subjected to the following constraint:</p>

\[\begin{alignat}{2}
      &amp;&amp;2*(length + breadth)
      &amp;= perimeter\\
      \Leftrightarrow\quad
      &amp;&amp;2*(length + breadth) - perimeter
      &amp;=0
  \end{alignat}\]

    <p>So \(g(x)\) can be written as,</p>

\[g(length, breadth) = 2*(length + breadth) - perimeter\]
  </li>
  <li>
    <p>The Lagrangian equation from the \(f(x)\) and \(g(x)\) will be as follows:</p>

\[\begin{alignat}{2}
      \mathcal{L}(length, breadth, \lambda)
      &amp;= area(length, breadth) - \lambda g(length, breadth)\\
      &amp;=length * breadth - \lambda (2*(length + breadth) - perimeter)
      \end{alignat}\]
  </li>
  <li>
    <p>To convert the Lagrangian into a minimization problem I’ll need gradients. Following are the derivatives of the Lagrangian wrt length, breadth and \(\lambda\):</p>

\[\begin{alignat}{2}
      \frac{\partial \mathcal{L}}{\partial length} &amp;= breadth - 2\lambda\\
      \frac{\partial \mathcal{L}}{\partial breadth} &amp;= length - 2\lambda\\
      \frac{\partial \mathcal{L}}{\partial \lambda} &amp;= perimeter - 2*(length + breadth)
  \end{alignat}\]
  </li>
  <li>
    <p>Now the eq. for the step I didn’t understand - the square of the magnitude of the gradient of the Lagrangian - which I’ll call \(h(x)\):</p>

\[\begin{alignat}{2}
      h(length, breadth, \lambda) &amp;= \left( \frac{\partial \mathcal{L}}{\partial length} \right)^2 +
                                     \left(\frac{\partial \mathcal{L}}{\partial breadth}\right)^2 +
                                     \left(\frac{\partial \mathcal{L}}{\partial \lambda}\right)^2\\
                                  &amp;= (breadth - 2\lambda)^2 + (length - 2\lambda)^2 +\\&amp;\quad(perimeter - 2*(length + breadth))^2
  \end{alignat}\]
  </li>
  <li>
    <p>For Gradient Descent step, I need to calculate the gradients of the \(h(x)\):</p>

\[\begin{alignat}{2}
      \frac{\partial h}{\partial length} &amp;= 0 + 2*(length-2\lambda) + 2*(perimeter - 2*length - 2*breadth)*(-2)\\
                                         &amp;= 2*(length-2\lambda) - 4*(perimeter - 2*length - 2*breadth)\\\\
      \frac{\partial h}{\partial breadth} &amp;= 2*(breadth-2\lambda) + 0 + 2*(perimeter - 2*length - 2*breadth)*(-2)\\
                                          &amp;= 2*(breadth-2\lambda) - 4*(perimeter - 2*length - 2*breadth)\\\\
      \frac{\partial h}{\partial \lambda} &amp;= 2*(breadth - 2\lambda)*(-2) + 2*(length - 2\lambda)*(-2) + 0\\
                                          &amp;= -4*(breadth-2\lambda) - 4*(length-2\lambda)
  \end{alignat}\]
  </li>
</ul>
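<p>Since sign slips are easy to make in chain-rule derivatives like these, a quick finite-difference check is a cheap safeguard. The sketch below is standalone and uses short, made-up names (<code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">grad_h</code>, <code class="language-plaintext highlighter-rouge">numeric_grad</code>); it compares the gradient of \(h\), written out term by term with the chain rule, against central differences.</p>

```python
def h(l, b, lam, peri):
    """Squared magnitude of the gradient of the Lagrangian."""
    g1 = b - 2 * lam           # dL/dlength
    g2 = l - 2 * lam           # dL/dbreadth
    g3 = peri - 2 * (l + b)    # dL/dlambda
    return g1 * g1 + g2 * g2 + g3 * g3


def grad_h(l, b, lam, peri):
    """Analytic gradient of h via the chain rule."""
    g1 = b - 2 * lam
    g2 = l - 2 * lam
    g3 = peri - 2 * (l + b)
    dh_dl = 2 * g2 + 2 * g3 * (-2)           # g1 does not depend on l
    dh_db = 2 * g1 + 2 * g3 * (-2)           # g2 does not depend on b
    dh_dlam = 2 * g1 * (-2) + 2 * g2 * (-2)  # g3 does not depend on lambda
    return dh_dl, dh_db, dh_dlam


def numeric_grad(l, b, lam, peri, eps=1e-5):
    """Central-difference approximation of the gradient of h."""
    dl = (h(l + eps, b, lam, peri) - h(l - eps, b, lam, peri)) / (2 * eps)
    db = (h(l, b + eps, lam, peri) - h(l, b - eps, lam, peri)) / (2 * eps)
    dlam = (h(l, b, lam + eps, peri) - h(l, b, lam - eps, peri)) / (2 * eps)
    return dl, db, dlam


if __name__ == "__main__":
    point = (3.0, 5.0, 1.0, 36.0)
    print("analytic:", grad_h(*point))
    print("numeric :", numeric_grad(*point))
    print("h at (9, 9, 4.5):", h(9.0, 9.0, 4.5, 36.0))
```

<p>Note that \(h\) is never negative and is zero exactly where all three partial derivatives of the Lagrangian vanish. So for a perimeter of 36 the stationary point \((9, 9, 4.5)\) is a minimum of \(h\), even though it is a saddle point of \(\mathcal{L}\).</p>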

<p>Using these above gradient equations, I implemented Gradient Descent steps and the final Gradient Descent loop. Below is the coded version of the above equations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""l*b"""</span>
    <span class="k">return</span> <span class="n">length</span> <span class="o">*</span> <span class="n">breadth</span>


<span class="k">def</span> <span class="nf">grad_area_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""b"""</span>
    <span class="k">return</span> <span class="n">breadth</span>


<span class="k">def</span> <span class="nf">grad_area_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""l"""</span>
    <span class="k">return</span> <span class="n">length</span>


<span class="k">def</span> <span class="nf">perimeter</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""2*(l+b)"""</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">length</span> <span class="o">+</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_peri</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">):</span>
    <span class="s">"""2"""</span>
    <span class="k">return</span> <span class="mi">2</span>


<span class="k">def</span> <span class="nf">expected_optimum</span><span class="p">(</span><span class="n">perimeter</span><span class="p">):</span>
    <span class="n">side</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">4</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">),</span> <span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">legrangian</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""l*b-lambda*(2l+2b-perimeter)"""</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span> <span class="o">-</span> <span class="n">lambd</span> <span class="o">*</span> <span class="p">(</span><span class="n">perimeter</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span> <span class="o">-</span> <span class="n">peri</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_leg_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">):</span>
    <span class="s">"""b-lambda*2"""</span>
    <span class="k">return</span> <span class="n">grad_area_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span> <span class="o">-</span> <span class="n">lambd</span> <span class="o">*</span> <span class="n">grad_peri</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_leg_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">):</span>
    <span class="s">"""l-lambda*2"""</span>
    <span class="k">return</span> <span class="n">grad_area_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span> <span class="o">-</span> <span class="n">lambd</span> <span class="o">*</span> <span class="n">grad_peri</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_leg_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""perimeter - 2l - 2b"""</span>
    <span class="k">return</span> <span class="n">peri</span> <span class="o">-</span> <span class="n">perimeter</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">leg_grad_sq_magnitude</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""(b-lambda*2)^2 + (l-lambda*2)^2 + (perimeter - 2l - 2b)^2"""</span>
    <span class="n">g1</span> <span class="o">=</span> <span class="n">grad_leg_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="n">g2</span> <span class="o">=</span> <span class="n">grad_leg_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="n">g3</span> <span class="o">=</span> <span class="n">grad_leg_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">g1</span> <span class="o">*</span> <span class="n">g1</span> <span class="o">+</span> <span class="n">g2</span> <span class="o">*</span> <span class="n">g2</span> <span class="o">+</span> <span class="n">g3</span> <span class="o">*</span> <span class="n">g3</span>


<span class="k">def</span> <span class="nf">grad_h_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""2*(l-lambda*2) - 4*(perimeter - 2l - 2b)"""</span>
    <span class="n">part1</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">grad_leg_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="c1"># chain rule on (perimeter - 2l - 2b)^2: 2 * g3 * (-2) = -4 * g3
</span>    <span class="n">part2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span> <span class="o">*</span> <span class="n">grad_leg_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">part1</span> <span class="o">+</span> <span class="n">part2</span>


<span class="k">def</span> <span class="nf">grad_h_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""2*(b-lambda*2) - 4*(perimeter - 2l - 2b)"""</span>
    <span class="n">part1</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">grad_leg_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="c1"># chain rule on (perimeter - 2l - 2b)^2: 2 * g3 * (-2) = -4 * g3
</span>    <span class="n">part2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span> <span class="o">*</span> <span class="n">grad_leg_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">part1</span> <span class="o">+</span> <span class="n">part2</span>


<span class="k">def</span> <span class="nf">grad_h_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">):</span>
    <span class="s">"""2*(b-lambda*2)*-2 + 2*(l-lambda*2)*-2"""</span>
    <span class="n">part1</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span> <span class="o">*</span> <span class="n">grad_leg_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="n">part2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span> <span class="o">*</span> <span class="n">grad_leg_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">part1</span> <span class="o">+</span> <span class="n">part2</span>


<span class="k">def</span> <span class="nf">grad_desc_step_l</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">length</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">grad_h_wrt_length</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_desc_step_b</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">breadth</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">grad_h_wrt_breadth</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">grad_desc_step_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">lambd</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">grad_h_wrt_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">find_optimum_grad_desc</span><span class="p">(</span><span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">,</span> <span class="n">th</span><span class="p">,</span> <span class="n">inits</span><span class="p">):</span>
    <span class="n">half</span> <span class="o">=</span> <span class="n">peri</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span> <span class="o">=</span> <span class="n">inits</span>

    <span class="k">while</span> <span class="nb">abs</span><span class="p">(</span><span class="n">leg_grad_sq_magnitude</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">))</span> <span class="o">&gt;</span> <span class="n">th</span><span class="p">:</span>
        <span class="n">length_nw</span> <span class="o">=</span> <span class="n">grad_desc_step_l</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">)</span>
        <span class="n">breadth_nw</span> <span class="o">=</span> <span class="n">grad_desc_step_b</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">)</span>
        <span class="n">lambd_nw</span> <span class="o">=</span> <span class="n">grad_desc_step_lambd</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">,</span> <span class="n">peri</span><span class="p">,</span> <span class="n">lr</span><span class="p">)</span>
        <span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span> <span class="o">=</span> <span class="n">length_nw</span><span class="p">,</span> <span class="n">breadth_nw</span><span class="p">,</span> <span class="n">lambd_nw</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">),</span> <span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">lambd</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">init_all</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
    <span class="n">half</span> <span class="o">=</span> <span class="n">p</span><span class="o">/</span><span class="mi">2</span>
    <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="n">half</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">half</span><span class="p">,</span> <span class="mi">5</span><span class="o">*</span><span class="n">half</span><span class="p">]</span>
    <span class="n">lamb</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">.</span><span class="n">product</span><span class="p">(</span><span class="n">vals</span><span class="p">,</span> <span class="n">vals</span><span class="p">,</span> <span class="n">lamb</span><span class="p">)</span>


<span class="n">p</span> <span class="o">=</span> <span class="mi">36</span>
<span class="n">times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Expected</span><span class="se">\t</span><span class="s">&gt; "</span><span class="p">,</span> <span class="n">expected_optimum</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">init_all</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="n">a</span><span class="p">,</span> <span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">lamb</span><span class="p">)</span> <span class="o">=</span> <span class="n">find_optimum_grad_desc</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
    <span class="n">diff</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t</span>
    <span class="n">times</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">diff</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
    <span class="k">if</span> <span class="nb">round</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">9</span> <span class="ow">or</span> <span class="nb">round</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">9</span> <span class="ow">or</span> <span class="nb">round</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">81</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Gradient Desc </span><span class="se">\t</span><span class="s">&gt; l=</span><span class="si">{</span><span class="n">l</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">;</span><span class="se">\t</span><span class="s">b=</span><span class="si">{</span><span class="n">b</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">;</span><span class="se">\t</span><span class="s">     a=</span><span class="si">{</span><span class="n">a</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">;</span><span class="se">\t</span><span class="s">  time taken=</span><span class="si">{</span><span class="n">diff</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Gradient Desc </span><span class="se">\t</span><span class="s">&gt; l=</span><span class="si">{</span><span class="n">l</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">;</span><span class="se">\t</span><span class="s">b=</span><span class="si">{</span><span class="n">b</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">;</span><span class="se">\t</span><span class="s">     a=</span><span class="si">{</span><span class="n">a</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">times</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="nb">max</span><span class="p">(</span><span class="n">times</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expected	&gt;  (81.0, (9.0, 9.0))
Gradient Desc 	&gt; l=9.0;	b=9.0;	     a=81.0
(0.027690649032592773, (9.0, 9.0, 0)) (0.14217901229858398, (90.0, -100, -100))
</code></pre></div></div>

<p>Surprisingly, for every initialization of length, breadth and lambda, this method gave the optimal answer - length is <code class="language-plaintext highlighter-rouge">9</code>, breadth is <code class="language-plaintext highlighter-rouge">9</code> and area is <code class="language-plaintext highlighter-rouge">81</code>. The last line prints the initializations that gave the minimum and maximum running times. Both results are intuitive. When length and breadth are both initialized to 9, the search already starts at the maximum in length and breadth, so few iterations are needed to reach the solution. When length and breadth are initialized to 90 and -100, the search starts far from the maximum and needs more steps (and hence more time) to get there. Below are some visualizations recorded while training.</p>

<p>Initialization: length=9; breadth=9; \(\lambda\)=0</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_45_0.png" alt="png" /></p>

<p>Initialization: length=90; breadth=-100; \(\lambda\)=-100</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_46_0.png" alt="png" /></p>

<p>Initialization: length=18; breadth=0; \(\lambda\)=-100</p>

<p><img src="https://trigonaminima.github.io/assets/2019-12/output_47_0.png" alt="png" /></p>

<p>These are some interesting looking plots when length, breadth and lambda were initialized to different values. I leave interpretation of these graphs to the readers.</p>
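<p>A short worked check of why every initialization ends at the same point, using the three Lagrangian gradients named in the docstrings above (this derivation is an addition, not part of the original experiment). \(h\) is a sum of squares of three expressions that are linear in \(l\), \(b\) and \(\lambda\), so it is a convex quadratic whose only minimum is where all three are zero:</p>

\[\begin{alignat}{2}
     b - 2\lambda &amp;= 0\\
     l - 2\lambda &amp;= 0\\
     perimeter - 2l - 2b &amp;= 0
 \end{alignat}\]

<p>The first two equations give \(l = b = 2\lambda\). Substituting into the third with a perimeter of 36 gives \(8\lambda = 36\), so \(\lambda = 4.5\) and \(l = b = 9\).</p>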

<h2 id="parallelogram">Parallelogram</h2>

<p>A parallelogram is any quadrilateral whose opposite sides are equal and parallel. Since opposite sides are equal, opposite angles are also equal. And because opposite angles are equal, any two adjacent angles sum to 180 degrees. This way, a rectangle is also a parallelogram - opposite sides are equal and parallel, opposite angles are 90 degrees, and adjacent angles sum to 180 degrees.</p>

<p>Area of a parallelogram depends on three things - two adjacent sides and the angle between them. We usually write <code class="language-plaintext highlighter-rouge">base*height</code>, but to calculate the height you need the angle and the other side. These two articles should be enough to show you why the following formula makes sense - <a href="https://www.khanacademy.org/math/basic-geo/basic-geo-area-and-perimeter/parallelogram-area/a/area-of-parallelogram">Area of parallelograms</a> and <a href="https://en.wikipedia.org/wiki/Parallelogram#Area_formula">Area Formula</a>.</p>

\[area(base_1, base_2, \theta) = base_1 * base_2 * sin(\theta)\]

<p>Perimeter is same as before - sum up all the sides.</p>

\[perimeter(base_1, base_2) = 2*(base_1 + base_2)\]

<p>So now I gotta optimize for 3 variables - \(base_1\), \(base_2\) and \(\theta\).</p>
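<p>As a sanity check on what the answer should be (a short derivation added here, assuming \(0 \le \theta \le 180\) degrees and \(0 \le base_1 \le perimeter/2\)): substituting the perimeter constraint for \(base_2\) leaves two free variables, and each factor can be bounded on its own.</p>

\[area = base_1 * (perimeter/2 - base_1) * sin(\theta) \le base_1 * (perimeter/2 - base_1) \le (perimeter/4)^2\]

<p>The first bound is tight when \(\theta = 90\) and the second when \(base_1 = base_2 = perimeter/4\). For a perimeter of 36 that is a square of side 9 with area 81.</p>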

<h3 id="parallelogram---brute-force">Parallelogram - Brute Force</h3>

<p>In the brute force solution, I do the same thing as was done for the rectangle: iterate through all combinations of length, breadth and angle to find the one with the greatest area. I used the same optimization where, for each value of <code class="language-plaintext highlighter-rouge">length</code>, I calculated the <code class="language-plaintext highlighter-rouge">breadth</code> from the formula of the <code class="language-plaintext highlighter-rouge">perimeter</code>.</p>

\[breadth = (perimeter * 0.5) - length\]

<p>Here’s the code</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">area</span><span class="p">(</span><span class="n">base1</span><span class="p">,</span> <span class="n">base2</span><span class="p">,</span> <span class="n">base1_base2_angle</span><span class="o">=</span><span class="mi">90</span><span class="p">):</span>
    <span class="s">"""b1 * b2 * sin(theta)"""</span>
    <span class="n">height</span> <span class="o">=</span> <span class="n">base2</span> <span class="o">*</span> <span class="n">math</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">base1_base2_angle</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">base1</span> <span class="o">*</span> <span class="n">height</span>


<span class="k">def</span> <span class="nf">perimeter</span><span class="p">(</span><span class="n">base1</span><span class="p">,</span> <span class="n">base2</span><span class="p">):</span>
    <span class="s">"""2 * b1 + 2 * b2"""</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">base1</span> <span class="o">+</span> <span class="n">base2</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">expected_optimum</span><span class="p">(</span><span class="n">perimeter</span><span class="p">):</span>
    <span class="s">"""
    Returns the dimensions - area and sides - of the square
    for the given perimeter.
    """</span>
    <span class="n">side</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">4</span>
    <span class="k">return</span> <span class="n">area</span><span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">),</span> <span class="p">(</span><span class="n">side</span><span class="p">,</span> <span class="n">side</span><span class="p">,</span> <span class="mi">90</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">find_optimum_brute_force</span><span class="p">(</span><span class="n">perimeter</span><span class="p">,</span> <span class="n">plot_flag</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="s">"""
    Finds the dimensions of the parallelogram which have the
    maximum area for a given perimeter.

    It uses brute force to find the dimensions.
    """</span>
    <span class="n">half</span> <span class="o">=</span> <span class="n">perimeter</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="n">step_size</span> <span class="o">=</span> <span class="mf">0.1</span>

    <span class="n">max_area</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">config</span> <span class="o">=</span> <span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">half</span><span class="p">,</span> <span class="n">step_size</span><span class="p">):</span>
        <span class="n">length</span> <span class="o">=</span> <span class="n">i</span>
        <span class="n">breadth</span> <span class="o">=</span> <span class="n">half</span> <span class="o">-</span> <span class="n">length</span>

        <span class="k">for</span> <span class="n">angle</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">90.5</span><span class="p">,</span> <span class="n">step_size</span><span class="p">):</span>

            <span class="n">a</span> <span class="o">=</span> <span class="n">area</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">angle</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">a</span> <span class="o">&gt;</span> <span class="n">max_area</span><span class="p">:</span>
                <span class="n">max_area</span> <span class="o">=</span> <span class="n">a</span>
                <span class="n">config</span> <span class="o">=</span> <span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">breadth</span><span class="p">,</span> <span class="n">angle</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">max_area</span><span class="p">,</span> <span class="n">config</span>


<span class="n">p</span> <span class="o">=</span> <span class="mi">36</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Expected</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">expected_optimum</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Brute Force</span><span class="se">\t</span><span class="s">: "</span><span class="p">,</span> <span class="n">find_optimum_brute_force</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expected	:  (81.0, (9.0, 9.0, 90))
Brute Force	:  (81.0, (9.0, 9.0, 90.0))
</code></pre></div></div>

<p>So along with the sides being <code class="language-plaintext highlighter-rouge">9 units</code>, the function also needs to find that the angle is <code class="language-plaintext highlighter-rouge">90 degrees</code>. Which it did.</p>

<h3 id="parallelogram---gradient-descent-with-legrangian">Parallelogram - Gradient descent with Lagrangian</h3>

<p>This is where I couldn’t solve it using GD. I used the exact same procedure as I did in the case of rectangle, but it didn’t work. I’ll reproduce the equations here-</p>

<ul>
  <li>
    <p>Area of the parallelogram is given by:</p>

\[area(base_1, base_2, \theta) = base_1 * base_2 * sin(\theta)\]
  </li>
  <li>
    <p>Area is subject to the following constraints:</p>

    <ol>
      <li>
        <p>Perimeter constraint</p>

\[\begin{alignat}{2}
     &amp;&amp;2*(base_1 + base_2)
     &amp;= perimeter\\
     \Leftrightarrow\quad
     &amp;&amp;2*(base_1 + base_2) - perimeter
     &amp;=0
 \end{alignat}\]

        <p>So \(g_1(x)\) can be written as,</p>

\[g_1(base_1, base_2) = 2*(base_1 + base_2) - perimeter\]
      </li>
      <li>
        <p>Angle constraint</p>

\[\begin{alignat}{2}
     &amp;&amp;\theta
     &amp;\le 180\\
     \Leftrightarrow\quad
     &amp;&amp;\theta - 180
     &amp;\le 0
 \end{alignat}\]

        <p>So \(g_2(x)\) can be written as,</p>

\[g_2(\theta) = \theta - 180\]
      </li>
    </ol>
  </li>
  <li>
    <p>The Lagrangian built from \(f(x)\), \(g_1(x)\) and \(g_2(x)\) will be as follows:</p>

\[\begin{alignat}{2}
      \mathcal{L}(base_1, base_2, \theta, \lambda_1, \lambda_2)
      &amp;= area(base_1, base_2, \theta) - \lambda_1 g_1(base_1, base_2) - \lambda_2 g_2(\theta)\\
      &amp;= base_1 * base_2 * sin(\theta) - \lambda_1 (2*(base_1 + base_2) - perimeter)\\&amp;\quad- \lambda_2 (\theta - 180)
      \end{alignat}\]
  </li>
  <li>
    <p>To convert the Lagrangian into a minimization problem I’ll need gradients. Following are the derivatives of the Lagrangian wrt \(base_1\), \(base_2\), \(\theta\), \(\lambda_1\) and \(\lambda_2\):</p>

\[\begin{alignat}{2}
      \frac{\partial \mathcal{L}}{\partial base_1} &amp;= base_2 * sin(\theta) - 2\lambda_1\\
      \frac{\partial \mathcal{L}}{\partial base_2} &amp;= base_1 * sin(\theta) - 2\lambda_1\\
      \frac{\partial \mathcal{L}}{\partial \theta} &amp;= base_1 * base_2 * cos(\theta) - \lambda_2\\
      \frac{\partial \mathcal{L}}{\partial \lambda_1} &amp;= -(2*(base_1 + base_2) - perimeter)\\
      \frac{\partial \mathcal{L}}{\partial \lambda_2} &amp;= -(\theta - 180)
  \end{alignat}\]
  </li>
  <li>
    <p>Now the equation for the squared magnitude of the gradient of the Lagrangian - which I’ll call \(h(x)\):</p>

\[\begin{alignat}{2}
      h(base_1, base_2, \theta, \lambda_1, \lambda_2) &amp;= \left( \frac{\partial \mathcal{L}}{\partial base_1} \right)^2 +
                                     \left( \frac{\partial \mathcal{L}}{\partial base_2} \right)^2 +
                                     \left( \frac{\partial \mathcal{L}}{\partial \theta} \right)^2 +
                                     \left( \frac{\partial \mathcal{L}}{\partial \lambda_1} \right)^2 +
                                     \left( \frac{\partial \mathcal{L}}{\partial \lambda_2} \right)^2\\
                                  &amp;= (base_2 * sin(\theta) - 2\lambda_1)^2 + (base_1 * sin(\theta) - 2\lambda_1)^2 +\\
                                  &amp; \quad \enspace (base_1 * base_2 * cos(\theta) - \lambda_2)^2 +\\
                                  &amp; \quad \enspace (2*(base_1 + base_2) - perimeter)^2 +\\
                                  &amp; \quad \enspace (\theta - 180)^2
  \end{alignat}\]
  </li>
  <li>
    <p>For the Gradient Descent step, I need the gradients of \(h(x)\):</p>

\[\begin{alignat}{2}
      \frac{\partial h}{\partial base_1} &amp;= 2 * (base_1 * sin(\theta) - 2\lambda_1) * sin(\theta) +\\
                                          &amp;  \quad \enspace 2 * (base_1 * base_2 * cos(\theta) - \lambda_2) * base_2 * cos(\theta) +\\
                                          &amp;  \quad \enspace 4 * (2*(base_1 + base_2) - perimeter)\\\\
      \frac{\partial h}{\partial base_2} &amp;= 2 * (base_2 * sin(\theta) - 2\lambda_1) * sin(\theta) +\\
                                          &amp;  \quad \enspace 2 * (base_1 * base_2 * cos(\theta) - \lambda_2) * base_1 * cos(\theta) +\\
                                          &amp;  \quad \enspace 4 * (2*(base_1 + base_2) - perimeter)\\\\
      \frac{\partial h}{\partial \theta} &amp;= 2 * (base_2 * sin(\theta) - 2\lambda_1) * base_2 * cos(\theta) +\\
                                          &amp;  \quad \enspace 2 * (base_1 * sin(\theta) - 2\lambda_1) * base_1 * cos(\theta) +\\
                                          &amp;  \quad \enspace 2 * (base_1 * base_2 * cos(\theta) - \lambda_2) * base_1 * base_2 * -1 * sin(\theta) +\\
                                          &amp;  \quad \enspace 2 * (\theta - 180)\\\\
      \frac{\partial h}{\partial \lambda_1} &amp;= -4*(base_2 * sin(\theta) - 2\lambda_1)-4*(base_1 * sin(\theta) - 2\lambda_1)\\\\
      \frac{\partial h}{\partial \lambda_2} &amp;= -2*(base_1 * base_2 * cos(\theta) - \lambda_2)
  \end{alignat}\]
  </li>
</ul>

<p>Using the gradient equations above, I implemented the Gradient Descent steps and the final Gradient Descent loop. Looking at the output, though, it was clear that I did something wrong. The implementation can be found at - <a href="https://github.com/TrigonaMinima/Notebooks/blob/master/Gradient%20Descent%20-%20Maximum%20Area.ipynb">Gradient Descent - Maximum Area</a></p>

<p>At first I thought my equations were wrong. I checked them 3-4 times and couldn’t find any mistakes. If you notice something wrong, then please mention it on GH or Twitter.</p>
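<p>One way to check the algebra without re-reading it is to compare the analytic gradients of \(h\) against central finite differences at an arbitrary point. This is a minimal sketch, not the notebook’s code: the names <code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">grad_h</code>, <code class="language-plaintext highlighter-rouge">numeric_grad_h</code> and <code class="language-plaintext highlighter-rouge">theta_max</code> are made up for illustration, the test point is arbitrary, and it assumes \(\theta\) is in radians (so the angle bound is \(\pi\) instead of 180).</p>

```python
import math


def h(b1, b2, theta, lam1, lam2, peri, theta_max=math.pi):
    """Squared magnitude of the Lagrangian's gradient (theta in radians)."""
    a = b2 * math.sin(theta) - 2 * lam1
    b = b1 * math.sin(theta) - 2 * lam1
    c = b1 * b2 * math.cos(theta) - lam2
    d = 2 * (b1 + b2) - peri
    e = theta - theta_max
    return a * a + b * b + c * c + d * d + e * e


def grad_h(b1, b2, theta, lam1, lam2, peri, theta_max=math.pi):
    """Analytic gradient of h, transcribed from the equations above."""
    s, co = math.sin(theta), math.cos(theta)
    a = b2 * s - 2 * lam1
    b = b1 * s - 2 * lam1
    c = b1 * b2 * co - lam2
    d = 2 * (b1 + b2) - peri
    e = theta - theta_max
    return (
        2 * b * s + 2 * c * b2 * co + 4 * d,                              # wrt base_1
        2 * a * s + 2 * c * b1 * co + 4 * d,                              # wrt base_2
        2 * a * b2 * co + 2 * b * b1 * co - 2 * c * b1 * b2 * s + 2 * e,  # wrt theta
        -4 * a - 4 * b,                                                   # wrt lambda_1
        -2 * c,                                                           # wrt lambda_2
    )


def numeric_grad_h(point, peri, eps=1e-6):
    """Central finite differences of h at `point`, one coordinate at a time."""
    grads = []
    for i in range(len(point)):
        hi = list(point)
        lo = list(point)
        hi[i] += eps
        lo[i] -= eps
        grads.append((h(*hi, peri) - h(*lo, peri)) / (2 * eps))
    return tuple(grads)


# Arbitrary test point: (base_1, base_2, theta, lambda_1, lambda_2)
point = (7.0, 11.0, 1.0, 0.5, -2.0)
analytic = grad_h(*point, 36)
numeric = numeric_grad_h(point, 36)
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print("largest analytic vs numeric gap:", max_gap)
```

<p>If the two agree, the algebra is not where the problem lies. One thing this check does not cover: if \(\theta\) is kept in degrees and converted inside the sine, as the brute-force <code class="language-plaintext highlighter-rouge">area</code> function does, every derivative with respect to \(\theta\) picks up an extra factor of \(\pi/180\) from the chain rule.</p>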

<p>Then I looked at the angle constraint. It’s an inequality, \(\theta &lt;= 180\). Plus, I also have to ensure another condition, \(\theta &gt;= 0\), which I didn’t even consider in the current implementation. The definition of Lagrange multipliers (<a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">first para</a>) talks about equality constraints on the function being maximized, whereas what I have here are inequality constraints. There is a discussion about solving <a href="https://math.stackexchange.com/q/49473/467063">Lagrange multipliers with inequality constraints</a> on Math SE. One answer suggests, I think, solving the problem by taking one boundary of \(\theta\) at a time: first solve for \(\theta = 180\), then for \(\theta = 0\), and later compare the solutions to arrive at the maximum and minimum within the constraints. There is also a mention of the <a href="https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions">KKT Conditions</a>, which seem to generalize Lagrange multipliers to inequality constraints as well.</p>
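<p>A small numeric illustration of why the inequality matters, offered as one plausible reading, not a diagnosis of the notebook. If \(g_2\) is treated as an equality, \(h\) contains the term \((\theta - 180)^2\), which no choice of multipliers can cancel. So \(h\) can reach zero at a flattened parallelogram with \(\theta = 180\) and zero area, but not at the square. The sketch below assumes \(\theta\) in radians (bound \(\pi\)); the helper <code class="language-plaintext highlighter-rouge">h</code> and the sample values are hypothetical.</p>

```python
import math


def h(b1, b2, theta, lam1, lam2, peri, theta_max=math.pi):
    """Squared gradient magnitude of the Lagrangian, with g_2 as an equality."""
    a = b2 * math.sin(theta) - 2 * lam1
    b = b1 * math.sin(theta) - 2 * lam1
    c = b1 * b2 * math.cos(theta) - lam2
    d = 2 * (b1 + b2) - peri
    e = theta - theta_max
    return a * a + b * b + c * c + d * d + e * e


# Flattened parallelogram: theta at the bound, sides 5 and 13 (perimeter 36),
# multipliers picked so that every squared term vanishes.
flat = h(5.0, 13.0, math.pi, 0.0, -5.0 * 13.0, 36)
flat_area = 5.0 * 13.0 * math.sin(math.pi)

# The square, with the multipliers that zero every term they can influence.
square = h(9.0, 9.0, math.pi / 2, 4.5, 0.0, 36)

print("h at flat parallelogram:", flat, "area:", flat_area)
print("h at square:", square)
```

<p>Here <code class="language-plaintext highlighter-rouge">flat</code> is zero up to floating-point error while <code class="language-plaintext highlighter-rouge">square</code> stays at \((\pi/2)^2\), so a descent on this \(h\) has no reason to settle on the square.</p>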

<p>Another direction can be to change the definition of my area. Any two adjacent angles of a parallelogram sum to 180 degrees. So, if I use both angles to compute the area, I can use the equality constraint that the two angles add up to 180 degrees, \(\theta_1 + \theta_2 = 180\). This replaces the inequality constraint with an equality constraint, which lets me use Lagrange multipliers without any modifications. The modified area can be written as,</p>

\[area(base_1, base_2, \theta_1, \theta_2) = 0.5*base_1*base_2*sin(\theta_1)+0.5*base_1*base_2*sin(180-\theta_2)\]

<p>Another thing I found: when optimizing with Lagrange multipliers using Gradient Descent, instead of minimizing the square of the gradient of the Lagrangian, I could have used dual Gradient Descent. That is, minimize \(\mathcal{L}\) over \(x\) with Gradient Descent for some random value of \(\lambda\), and then maximize \(\mathcal{L}\) over \(\lambda\) with Gradient Ascent using the optimized \(x\) just obtained. This method is also called alternating Gradient Descent. I don’t understand exactly why this works, but I think it relates to Lagrangian duality. These articles - <a href="https://medium.com/@jonathan_hui/machine-learning-lagrange-multiplier-dual-decomposition-4afe66158c9">Machine Learning — Lagrange multiplier &amp; Dual decomposition</a> and <a href="https://medium.com/@jonathan_hui/rl-dual-gradient-descent-fac524c1f049">RL — Dual Gradient Descent</a> - touch on this a bit.</p>

<p>I didn’t try any of the above things. All of this is future work.</p>

<h2 id="interesting-links">Interesting Links</h2>

<p>Here are some of the interesting links I found while working on this exercise.</p>

<h3 id="problem">Problem</h3>

<ul>
  <li><a href="https://math.stackexchange.com/q/1082474/467063">Why is there more room in a square room than there is in a rectangular room when the perimeter is the same in both rooms?</a></li>
</ul>

<h3 id="3-d-plotting">3-D Plotting</h3>

<ul>
  <li><a href="https://academo.org/demos/3d-surface-plotter/?expression=x*y&amp;xRange=-18%2C+18&amp;yRange=-18%2C+18&amp;resolution=68">3D Surface Plotter Webapp</a></li>
  <li><a href="https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html">Three-Dimensional Plotting in Matplotlib</a></li>
  <li><a href="https://stackoverflow.com/a/18345457/2650427">Animating a Matplotlib 3D Graph</a></li>
  <li><a href="https://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/">Matplotlib Animation Tutorial</a></li>
  <li>Embedding Matplotlib animations in Jupyter: <a href="https://stackoverflow.com/q/43445103/2650427">SO Question 1</a>, <a href="https://stackoverflow.com/q/35532498/2650427">SO Question 2</a> and <a href="http://louistiao.me/posts/notebooks/embedding-matplotlib-animations-in-jupyter-notebooks/">this link</a></li>
</ul>

<h3 id="legrange-multipliers">Lagrange Multipliers</h3>

<ul>
  <li><a href="https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-single-constraint">Khan Academy: Lagrange multipliers, introduction</a></li>
  <li><a href="https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-examples">Khan Academy: Lagrange multipliers, examples</a></li>
  <li><a href="https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/interpretation-of-lagrange-multipliers">Khan Academy: Interpretation of Lagrange multipliers</a></li>
  <li><a href="https://www.alexirpan.com/2019/07/27/lagrange-multipliers.html">A Lagrange Multipliers Refresher, For Idiots Like Me</a></li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="ML" /><summary type="html"><![CDATA[I’d like to verify using Gradient Descent, that given a perimeter value of a quadrilateral, square is the one with the maximum area. This can be verified/proved using various analytical methods, but my objective here was to verify it using Gradient Descent. You ask why? Because I wanted to do it. In the process, I got more than what I had hoped for. Here are some intuitive explanations of the problem I am trying to verify - More room in a Square Room]]></summary></entry><entry><title type="html">Wikidata for Transliteration Pairs</title><link href="https://trigonaminima.github.io/2019/11/transliteration-wikidata/" rel="alternate" type="text/html" title="Wikidata for Transliteration Pairs" /><published>2019-11-07T00:00:00+00:00</published><updated>2019-11-07T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2019/11/transliteration-wikidata</id><content type="html" xml:base="https://trigonaminima.github.io/2019/11/transliteration-wikidata/"><![CDATA[<p>Many researchers use Wikipedia as a source of data to train NLP models. The public domain nature of the crowd-sourced articles on wide-ranging topics in multiple languages opens up a lot of possibilities for research. There are multiple modes of data extraction from Wikipedia:</p>

<ul>
  <li><a href="https://wiki.dbpedia.org/">DBpedia</a> type data where a community has written tools to extract the structured data of the <a href="https://en.wikipedia.org/wiki/Infobox">infoboxes</a> of the Wikipedia pages;</li>
  <li>Article texts of the Wikipedia articles;</li>
  <li>Metadata of each Wikipedia article, containing the title, description, aliases, titles in many languages, and a lot of other information.</li>
</ul>

<p>This data can be used for multiple purposes. Article text can be used to train word embeddings like <a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a>, <a href="https://nlp.stanford.edu/projects/glove/">Glove</a>, <a href="https://github.com/zalandoresearch/flair">Flair</a> etc. Article text can also be used to train language models like <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>, <a href="https://openai.com/blog/better-language-models/">GPT2</a>, <a href="https://arxiv.org/abs/1801.06146">ULMFiT</a>, <a href="https://github.com/zihangdai/xlnet">XLNet</a>, <a href="http://nlp.fast.ai/">MultiFiT</a> etc. Structured data can be used to do disambiguation, entity recognition, translation, build knowledge graphs and to solve a wide variety of other NLP problems. Metadata can be used for many purposes. Transliteration is one of them. In this article I’ll describe how I was able to quickly create a training dataset for a <a href="https://en.wikipedia.org/wiki/Transliteration">transliteration</a> model. With this process, I extracted more than 87K unique English-Hindi Transliteration pairs (a source string and a target string). To read more about transliteration you can read this <a href="/2018/06/hinglish-and-transliteration/">very short introduction</a> to transliteration of Hindi or Romanized Hindi or Hinglish.</p>

<h1 id="data-source">Data Source</h1>

<p>This <a href="https://www.wikidata.org/wiki/Wikidata:Database_download/en">download page</a> gives you all the information on Wikidata and the different data formats available. I downloaded the Wikidata JSON dump, which is also the recommended format, from the <a href="https://dumps.wikimedia.org/wikidatawiki/entities/">following index page</a>. The snapshot (<code class="language-plaintext highlighter-rouge">latest-all.json.bz2</code>) I downloaded was created on <code class="language-plaintext highlighter-rouge">08-Oct-2019 11:29</code> and took up around 38GB on my HDD.</p>

<p>A few introductory articles about Wikidata and the data dumps which helped me were:</p>

<ul>
  <li><a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">Wikidata Home Page</a></li>
  <li><a href="https://topicseed.com/blog/importing-wikidata-dumps#updating-wikidata-dumps">Importing Wikidata Dumps - The Easy Part</a></li>
</ul>

<h1 id="pre-processing">Pre-processing</h1>

<p>Before I decided to download the JSON dump, I had to see how I could process the data on my system using Python. After reading a bit about the data format, I learnt that the Wikidata JSON dump is just a very big text file where each line is the JSON representation of an item on Wikimedia sites. This text file is compressed (bz2). So my processing logic flow was:</p>

<ul>
  <li>Open the zip file;</li>
  <li>Open the text file inside;</li>
  <li>Read each line of the file;</li>
  <li>Parse the JSON string into a python dictionary;</li>
  <li>Extract the data;</li>
  <li>Save the data.</li>
</ul>
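
<p>The flow above can be sketched in plain Python with only the standard library. This is a minimal sketch and not the script I ended up using: the function names and the <code class="language-plaintext highlighter-rouge">extract</code> callback are hypothetical, and it assumes the dump is one JSON array with <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code> on their own lines and a trailing comma after each entity line.</p>

```python
import bz2
import json


def iter_entities(dump_path):
    """Yield one entity dict per line of a bz2-compressed Wikidata JSON dump."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Assumed layout: "[" and "]" on their own lines, and a trailing
            # comma after every entity line except the last one.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def extract_and_save(dump_path, out_path, extract):
    """Write every string returned by extract(entity) on its own line.

    `extract` is a hypothetical callback that turns one entity dict
    into a list of output rows.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for entity in iter_entities(dump_path):
            for row in extract(entity):
                out.write(row + "\n")
```

<p>Reading line by line keeps memory usage flat, which matters for a dump of this size.</p>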

<p>Since I hadn’t read about the data format in enough detail to judge whether the above process would fail, before writing any code I googled whether a module already existed to process the dump. The <a href="https://www.wikidata.org/wiki/Wikidata:Tools/For_programmers">Tools for Programmers</a> page gave me the name of the Python module that’ll help me with this - <a href="https://github.com/kensho-technologies/qwikidata/">qwikidata</a>. That is what I really like about the Python community: there is always a module to solve your problem.</p>

<p>The format of the output file will be an English string followed by a pipe (|) followed by a Hindi string - <code class="language-plaintext highlighter-rouge">left_string|right_string</code>. The following are three random lines from the final dataset.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adinath|आदिनाथ
adipur|आदिपुर
adipurana|आदिपुराण
</code></pre></div></div>

<p>I’ll interchangeably use the terms <em>en string</em>, <em>left string</em> or <em>source string</em> for the English string. Similarly, <em>hi string</em>, <em>right string</em> or <em>target string</em> for the Hindi string.</p>

<h2 id="extraction">Extraction</h2>

<p>The <a href="https://github.com/kensho-technologies/qwikidata/blob/master/examples/basic_json_dump.py">basic_json_dump.py</a> in the examples folder of the <code class="language-plaintext highlighter-rouge">qwikidata</code> GitHub repo got me started with the processing of the JSON dump. With a quick look at the <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON">data model page</a>, I gathered that the top-level fields - <code class="language-plaintext highlighter-rouge">labels</code>, <code class="language-plaintext highlighter-rouge">descriptions</code> and <code class="language-plaintext highlighter-rouge">aliases</code> - are keyed by language code. In the final script, I used these three fields to extract the transliteration pairs. I didn’t study the <code class="language-plaintext highlighter-rouge">claims</code> field, but I suspect that I missed some pairs there.</p>

<p>The final JSON dump processing code is here - <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/datagen/wikidata2.py">datagen/wikidata2.py</a>. The logic flow is the same as I described earlier, but implemented using the <code class="language-plaintext highlighter-rouge">qwikidata</code> methods.</p>

<ul>
  <li>Open the compressed file and iterate through it, with the help of the <a href="https://qwikidata.readthedocs.io/en/stable/qwikidata.json_dump.html#qwikidata.json_dump.WikidataJsonDump">WikidataJsonDump</a> class, which reads the file and gives you an iterator over its lines;</li>
  <li>Parse the JSON strings using <a href="https://qwikidata.readthedocs.io/en/stable/qwikidata.entity.html#qwikidata.entity.WikidataItem">WikidataItem</a> and <a href="https://qwikidata.readthedocs.io/en/stable/qwikidata.entity.html#qwikidata.entity.WikidataProperty">WikidataProperty</a> classes;</li>
  <li>Extract the English and Hindi versions of <code class="language-plaintext highlighter-rouge">labels</code>, <code class="language-plaintext highlighter-rouge">descriptions</code> and <code class="language-plaintext highlighter-rouge">aliases</code> and make them into pipe (|) separated strings;</li>
  <li>Dump each pair in a file.</li>
</ul>
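
<p>To make the extraction step concrete, here is a minimal sketch that works on a raw entity dict instead of the <code class="language-plaintext highlighter-rouge">qwikidata</code> classes. The function name is hypothetical, and pairing aliases by position is an assumption made only for this sketch; the linked script may handle aliases differently.</p>

```python
def entity_to_pairs(entity, src_lang="en", dst_lang="hi"):
    """Return 'src|dst' strings built from one Wikidata entity dict."""
    pairs = []
    # labels and descriptions hold one {"language", "value"} dict per language.
    for field in ("labels", "descriptions"):
        values = entity.get(field, {})
        if src_lang in values and dst_lang in values:
            src = values[src_lang]["value"]
            dst = values[dst_lang]["value"]
            pairs.append(f"{src}|{dst}")
    # aliases hold a list of such dicts per language. Pairing them by
    # position is an assumption of this sketch, not a documented rule.
    aliases = entity.get("aliases", {})
    src_aliases = aliases.get(src_lang, [])
    dst_aliases = aliases.get(dst_lang, [])
    for src, dst in zip(src_aliases, dst_aliases):
        pairs.append(f"{src['value']}|{dst['value']}")
    return pairs
```

<p>Entities that lack either language for a field simply contribute nothing for that field.</p>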

<p>At the end of this extraction process, I had a ~500MB output text file (let’s call it <code class="language-plaintext highlighter-rouge">pairs.txt</code>) from the 38GB Wikidata JSON dump. Each line contained pipe (|) separated en and hi strings, in the format established at the start of this section.</p>

<p>This <code class="language-plaintext highlighter-rouge">pairs.txt</code> contained a raft of transliteration pairs which was what I needed. I just had to get rid of all the noisy data. Now comes the divide and conquer strategy. Break your problem into small chunks and solve them independently. To create these small subproblems, I had to look into the data.</p>

<p>VS Code took some time to open <code class="language-plaintext highlighter-rouge">pairs.txt</code>. The first thing was to eliminate the completely useless rows. Brace yourself, a lot of regular expressions are going to be introduced now.</p>

<ul>
  <li>
    <p>If the source and target strings are the same, then both strings are in the same script, as with numbers. Another reason can be that there are issues with the data and both strings are in either Roman or Devanagari script. Either way, it is a useless row for us. I replaced all such lines with a blank using this regex - <code class="language-plaintext highlighter-rouge">^(.*)\|\1$\n</code>.</p>
  </li>
  <li>
    <p>If we don’t have any target string for the source string then that line is also useless for us. This again is the issue with the source data. This regex removes the lines having a blank on the right side of the pipe: <code class="language-plaintext highlighter-rouge">^.*\|$\n</code>.</p>
  </li>
  <li>
    <p>The opposite of the previous case will also be invalid for us, that is, the examples where we have the target string and not the source string. I eliminated those rows using this regex: <code class="language-plaintext highlighter-rouge">^\|.*$\n</code>.</p>
  </li>
  <li>
    <p>After this, I removed the rows where both the left and right strings were in Roman script, that is, they were in English. This regex helped with that - <code class="language-plaintext highlighter-rouge">^[a-z \-0-9/\(\)\.]+\|[a-z \-0-9/\(\)\.]+$\n</code>.</p>
  </li>
</ul>
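
<p>I applied these regexes as find-and-replace in the editor, but the same four filters can be sketched in Python. The names below are hypothetical, and the patterns are the ones from the list above minus the trailing <code class="language-plaintext highlighter-rouge">\n</code>, since here each line is checked on its own.</p>

```python
import re

# Same patterns as in the list above, without the trailing "\n".
JUNK_PATTERNS = [
    re.compile(r"^(.*)\|\1$"),   # source and target are identical
    re.compile(r"^.*\|$"),       # empty target string
    re.compile(r"^\|.*$"),       # empty source string
    re.compile(r"^[a-z \-0-9/\(\)\.]+\|[a-z \-0-9/\(\)\.]+$"),  # both sides Roman
]


def is_useless(line):
    """True if the line matches any of the junk patterns."""
    return any(pattern.match(line) for pattern in JUNK_PATTERNS)


def drop_useless(lines):
    """Keep only the lines that survive all four filters."""
    return [line for line in lines if not is_useless(line.rstrip("\n"))]
```

<p>Doing this inside the extraction script would have avoided writing the junk rows to <code class="language-plaintext highlighter-rouge">pairs.txt</code> in the first place.</p>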

<p>This removed a lot of junk. Note that all of these steps could have been coded in the extraction script, but at the time of writing the script, I didn’t think too much about all such cases. I wrote the script, ran it and then just went out for a few hours (it had to process a 38GB file without any parallel processing).</p>

<p>Now, looking at the data, I saw that many valid pairs were those where there were no spaces anywhere in the line. Let’s call this set <code class="language-plaintext highlighter-rouge">pairs1</code>. The following grep command separated these rows into <code class="language-plaintext highlighter-rouge">pairs1.txt</code> for me:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-Ei</span> <span class="s2">"^[^ ]+</span><span class="se">\|</span><span class="s2">[^ ]+$"</span> pairs.txt <span class="o">&gt;</span> pairs1.txt
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-i</code> flag is the ignore-case flag. Add a <code class="language-plaintext highlighter-rouge">-v</code> flag to the above command and you’ll get all the non-matching lines.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-Eiv</span> <span class="s2">"^[^ ]+</span><span class="se">\|</span><span class="s2">[^ ]+$"</span> pairs.txt <span class="o">&gt;</span> pairs_temp.txt
<span class="nb">mv </span>pairs_temp.txt pairs.txt
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">pairs.txt</code> only contains lines with at least one space in them.</p>

<h2 id="automated-sifting">Automated Sifting</h2>

<p>Another set of valid pairs, which I’ll call <code class="language-plaintext highlighter-rouge">pairs2</code>, was the rows where the number of spaces was equal on both sides of the pipe, that is, where the number of space-separated words was equal on both sides. I have made the assumption that all such pairs are word-by-word transliterations. Let’s understand with examples:</p>

<ol>
  <li>
    <p>Consider the pair: <code class="language-plaintext highlighter-rouge">tale of two cities|टेल ऑफ टू सिटिज़</code>. Here, there are four words on both sides of pipe and each English word is parallelly transliterated in Hindi. <code class="language-plaintext highlighter-rouge">टेल</code> is the transliteration of <code class="language-plaintext highlighter-rouge">tale</code>; <code class="language-plaintext highlighter-rouge">ऑफ</code> is the transliteration of <code class="language-plaintext highlighter-rouge">of</code>; <code class="language-plaintext highlighter-rouge">टू</code> is the transliteration of <code class="language-plaintext highlighter-rouge">two</code> and <code class="language-plaintext highlighter-rouge">सिटिज़</code> is the transliteration of <code class="language-plaintext highlighter-rouge">cities</code>.</p>
  </li>
  <li>
    <p>On the contrary, consider this pair: <code class="language-plaintext highlighter-rouge">middle kingdoms of india|भारत के मध्य साम्राज्य</code>. In this pair, even though both sides have same number of words, none of them are correct transliteration pairs when taken in parallel. <code class="language-plaintext highlighter-rouge">भारत</code> is not a transliteration of <code class="language-plaintext highlighter-rouge">middle</code>; <code class="language-plaintext highlighter-rouge">के</code> is not a transliteration of <code class="language-plaintext highlighter-rouge">kingdoms</code>; <code class="language-plaintext highlighter-rouge">मध्य</code> is not a transliteration of <code class="language-plaintext highlighter-rouge">of</code> and <code class="language-plaintext highlighter-rouge">साम्राज्य</code> is not a transliteration of <code class="language-plaintext highlighter-rouge">india</code>.</p>
  </li>
</ol>

<p>My assumption in extracting <code class="language-plaintext highlighter-rouge">pairs2</code> is that all the pairs are valid as in the 1st example. Once I have identified such rows, I create a list of such parallel transliterations and dump them to <code class="language-plaintext highlighter-rouge">pairs2.txt</code>. So for both of the above examples, the following eight lines will be added to <code class="language-plaintext highlighter-rouge">pairs2.txt</code>:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">tale|टेल</code></li>
  <li><code class="language-plaintext highlighter-rouge">of|ऑफ</code></li>
  <li><code class="language-plaintext highlighter-rouge">two|टू</code></li>
  <li><code class="language-plaintext highlighter-rouge">cities|सिटिज़</code></li>
  <li><code class="language-plaintext highlighter-rouge">middle|भारत</code></li>
  <li><code class="language-plaintext highlighter-rouge">kingdoms|के</code></li>
  <li><code class="language-plaintext highlighter-rouge">of|मध्य</code></li>
  <li><code class="language-plaintext highlighter-rouge">india|साम्राज्य</code></li>
</ol>

<p>This whole thing is covered by this small Python script - <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/datagen/wiki_trans_align.py">datagen/wiki_trans_align.py</a>. The <code class="language-plaintext highlighter-rouge">align_on_words</code> function (<code class="language-plaintext highlighter-rouge">line 8</code>) defines the logic for deciding whether a particular line belongs to the <code class="language-plaintext highlighter-rouge">pairs2</code> set. In the same file, <code class="language-plaintext highlighter-rouge">line 38</code> creates the list of parallel transliterations, the same as in the above list.</p>
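
<p>The word-by-word alignment described above boils down to something like the following. This is a simplified sketch with a hypothetical function name, not a copy of <code class="language-plaintext highlighter-rouge">align_on_words</code> from the linked script.</p>

```python
def split_into_word_pairs(line):
    """Split 'en words|hi words' into per-word 'en|hi' pairs.

    Returns an empty list when the two sides have a different
    number of space-separated words.
    """
    en, hi = line.split("|", 1)
    en_words, hi_words = en.split(), hi.split()
    if len(en_words) != len(hi_words):
        return []
    # Pair the first English word with the first Hindi word, and so on.
    return [f"{e}|{h}" for e, h in zip(en_words, hi_words)]


if __name__ == "__main__":
    for pair in split_into_word_pairs("tale of two cities|टेल ऑफ टू सिटिज़"):
        print(pair)
```

<p>Note that this happily produces the four wrong pairs for the <code class="language-plaintext highlighter-rouge">middle kingdoms of india</code> example too, which is why the later filtering is needed.</p>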

<p>Now the remaining rows are the ones where the spaces are unequal on both sides of the pipe. For such rows, since I couldn’t find any particular pattern, I created the transliteration pairs by taking cross-product of the list of words for both the source and the target strings.</p>

<p>If we have two lists, <code class="language-plaintext highlighter-rouge">[1, 2]</code> and <code class="language-plaintext highlighter-rouge">[3, 4, 5]</code>, then their cross-product will be <code class="language-plaintext highlighter-rouge">[(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5)]</code>. So if we have the pair <code class="language-plaintext highlighter-rouge">line of control|नियंत्रण रेखा</code>, we’ll get the following <code class="language-plaintext highlighter-rouge">6</code> (3 words from the left times 2 words from the right) transliteration pairs:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">line|नियंत्रण</code></li>
  <li><code class="language-plaintext highlighter-rouge">line|रेखा</code></li>
  <li><code class="language-plaintext highlighter-rouge">of|नियंत्रण</code></li>
  <li><code class="language-plaintext highlighter-rouge">of|रेखा</code></li>
  <li><code class="language-plaintext highlighter-rouge">control|नियंत्रण</code></li>
  <li><code class="language-plaintext highlighter-rouge">control|रेखा</code></li>
</ol>
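
<p>The cross-product step above maps directly onto <code class="language-plaintext highlighter-rouge">itertools.product</code>. A minimal sketch, with a hypothetical function name:</p>

```python
from itertools import product


def cross_pairs(line):
    """Pair every English word with every Hindi word of one line."""
    en, hi = line.split("|", 1)
    return [f"{e}|{h}" for e, h in product(en.split(), hi.split())]


if __name__ == "__main__":
    for pair in cross_pairs("line of control|नियंत्रण रेखा"):
        print(pair)
```

<p>The number of generated pairs grows as the product of the word counts on the two sides, which is a big part of why this set ends up being the noisiest.</p>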

<p>We write all such cross pairs in the <code class="language-plaintext highlighter-rouge">pairs3.txt</code> file. The <em>divide</em> part of the <em>divide and conquer strategy</em> is complete. Let’s start with the <em>conquering</em>. We’ll call the final file with the cleaned pairs <code class="language-plaintext highlighter-rouge">pairs_final.txt</code>.</p>

<p>In order to find a quick way to separate all the valid cases, I used a few heuristics:</p>

<ol>
  <li>
    <p>Created an ad-hoc transliteration function. It uses mappings of every Devanagari character to possible Roman characters. These codified mappings can be seen here - <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/datagen/utils/transliterate.py#L15">datagen/utils/transliterate.py:L15</a>. This script was written by a friend to be used for some other purpose (check out this blog entry for details - <a href="/2018/10/chatbot/">(Mis)adventures of Building a Chat Bot</a>). Using this function, I generated the set of possible transliterations (<code class="language-plaintext highlighter-rouge">transliterations</code>) of the Hindi word (<code class="language-plaintext highlighter-rouge">hi</code>) in every pair. If the English word (<code class="language-plaintext highlighter-rouge">en</code>) from the pair is in <code class="language-plaintext highlighter-rouge">transliterations</code>, the pair goes to <code class="language-plaintext highlighter-rouge">true.txt</code>. If not, the next check is whether the similarity based on <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> (a normalized score, where higher means more alike) between <code class="language-plaintext highlighter-rouge">en</code> and all the transliterations in <code class="language-plaintext highlighter-rouge">transliterations</code> is greater than <code class="language-plaintext highlighter-rouge">0.85</code>. If it is, the pair goes to <code class="language-plaintext highlighter-rouge">true.txt</code>. If this test also fails, we check whether the <code class="language-plaintext highlighter-rouge">max</code> Levenshtein similarity is less than <code class="language-plaintext highlighter-rouge">0.5</code>. If so, the pair goes to <code class="language-plaintext highlighter-rouge">v_false.txt</code>. If all the conditions fail, the pair is dumped in <code class="language-plaintext highlighter-rouge">false.txt</code>.
In simple words, if a pair passes a few conditions, it is considered almost certainly true; if it definitely fails a few conditions, it is assumed to be almost certainly wrong; all the remaining ones are uncategorized. The thresholds, <code class="language-plaintext highlighter-rouge">0.85</code> and <code class="language-plaintext highlighter-rouge">0.5</code>, were decided after trying various other values.</p>
  </li>
  <li>
    <p>Upon observation, I saw that, in most of the correct pairs, the difference in lengths of English and Hindi words was under <code class="language-plaintext highlighter-rouge">3</code>. So another heuristic was to put all such pairs where difference was not under 3, in <code class="language-plaintext highlighter-rouge">v_false.txt</code> and the remaining ones in <code class="language-plaintext highlighter-rouge">false.txt</code>.</p>
  </li>
  <li>
    <p>Created a filter function that takes frequent English word endings, maps them to their corresponding Hindi word endings, and extracts the pairs matching these mappings. The mappings I created are here: <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/datagen/wiki_trans_filter.py#L62">datagen/wiki_trans_filter.py:L62</a>. I put all these extracted pairs in <code class="language-plaintext highlighter-rouge">true.txt</code> and the remaining ones in <code class="language-plaintext highlighter-rouge">false.txt</code>.</p>
  </li>
</ol>
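
<p>As a rough sketch of the first heuristic, the triage could look like the code below. Everything here is an assumption made for illustration: the function names, the way the Levenshtein distance is normalized into a 0 to 1 similarity, and the use of the best-scoring candidate for both thresholds. The candidate transliterations are passed in as an argument; generating them is not reproduced here, and the linked script may well compute things differently.</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def triage(en, candidates, hi_threshold=0.85, lo_threshold=0.5):
    """Return 'true', 'v_false' or 'false' for one (en, candidates) pair."""
    if en in candidates:
        return "true"
    best = max((similarity(en, c) for c in candidates), default=0.0)
    if best > hi_threshold:
        return "true"
    if best < lo_threshold:
        return "v_false"
    return "false"
```

<p>The returned label is just the name of the file the pair would be appended to.</p>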

<p>The implementation of all these heuristics is in the following file: <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/datagen/wiki_trans_filter.py">datagen/wiki_trans_filter.py</a>.</p>

<p>Since there were three files having transliteration pairs - <code class="language-plaintext highlighter-rouge">pairs1.txt</code>, <code class="language-plaintext highlighter-rouge">pairs2.txt</code> and <code class="language-plaintext highlighter-rouge">pairs3.txt</code> - for each file, I created <code class="language-plaintext highlighter-rouge">true.txt</code>, <code class="language-plaintext highlighter-rouge">false.txt</code> and <code class="language-plaintext highlighter-rouge">v_false.txt</code>. There’s a high degree of confidence for most of the pairs in <code class="language-plaintext highlighter-rouge">true.txt</code> to be correct. Similarly, in <code class="language-plaintext highlighter-rouge">v_false.txt</code> most of the pairs would be wrong. Whereas, <code class="language-plaintext highlighter-rouge">false.txt</code> demanded more scrutiny. The filtered results (<code class="language-plaintext highlighter-rouge">true.txt</code>, <code class="language-plaintext highlighter-rouge">false.txt</code> and <code class="language-plaintext highlighter-rouge">v_false.txt</code>) created from the pairs of <code class="language-plaintext highlighter-rouge">pairs3.txt</code>, which was the noisiest file due to the cross product, also required more careful winnowing.</p>

<p>Here ended our automated step.</p>

<h2 id="manual">Manual</h2>

<p>There was nothing special about the manual process. Just going through all the rows and segregating all the valid ones. A few things helped increase the pace of the manual process.</p>

<p>As predicted, <code class="language-plaintext highlighter-rouge">true.txt</code> had mostly correct pairs. There were very few wrong pairs, but mostly, the processing of the true files finished very quickly. The same was true with the <code class="language-plaintext highlighter-rouge">v_false.txt</code>. Most of the pairs were categorically wrong. Only the <code class="language-plaintext highlighter-rouge">v_false.txt</code> file generated from the <code class="language-plaintext highlighter-rouge">pairs3.txt</code> contained many valid pairs and took some time to go through.</p>

<p>Most of the time, as noted earlier, was consumed by <code class="language-plaintext highlighter-rouge">false.txt</code>. I can’t talk about the exact distribution, but if I had to guess, the ratio of correct to wrong pairs was around 40/60. While working on these files, I realised that working on a sorted file is much faster because, after sorting, you can quickly apply a binary-search sort of method for each pair to find the correct row and eliminate the others. This method worked because of the</p>

<ul>
  <li>Presence of a lot of duplicate pairs; and the</li>
  <li>Presence of varying transliterations for a single English word.</li>
</ul>

<p>So sorting by English words brought all the duplicates, the varying (but correct) transliterations and the wrong transliterations for each en word together, which helped with quick elimination of wrong pairs. As each file was finished, the correct pairs were added to <code class="language-plaintext highlighter-rouge">pairs_final.txt</code>, giving us the final dataset in the end.</p>
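
<p>I did the sorting and eyeballing by hand in the editor, but the grouping that made it fast can be sketched as below. The function name is hypothetical, and it assumes every line is a single <code class="language-plaintext highlighter-rouge">en|hi</code> pair.</p>

```python
from itertools import groupby


def group_by_english(lines):
    """Dedupe 'en|hi' lines and collect the Hindi candidates per English word."""
    # A set drops exact duplicates; sorting puts equal English words together.
    pairs = sorted({tuple(line.split("|", 1)) for line in lines})
    return {
        en: [hi for _, hi in group]
        for en, group in groupby(pairs, key=lambda pair: pair[0])
    }


if __name__ == "__main__":
    sample = ["of|ऑफ", "of|मध्य", "of|ऑफ", "tale|टेल"]
    for en, candidates in group_by_english(sample).items():
        print(en, candidates)
```

<p>With all the candidates for one English word side by side, picking the correct ones and discarding the rest becomes a quick visual check.</p>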

<h1 id="data-stats">Data Stats</h1>

<p>Let’s look at some of the data stats:</p>

<p><br /></p>

<div class="rendered_html">
<table style="margin-left: 0.5cm;">
  <tr>
    <th>Statistic</th>
    <th>Value</th>
  </tr>
  <tr>
    <td>Extracted Pairs</td>
    <td><span style="font-weight:normal">217,393</span></td>
  </tr>
  <tr>
    <td>Unique Pairs</td>
    <td><span style="font-weight:normal">87,873</span></td>
  </tr>
  <tr>
    <td><span style="font-weight:normal">English words</span></td>
    <td><span style="font-weight:normal">70,835</span></td>
  </tr>
  <tr>
    <td><span style="font-weight:normal">Hindi words</span></td>
    <td><span style="font-weight:normal">75,434</span></td>
  </tr>
</table>
</div>

<p>After going through this activity, I googled for transliteration datasets available in the public domain. Here’s the final list I was able to create - <a href="https://github.com/TrigonaMinima/HinglishNLP/blob/master/data/transliteration/">data/transliteration</a>. If you take a look at all the datasets, you’ll find that the largest one contains around 70k unique transliteration pairs. That alone makes this Wiki dataset the largest. I have also not deduped the dataset.</p>

<p>In the last two rows of the table, you can see that there are more Hindi words than English words. This shows that some English words have multiple Hindi transliterations.</p>

<p><br /></p>

<div class="rendered_html">
<table style="margin-left: 0.5cm;">
  <tr>
    <th>Statistic</th>
    <th>Min</th>
    <th>Max</th>
    <th>Mean</th>
    <th>Median</th>
    <th>Std</th>
  </tr>
  <tr>
    <td>Word Length (En)</td>
    <td>1</td>
    <td>33</td>
    <td>6.3</td>
    <td>6.0</td>
    <td>2.2</td>
  </tr>
  <tr>
    <td>Word Length (Hi)</td>
    <td>1</td>
    <td>33</td>
    <td>5.8</td>
    <td>6.0</td>
    <td>2.2</td>
  </tr>
</table>
</div>

<p>Most of the descriptive stats are the same for both English and Hindi words. This might be because of a one-to-one mapping between English and Hindi sounds, since this dataset is essentially English dictionary words (Roman script) written in Hindi (Devanagari script). Or it might be because of something else which I haven’t observed or understood.</p>

<h1 id="conclusion">Conclusion</h1>

<p>As explained in this blog post, using Wikidata, I was able to extract a decent number of training samples to train a neural transliteration model or to do any other analysis. The whole process took me around 3-4 days with 2-3 hours of work each day. That’s not a lot of time, and it was just a single person doing the task. The data was also obtained for free, from crowd-sourced data, without using platforms like MTurk. And since this data is crowd-sourced and has been checked and revised by many volunteers, the transliterations can also be assumed to be of excellent quality.</p>

<p>Wikipedia (and Wikidata) is a great source of data. It’s upon us to find ways to extract data for our ML models. Now that I have this data, next step is to train a seq2seq model and see how it’s doing on some unseen data (eg. <a href="/2018/10/chatbot/">chats</a>).</p>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="Data" /><summary type="html"><![CDATA[Many researchers use Wikipedia as a source of data to train NLP models. The public domain nature of the crowd-sourced articles on wide-ranging topics in multiple languages opens up a lot of possibilities for research. There are multiple modes of data extraction from Wikipedia:]]></summary></entry><entry><title type="html">Seq2Seq Components</title><link href="https://trigonaminima.github.io/2019/09/seq2seq-components/" rel="alternate" type="text/html" title="Seq2Seq Components" /><published>2019-09-19T00:00:00+00:00</published><updated>2019-09-19T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2019/09/seq2seq-components</id><content type="html" xml:base="https://trigonaminima.github.io/2019/09/seq2seq-components/"><![CDATA[<p>Real world data is almost always in bad shape. You have to clean it properly to make any use of it. Cleaning becomes more important if this is your training data for a machine learning model. And these problems especially become worse if you are dealing with short text. I am talking about the text generated on platforms like Twitter, Facebook, YouTube, Instagram, WhatsApp, Telegram etc. These platforms and many others, also give searching capabilities, which is another source of such short textual data. We make a lot of mistakes. It’s common to make typos or use non-standard (short) forms of various words. It’s also common to have non-standard abbreviations by the interlocutors. 
It’s also common for multilingual users to use romanized forms of their native language mixed in with the English text (a phenomenon called <a href="https://en.wikipedia.org/wiki/Code-switching">code-switching</a>). I discovered many of these issues when I worked on a Telegram chatbot to do something fun with the conversations happening there. Here are the details on <a href="/2018/10/chatbot/">what I did</a>.</p>

<p>Working on Hinglish data opened up a lot of directions for me to explore. Two major things I wanted to accomplish were to get <strong>word embeddings</strong> and build a <strong>language model</strong> (LM) for Hinglish. Other downstream tasks can be done using the LM or the embeddings. Now, to achieve these two tasks, I’ll either need a LOT of data, or I can be smart about it and use transfer learning to get it done with a small amount of data. I’ll still need some Hinglish data to fine-tune the model, though. Another issue was to identify which word was from which language, basically breaking down the code-switching participants. I read a lot of language identification literature from Indian researchers, mainly from Indian Unis and Microsoft Research, India. Most of the publications were 3-4 years old and based on statistical machine learning approaches, which had not very good accuracy and required manual creation of training data, which is definitely not scalable. Moreover, they were all just identifying the language - whether Hindi or English - and not back-transliterating. I needed something that’d do both tasks without a lot of (possibly, without any) manual steps. <strong>Seq2Seq</strong> and <strong>Transformer</strong> models are trending these days with great strides in NMT, LM and Word Embeddings. Turns out, my task can be solved by Seq2Seq or Transformers: transform one sequence into another without any manual feature engineering. I started with Seq2Seq.</p>

<p>So that was the introduction. This post is a prelude to the more important article about Seq2Seq and Transformer architectures and finally how they were employed to solve my specific use-case. When I started reading the implementation of Seq2Seq architectures, I was getting lost around the shapes of data passing from one layer to the other. In this post, I’ll explain the major components of a Seq2Seq: what they do, what goes in, what comes out and what are the shapes of these inputs and outputs. It is a dump of what I understood for my future reference. I’ll <em>not</em> be discussing the internal implementation of these layers. For that, I’ll point to other much better resources.</p>

<h2 id="contents">Contents</h2>

<ol>
  <li><a href="#embedding">Embedding Layer</a></li>
  <li><a href="#dropout">Dropout Layer</a></li>
  <li><a href="#lstm">LSTM Layer</a></li>
  <li><a href="#stacked_lstm">Stacked LSTM Layer</a></li>
  <li><a href="#bi_lstm">Bidirectional LSTM Layer</a></li>
  <li><a href="#gru">GRU Layer</a></li>
  <li><a href="#lin">Linear Layer</a></li>
</ol>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">torch</span>

<span class="kn">from</span> <span class="nn">IPython.core.interactiveshell</span> <span class="kn">import</span> <span class="n">InteractiveShell</span>
<span class="n">InteractiveShell</span><span class="p">.</span><span class="n">ast_node_interactivity</span> <span class="o">=</span> <span class="s">"all"</span>
</pre></td></tr></tbody></table></code></pre></figure>

<h2 id="i-embedding-layer-">I. Embedding Layer <a name="embedding"></a></h2>

<p>Let’s start with the embedding layer. An embedding layer converts a token (read: word) into a vector of some fixed length in some latent space. This is the layer which gives a numerical representation to a string, word, phrase or whatever we define a token to be. This token belongs to a fixed dictionary. You can also call the embedding layer a lookup table where you get the embedding of any token that you search for. Pre-trained word embeddings released in the wild, such as <a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a> and <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>, are basically dumps of these fixed-length vectors, one for each word of the vocabulary, in a text file. Models like <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a> and <a href="https://github.com/zihangdai/xlnet">XLNet</a> also begin with such a lookup, although they are released as full models rather than as a text file of vectors.</p>

<p>As an illustration, the following table depicts an embedding layer. Each cell contains a floating point number which is learned during the training process. Each row belongs to a particular token (or word). This particular embedding layer has a vocabulary of size 6 and a fixed embedding dimension of size <code class="language-plaintext highlighter-rouge">9</code>.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/embedding_layer.jpg" /></p>

<p>Let’s see it in action through PyTorch. When we pass the input through the embedding layer, we get a vector associated with each token.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1"># Size of the dictionary
</span><span class="n">input_vocab_dim</span> <span class="o">=</span> <span class="mi">5</span>

<span class="c1"># Size of embedding
</span><span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">20</span>

<span class="c1"># Defining the embedding layer
</span><span class="n">embedding_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">input_vocab_dim</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">)</span>
<span class="n">embedding_layer</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Embedding(5, 20)
</code></pre></div></div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># A single sentence having 5 tokens
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1"># getting the vector for each token using the embedding layer
</span><span class="n">out</span> <span class="o">=</span> <span class="n">embedding_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([5])
output shape	: torch.Size([5, 20])
tensor([[-1.9582,  0.3694, -0.0891,  0.1013, -0.9002, -0.8440, -1.4958,  0.5866,
         -1.3941, -0.5076, -0.0253,  0.2521, -0.2686,  0.8453,  0.7316, -0.5722,
          0.8934,  0.0096,  0.1266, -0.1782],
        [ 0.8200,  2.9792,  2.5127,  1.4575, -0.3157,  1.3983, -0.3372, -1.3391,
         -2.6474,  1.7839,  0.6624,  0.7945, -0.1209,  0.3479, -0.5690, -0.0209,
         -0.3274,  0.5088, -0.6134,  0.6749],
        [-0.2992, -0.8587, -0.5895,  0.6793,  1.4242,  1.0162, -0.1293, -0.1920,
         -0.1430,  0.0995, -0.6117, -0.9773,  1.0624,  0.6776, -0.8358, -1.1597,
         -1.9794,  0.3239, -0.0553,  1.7774],
        [-0.3628, -2.0340,  0.3843, -0.9901,  0.3058,  1.7236,  0.0779,  0.7414,
          0.2678,  0.0504,  1.2800,  0.0690, -0.4973, -0.4386, -0.4045, -0.1839,
          1.1021,  0.1258, -0.0121, -2.9886],
        [-0.5037, -0.5081,  0.9251,  0.5090, -1.7795,  0.7403, -0.8271, -0.0379,
          1.3416,  0.2089,  0.5362,  0.5990,  0.3598, -0.0691, -0.6134,  0.5894,
         -1.6609,  1.3345,  0.1430, -1.1380]], grad_fn=&lt;EmbeddingBackward&gt;)
</code></pre></div></div>

<p>Here we got a vector of size <code class="language-plaintext highlighter-rouge">20</code> for each of the <code class="language-plaintext highlighter-rouge">5</code> tokens. So, here we have a matrix of size <code class="language-plaintext highlighter-rouge">5X20</code>. The five values in our input tensor <code class="language-plaintext highlighter-rouge">[1, 2, 3, 4, 0]</code> are basically the row indices for which we want a <code class="language-plaintext highlighter-rouge">20</code> values long vector from the embedding layer. Now do you see the <strong>lookup table</strong>?</p>
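<p>To make the lookup table idea concrete, here is a small sketch that compares the layer’s output with plain row indexing into its weight matrix. It assumes the same sizes as above (a vocabulary of 5 and an embedding size of 20); the variable names <code class="language-plaintext highlighter-rouge">rows</code> and <code class="language-plaintext highlighter-rouge">same_lookup</code> are only for illustration.</p>

```python
import torch

torch.manual_seed(0)

# Same sizes as in the example above: 5 tokens in the vocabulary, 20 values per token
embedding_layer = torch.nn.Embedding(5, 20)
inp = torch.tensor([1, 2, 3, 4, 0])

# Forward pass through the layer
out = embedding_layer(inp)

# Plain row indexing into the layer's weight matrix
rows = embedding_layer.weight[inp]

# True when both approaches return exactly the same vectors
same_lookup = torch.equal(out, rows)
print("weight shape\t:", embedding_layer.weight.shape)
print("same as indexing\t:", same_lookup)
```

<p>If both paths return the same values, the forward pass is doing nothing more than picking rows out of a <code class="language-plaintext highlighter-rouge">5X20</code> matrix.</p>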

<p>Note: The numbers that came out are random numbers the layer was initialized with when defined using <code class="language-plaintext highlighter-rouge">nn.Embedding</code>. To make your experiment or results reproducible you should <strong>set the random seed</strong> before starting any model implementation.</p>
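<p>A minimal sketch of what setting the seed buys us, assuming <code class="language-plaintext highlighter-rouge">torch.manual_seed</code> is the only source of randomness we care about. The helper <code class="language-plaintext highlighter-rouge">make_embedding</code> and the seed values are hypothetical and exist only for this illustration.</p>

```python
import torch


def make_embedding(seed):
    # Hypothetical helper: seed the generator, then build the layer
    torch.manual_seed(seed)
    return torch.nn.Embedding(5, 20)


first = make_embedding(42)
second = make_embedding(42)
other = make_embedding(7)

# Same seed gives the same initial weights, a different seed does not
same_seed_match = torch.equal(first.weight, second.weight)
different_seed_match = torch.equal(first.weight, other.weight)
print("same seed\t:", same_seed_match)
print("different seed\t:", different_seed_match)
```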

<p>This was a single sentence, now let’s see how to send in a batch of sentences.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1"># Batch of 4 sentences having 5 tokens each.
# Batch size is 4. Max sentence len is 5
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">inp</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="n">repeat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span> <span class="o">=</span> <span class="n">embedding_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">out</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:])</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([5, 4])
output shape	: torch.Size([5, 4, 20])
tensor([[-1.9582,  0.3694, -0.0891,  0.1013, -0.9002, -0.8440, -1.4958,  0.5866,
         -1.3941, -0.5076, -0.0253,  0.2521, -0.2686,  0.8453,  0.7316, -0.5722,
          0.8934,  0.0096,  0.1266, -0.1782],
        [ 0.8200,  2.9792,  2.5127,  1.4575, -0.3157,  1.3983, -0.3372, -1.3391,
         -2.6474,  1.7839,  0.6624,  0.7945, -0.1209,  0.3479, -0.5690, -0.0209,
         -0.3274,  0.5088, -0.6134,  0.6749],
        [-0.2992, -0.8587, -0.5895,  0.6793,  1.4242,  1.0162, -0.1293, -0.1920,
         -0.1430,  0.0995, -0.6117, -0.9773,  1.0624,  0.6776, -0.8358, -1.1597,
         -1.9794,  0.3239, -0.0553,  1.7774],
        [-0.3628, -2.0340,  0.3843, -0.9901,  0.3058,  1.7236,  0.0779,  0.7414,
          0.2678,  0.0504,  1.2800,  0.0690, -0.4973, -0.4386, -0.4045, -0.1839,
          1.1021,  0.1258, -0.0121, -2.9886],
        [-0.5037, -0.5081,  0.9251,  0.5090, -1.7795,  0.7403, -0.8271, -0.0379,
          1.3416,  0.2089,  0.5362,  0.5990,  0.3598, -0.0691, -0.6134,  0.5894,
         -1.6609,  1.3345,  0.1430, -1.1380]], grad_fn=&lt;SliceBackward&gt;)
</code></pre></div></div>

<p>Here we got a vector of size <code class="language-plaintext highlighter-rouge">20</code> for each of the <code class="language-plaintext highlighter-rouge">5</code> tokens for each of the <code class="language-plaintext highlighter-rouge">4</code> sentences in the batch. Since we have the same sentence repeated <code class="language-plaintext highlighter-rouge">4</code> times, we have the same vector representation repeated <code class="language-plaintext highlighter-rouge">4</code> times. You can see that the printed values are the same as in the previous output.</p>

<p>Shape summary is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Shape  : (*)
Out Shape : (*, H)
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">H</code> is the embedding size and <code class="language-plaintext highlighter-rouge">*</code> means any general shape.</p>
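<p>A quick check of this summary on a few input shapes. This is only a sketch: the shapes in the list are arbitrary picks of mine, and the token ids are random integers within the vocabulary.</p>

```python
import torch

embedding_layer = torch.nn.Embedding(5, 20)

# Arbitrary input shapes, chosen only for illustration
shapes = [(5,), (5, 4), (2, 3, 4)]
out_shapes = []
for shape in shapes:
    # Random token ids in the range [0, 5)
    inp = torch.randint(0, 5, shape)
    out_shapes.append(tuple(embedding_layer(inp).shape))

for shape, out_shape in zip(shapes, out_shapes):
    print(shape, "becomes", out_shape)
```

<p>Whatever the input shape, the output keeps it and appends one extra dimension of size <code class="language-plaintext highlighter-rouge">H</code>.</p>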

<h2 id="ii-dropout-layer-">II. Dropout Layer <a name="dropout"></a></h2>

<p>Next up is the Dropout Layer. This is a very important layer for a neural network. Due to its regularization effect, it helps prevent overfitting. Here’s some <a href="https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR">intuition on why</a>. This layer, based on the probability given by us, randomly turns some of the elements into zeroes. Here’s the demonstration:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="n">dropout_probab</span> <span class="o">=</span> <span class="mf">0.5</span>

<span class="n">dropout</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout_probab</span><span class="p">)</span>
<span class="n">dropout</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dropout(p=0.5, inplace=False)
</code></pre></div></div>

<p>We have defined a dropout layer which will randomly turn around 50% of the values into zero.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1"># A random tensor with 8 values
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input tensor</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">)</span>

<span class="c1"># passing the tensor through the dropout layer
</span><span class="n">out</span> <span class="o">=</span> <span class="n">dropout</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output tensor</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([8])
input tensor	: tensor([ 0.5486,  1.0968, -0.6814,  0.5524, -1.9662, -0.0759,  1.0506,  1.1233])

output shape	: torch.Size([8])
output tensor	: tensor([ 1.0971,  2.1935, -0.0000,  0.0000, -0.0000, -0.1517,  0.0000,  2.2466])
</code></pre></div></div>

<p>Here, we have zeroed approximately 50% of the values. If you look carefully at the non-zero values, they are different from the original values. That’s because, during training, the dropout layer scales the non-zero values by \(\frac{1}{(1-p)}\), where \(p\) is the dropout probability. For <code class="language-plaintext highlighter-rouge">p = 0.5</code>,</p>

\[\frac{1}{(1-p)} = \frac{1}{(1-0.5)} = \frac{1}{0.5} = 2\]

<p>So, after zeroing around 50% of the values from the input tensor, every remaining value is scaled by <code class="language-plaintext highlighter-rouge">2</code>.</p>
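<p>The scaling can be checked directly. The sketch below also switches the layer to evaluation mode with <code class="language-plaintext highlighter-rouge">eval()</code>, where dropout leaves the input untouched. The variable names are mine and the seed is arbitrary.</p>

```python
import torch

torch.manual_seed(0)

p = 0.5
dropout = torch.nn.Dropout(p)
inp = torch.randn(8)

# Training mode (the default): some values are zeroed, the rest are scaled by 1 / (1 - p)
dropout.train()
out = dropout(inp)
kept = out != 0
scaled_ok = torch.allclose(out[kept], inp[kept] / (1 - p))

# Evaluation mode: dropout passes the input through unchanged
dropout.eval()
eval_out = dropout(inp)
unchanged_in_eval = torch.equal(eval_out, inp)

print("kept values scaled by 2\t:", scaled_ok)
print("unchanged in eval mode\t:", unchanged_in_eval)
```

<p>This is why the layer should be in training mode only while training: at inference time we want all the values, without any zeroing or scaling.</p>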

<p>Let’s see what we’ll get after we pass a tensor with a different shape.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1"># a random tensor of shape (5, 4, 20), i.e. 400 values
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1"># passing the tensor through the dropout layer
</span><span class="n">out</span> <span class="o">=</span> <span class="n">dropout</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Total cells in input</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="mi">5</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Zero values in input</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">inp</span><span class="o">==</span><span class="mf">0.00</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Zero values in output</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">out</span><span class="o">==</span><span class="mf">0.00</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([5, 4, 20])
output shape	: torch.Size([5, 4, 20])

Total cells in input	: 400
Zero values in input	: tensor(0)
Zero values in output	: tensor(207)
</code></pre></div></div>

<p>So, the shapes remain the same. That makes sense, as the dropout layer only zeroes and rescales the values and doesn’t do any other processing or operations. But it nuked around 50% of the values (207 out of 400 here).</p>

<p>Shape summary of the layer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (*)
Out Size : (*)
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">*</code> means any given shape.</p>

<h2 id="iii-lstm-layer-">III. LSTM Layer <a name="lstm"></a></h2>

<p>This is a big one. LSTM, short for Long Short-Term Memory, is a type of Recurrent Neural Network (RNN). I am not going to talk about the internals of LSTMs here. For that, there is the lucid blog post <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a> by Chris Olah. I’ll consider the LSTM as a black box and build my explanation on top of that. An LSTM unit uses the previous context (the hidden and cell states) along with the input to produce some output and a new context. Here’s my visualization of a <em>rolled</em> LSTM network.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/lstm.jpg" alt="lstm" /></p>

<p>Let’s unpack all the symbols in the figure:</p>

<ol>
  <li>\(I_1\) is the input at time \(t_1\) having a size of <code class="language-plaintext highlighter-rouge">input_dim</code>;</li>
  <li>\(H_1\) is the hidden state generated at time \(t_1\) having a size of <code class="language-plaintext highlighter-rouge">hidden_dim</code>;</li>
  <li>\(C_1\) is the cell state generated at time \(t_1\) having a size of <code class="language-plaintext highlighter-rouge">hidden_dim</code>;</li>
  <li>\(O_1\) is output generated at time \(t_1\) having a size of <code class="language-plaintext highlighter-rouge">hidden_dim</code> (it is same as \(H_1\));</li>
  <li>\(H_0\) and \(C_0\) are the hidden and cell states generated at time \(t_0\) having a size of <code class="language-plaintext highlighter-rouge">hidden_dim</code> (when we start the network training, these states are initialised with <code class="language-plaintext highlighter-rouge">0</code>);</li>
  <li>Each LSTM unit maintains a hidden state, \(H_n\), and a cell state, \(C_n\). Since LSTM is a recurrent unit, it takes the hidden state (\(H_0\)) and cell state (\(C_0\)) from the previous time step as input (or internal configuration or context) and uses them to generate the current output (\(O_1\)), hidden state (\(H_1\)) and cell state (\(C_1\)).</li>
</ol>
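<p>The symbols above map onto code fairly directly. The sketch below uses <code class="language-plaintext highlighter-rouge">torch.nn.LSTMCell</code>, which processes one time step per call, purely to illustrate a single unit; the rest of this post uses <code class="language-plaintext highlighter-rouge">torch.nn.LSTM</code>. The variable names mirror the figure and the sizes are the ones used later in this section.</p>

```python
import torch

torch.manual_seed(0)

input_dim = 5
hidden_dim = 15

# One LSTM unit that processes a single time step per call
lstm_cell = torch.nn.LSTMCell(input_dim, hidden_dim)

# I_1: input at time t_1 for a batch with one sentence
i_1 = torch.randn(1, input_dim)

# H_0 and C_0: initial hidden and cell states, all zeros
h_0 = torch.zeros(1, hidden_dim)
c_0 = torch.zeros(1, hidden_dim)

# H_1 and C_1: states generated at time t_1 (the output O_1 is H_1 itself)
h_1, c_1 = lstm_cell(i_1, (h_0, c_0))

# Feeding the new states back in, along with the next input, gives the next step
i_2 = torch.randn(1, input_dim)
h_2, c_2 = lstm_cell(i_2, (h_1, c_1))

print("H_1 shape\t:", h_1.shape)
print("C_1 shape\t:", c_1.shape)
```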

<p>Below is the unrolled look of the above network.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/lstm2.jpg" /></p>

<p>Outputs from previous time step become the input for the next time step along with the actual input of that time step.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">input_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">hidden_dim</span> <span class="o">=</span> <span class="mi">15</span>

<span class="n">lstm</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
<span class="n">lstm</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LSTM(5, 15)
</code></pre></div></div>

<p>In our LSTM definition, it takes an input of size <code class="language-plaintext highlighter-rouge">5</code> (not exactly <code class="language-plaintext highlighter-rouge">5</code>, but I’ll expand on that in a bit) and gives an output and hidden state with size <code class="language-plaintext highlighter-rouge">15</code> (again, not exactly <code class="language-plaintext highlighter-rouge">15</code>).</p>

<p>An LSTM layer expects input to be in the shape of <code class="language-plaintext highlighter-rouge">(max_sentence_len, batch_size, input_size)</code>. For example, if the input is of size <code class="language-plaintext highlighter-rouge">(1, 1, 5)</code>, then the sentence contains only one token, our batch contains only one sentence, and that one token is represented by a vector of length <code class="language-plaintext highlighter-rouge">5</code> (this vector can be anything: an embedding vector or a preprocessed vector from some other layer). Similarly, the output is in the shape of <code class="language-plaintext highlighter-rouge">(max_sentence_len, batch_size, hidden_size)</code>. So the output for our example input should be of the shape <code class="language-plaintext highlighter-rouge">(1, 1, 15)</code>: the one token of our only sentence in the batch is now represented by a vector of length <code class="language-plaintext highlighter-rouge">15</code>. Let’s see through code what the shapes of the outputs and hidden states will be.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (1, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([1, 1, 5])
output shape	: torch.Size([1, 1, 15])
hidden shape	: torch.Size([1, 1, 15])
cell shape	: torch.Size([1, 1, 15])
</code></pre></div></div>

<p>Let’s try with a sentence length greater than 1.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 1, 5])
output shape	: torch.Size([4, 1, 15])
hidden shape	: torch.Size([1, 1, 15])
cell shape	: torch.Size([1, 1, 15])
</code></pre></div></div>

<p>Output is in the expected shape of <code class="language-plaintext highlighter-rouge">(max_sentence_len, batch_size, hidden_size)</code>: our LSTM unit iterates through each of the <code class="language-plaintext highlighter-rouge">4</code> tokens of the sentence and generates an output for each. So, <code class="language-plaintext highlighter-rouge">4</code> outputs. At the same time, it is also generating a hidden state and cell state with each output, which are then fed into the next iteration with the next token. This goes on till all the tokens are processed. And hence, we only have one hidden and cell state from the sentence after the final iteration. This final hidden state is also called a <strong>context vector</strong> as it’s kind of a representation of our whole sentence in a single vector of length <code class="language-plaintext highlighter-rouge">15</code>.</p>
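<p>The iteration described above can be written out by hand. This sketch feeds the same sentence to the layer one token at a time, carrying the hidden and cell states forward, and compares the result with a single call over the whole sentence. It assumes a single-layer, one-directional LSTM like the one defined above; the <code class="language-plaintext highlighter-rouge">step_*</code> names and the tolerance are my own choices.</p>

```python
import torch

torch.manual_seed(0)

lstm = torch.nn.LSTM(5, 15)
inp = torch.randn(4, 1, 5)  # 4 tokens, batch of 1, input size 5

# Whole sentence in one call
out, (hid, cell) = lstm(inp)

# Same sentence, one token per call, carrying the states forward by hand
step_hid = torch.zeros(1, 1, 15)
step_cell = torch.zeros(1, 1, 15)
step_outs = []
for t in range(inp.shape[0]):
    step_out, (step_hid, step_cell) = lstm(inp[t:t + 1], (step_hid, step_cell))
    step_outs.append(step_out)

loop_matches = torch.allclose(out, torch.cat(step_outs, dim=0), atol=1e-6)
states_match = (
    torch.allclose(hid, step_hid, atol=1e-6)
    and torch.allclose(cell, step_cell, atol=1e-6)
)

# The final hidden state (context vector) is the output of the last token
context_is_last_output = torch.allclose(out[-1], hid[0])

print("loop matches full call\t:", loop_matches)
print("final states match\t:", states_match)
print("context is last output\t:", context_is_last_output)
```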

<p>Let’s see the shapes when we have a bigger batch size.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 6, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 6, 5])
output shape	: torch.Size([4, 6, 15])
hidden shape	: torch.Size([1, 6, 15])
cell shape	: torch.Size([1, 6, 15])
</code></pre></div></div>

<p>A batch size of <code class="language-plaintext highlighter-rouge">6</code> means that we are going to process <code class="language-plaintext highlighter-rouge">6</code> sentences, each having <code class="language-plaintext highlighter-rouge">4</code> tokens, with <code class="language-plaintext highlighter-rouge">5</code> being the length of the numerical representation (word embedding) of each token. In the output, hidden state and cell state, the 2nd dimension is the same as the batch size, i.e., <code class="language-plaintext highlighter-rouge">6</code>. This shows that we have outputs, hidden states and cell states for all <code class="language-plaintext highlighter-rouge">6</code> sentences, which makes sense as we want to process every sentence in the batch. Everything else is the same as before: the hidden and cell states are <code class="language-plaintext highlighter-rouge">6</code> vectors of length <code class="language-plaintext highlighter-rouge">15</code> (one per sentence), and the output is a vector of length <code class="language-plaintext highlighter-rouge">15</code> for each token in each sentence.</p>
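<p>One way to convince yourself that the batch dimension really is "one result per sentence" is to run each sentence through the layer on its own and compare. This is a small sketch under the same sizes as above; <code class="language-plaintext highlighter-rouge">stitched</code> and <code class="language-plaintext highlighter-rouge">same</code> are illustrative names, and the comparison uses a small tolerance because of floating point arithmetic.</p>

```python
import torch

torch.manual_seed(0)

lstm = torch.nn.LSTM(5, 15)
inp = torch.randn(4, 6, 5)  # 4 tokens, 6 sentences, embedding size 5

# Process the whole batch at once.
batch_out, (batch_hid, batch_cell) = lstm(inp)

# Process each sentence separately as a batch of size 1,
# then stitch the results back together along the batch dimension.
single_outs = [lstm(inp[:, i:i + 1, :])[0] for i in range(inp.shape[1])]
stitched = torch.cat(single_outs, dim=1)

same = torch.allclose(batch_out, stitched, atol=1e-5)
print("stitched shape        :", stitched.shape)
print("batched == one-by-one :", same)
```

<p>Sentences in a batch do not influence each other; batching is only a way to process them together.</p>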

<p>Shape summary for this layer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (max_seq_len, batch_size, input_size)
Out Size : (max_seq_len, batch_size, hidden_size)
Hid Size : (1, batch_size, hidden_size)
Cell Size: (1, batch_size, hidden_size)
</code></pre></div></div>

<h2 id="iv-stacked-lstm-layer-">IV. Stacked LSTM Layer <a name="stacked_lstm"></a></h2>

<p>A stacked LSTM layer is made up of LSTMs stacked on top of one another. Here is the <em>rolled</em> view of stacked LSTMs with two levels.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/stacked_lstm1.png" /></p>

<p>In this diagram, L1 and L2 are two layers, meaning this stacked LSTM layer has two internal levels. The flow on the right is shown for an input sequence of length <code class="language-plaintext highlighter-rouge">3</code>. The first input token, \(I_1\), and the L1 initial states, \(H_0\) and \(C_0\), go into the L1 LSTM to produce an output, \(O_1\), and new hidden and cell states, \(H_1\) and \(C_1\). This output, \(O_1\), then becomes the input for the L2 LSTM along with L2’s own initial states, \(H_0\) and \(C_0\). The L2 LSTM produces its own output, \(O_1\), and its own states, \(H_1\) and \(C_1\). Each level’s new states are fed back into that level’s LSTM with the next input, and the process continues. Here is the same thing unrolled.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/stacked_lstm2.png" /></p>

<p>Let’s go through it in code.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="n">input_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">hidden_dim</span> <span class="o">=</span> <span class="mi">15</span>
<span class="n">num_layers</span> <span class="o">=</span> <span class="mi">2</span>

<span class="n">lstm</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">num_layers</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">lstm</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LSTM(5, 15, num_layers=2, dropout=0.5)
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">num_layers</code> is self-explanatory, but <code class="language-plaintext highlighter-rouge">dropout</code> needs an explanation. As discussed in the Dropout layer section, dropout helps with regularization, and it does the same thing here. Before the output of one level becomes the input of the next level’s LSTM, it is passed through a dropout layer internally; the output of the last level is not dropped. PyTorch lets us set the dropout probability for that purpose. Everything else is the same as before: input size of <code class="language-plaintext highlighter-rouge">5</code> and hidden size of <code class="language-plaintext highlighter-rouge">15</code>. Let’s check the output shapes.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (1, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([1, 1, 5])
output shape	: torch.Size([1, 1, 15])
hidden shape	: torch.Size([2, 1, 15])
cell shape	: torch.Size([2, 1, 15])
</code></pre></div></div>

<p>Everything is the same as in the simple LSTM layer except the first dimension of the hidden and cell states, which is now <code class="language-plaintext highlighter-rouge">2</code>. You guessed it right: because we have two levels of LSTMs, we also have two sets of hidden and cell states. You can also see this in the unrolled figure above: hidden states (\(H_3\), \(H_3\)) and cell states (\(C_3\), \(C_3\)), one from each of L1 and L2.</p>
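<p>To make the L1 and L2 flow from the figures concrete, here is a sketch that builds the two levels by hand from two single-level LSTMs and compares them with a stacked <code class="language-plaintext highlighter-rouge">nn.LSTM</code>. The names <code class="language-plaintext highlighter-rouge">l1</code>, <code class="language-plaintext highlighter-rouge">l2</code> and <code class="language-plaintext highlighter-rouge">stacked</code> are illustrative, and dropout is left at its default of <code class="language-plaintext highlighter-rouge">0</code> (an assumption made here so that the comparison is deterministic).</p>

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim = 5, 15

# Reference: a 2-level stacked LSTM (no dropout, so results are repeatable).
stacked = torch.nn.LSTM(input_dim, hidden_dim, num_layers=2)

# Two separate single-level LSTMs playing the roles of L1 and L2.
l1 = torch.nn.LSTM(input_dim, hidden_dim)
l2 = torch.nn.LSTM(hidden_dim, hidden_dim)

# Copy the stacked weights into l1 and l2
# (PyTorch names the per-level parameters with the suffixes _l0 and _l1).
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(l1, name + "_l0").copy_(getattr(stacked, name + "_l0"))
        getattr(l2, name + "_l0").copy_(getattr(stacked, name + "_l1"))

inp = torch.randn(3, 1, input_dim)  # sequence of length 3, as in the figure

out, (hid, cell) = stacked(inp)

out1, (hid1, cell1) = l1(inp)   # L1 reads the input tokens
out2, (hid2, cell2) = l2(out1)  # L2 reads the outputs of L1

# The stacked output is the output of the top level (L2) only.
outputs_match = torch.allclose(out, out2, atol=1e-5)
# The stacked hidden and cell states hold one entry per level: [L1, L2].
hidden_match = torch.allclose(hid, torch.cat([hid1, hid2], dim=0), atol=1e-5)
cell_match = torch.allclose(cell, torch.cat([cell1, cell2], dim=0), atol=1e-5)

print("outputs match:", outputs_match)
print("hidden match :", hidden_match)
print("cell match   :", cell_match)
```

<p>So <code class="language-plaintext highlighter-rouge">hid[0]</code> belongs to L1 and <code class="language-plaintext highlighter-rouge">hid[1]</code> to L2, while <code class="language-plaintext highlighter-rouge">out</code> only exposes what L2 produced.</p>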

<p>Let’s observe the changes after increasing the sentence length.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 1, 5])
output shape	: torch.Size([4, 1, 15])
hidden shape	: torch.Size([2, 1, 15])
cell shape	: torch.Size([2, 1, 15])
</code></pre></div></div>

<p>If you have followed along till now, the shapes should make sense. It’s the same as the simple LSTM layer, except that we now have two sets of hidden and cell states. The same holds with a bigger batch size:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 6, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 6, 5])
output shape	: torch.Size([4, 6, 15])
hidden shape	: torch.Size([2, 6, 15])
cell shape	: torch.Size([2, 6, 15])
</code></pre></div></div>

<p>This is also as expected. Let’s update the shape summary for the LSTM layer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (max_seq_len, batch_size, input_size)
Out Size : (max_seq_len, batch_size, hidden_size)
Hid Size : (num_layers, batch_size, hidden_size)
Cell Size: (num_layers, batch_size, hidden_size)
</code></pre></div></div>

<p>Note: In the simple LSTM layer from the previous section we had only <code class="language-plaintext highlighter-rouge">1</code> level, so the hidden and cell sizes were <code class="language-plaintext highlighter-rouge">(1, batch_size, hidden_size)</code>. That is just a specific case of the stacked LSTM layer with <code class="language-plaintext highlighter-rouge">num_layers=1</code>.</p>
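<p>Coming back to the <code class="language-plaintext highlighter-rouge">dropout</code> argument used in this section: like a regular dropout layer, it is only active in training mode. The following sketch illustrates this by calling the same stacked LSTM twice on the same input in each mode; the variable names are illustrative.</p>

```python
import torch

torch.manual_seed(0)

lstm = torch.nn.LSTM(5, 15, num_layers=2, dropout=0.5)
inp = torch.randn(4, 1, 5)

# Training mode (the default): a fresh dropout mask is drawn on every call,
# so two passes over the same input give different outputs.
lstm.train()
train_a, _ = lstm(inp)
train_b, _ = lstm(inp)
train_differs = not torch.equal(train_a, train_b)

# Evaluation mode: dropout is switched off, so outputs are repeatable.
lstm.eval()
eval_a, _ = lstm(inp)
eval_b, _ = lstm(inp)
eval_same = torch.equal(eval_a, eval_b)

print("train outputs differ:", train_differs)
print("eval outputs equal  :", eval_same)
```

<p>The shapes are unaffected by dropout in either mode; only the values change.</p>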

<h2 id="v-bidirectional-lstm-layer-">V. Bidirectional LSTM Layer <a name="bi_lstm"></a></h2>

<p>It’s easier to understand the bidirectional LSTM layer by looking at the diagram first.</p>

<p><img src="https://trigonaminima.github.io/assets/2019-09/bilstm.jpg" /></p>

<p>There are <code class="language-plaintext highlighter-rouge">2</code> layers of LSTMs running parallel to each other: one is forward (blue) and one is backward (green), hence bidirectional. Let’s assume we have a sequence, <code class="language-plaintext highlighter-rouge">[a, b, c]</code>. In the forward LSTMs (blue layer) the sequence is processed in order, that is, first <code class="language-plaintext highlighter-rouge">a</code>, then <code class="language-plaintext highlighter-rouge">b</code>, followed by <code class="language-plaintext highlighter-rouge">c</code>. In the backward LSTMs (green layer) the sequence is processed in reverse order, that is, first <code class="language-plaintext highlighter-rouge">c</code>, then <code class="language-plaintext highlighter-rouge">b</code> and then <code class="language-plaintext highlighter-rouge">a</code>. The two directions run in parallel, and their outputs are concatenated position by position. So the output of the first forward LSTM and the output of the last backward LSTM are concatenated, shown in the figure by [\(O_1\); \(O_3\)], and so on.</p>

<p>Let’s look at the code.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">input_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">hidden_dim</span> <span class="o">=</span> <span class="mi">15</span>

<span class="n">lstm</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">,</span> <span class="n">bidirectional</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">lstm</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LSTM(5, 15, bidirectional=True)
</code></pre></div></div>

<p>We are keeping the input size and hidden size the same as before to make comparison easy. In <code class="language-plaintext highlighter-rouge">nn.LSTM</code> we just have to provide <code class="language-plaintext highlighter-rouge">bidirectional=True</code> to make our LSTM layer bidirectional. Everything else remains the same.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (1, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([1, 1, 5])
output shape	: torch.Size([1, 1, 30])
hidden shape	: torch.Size([2, 1, 15])
cell shape	: torch.Size([2, 1, 15])
</code></pre></div></div>

<p>Here everything changed from the vanilla LSTM layer. The output size is double the <code class="language-plaintext highlighter-rouge">hidden_size</code> (<code class="language-plaintext highlighter-rouge">15*2</code>), and the hidden and cell shapes are the same as in the stacked LSTM. The output is doubled because the outputs of the two LSTMs are concatenated, making it <code class="language-plaintext highlighter-rouge">2*hidden_size</code>. And since we have two parallel layers of LSTMs, we’ll have two pairs of hidden and cell states as well.</p>

<p>Here are the bidirectional LSTM outputs for an input with a bigger sentence length.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 1, 5])
output shape	: torch.Size([4, 1, 30])
hidden shape	: torch.Size([2, 1, 15])
cell shape	: torch.Size([2, 1, 15])
</code></pre></div></div>

<p>Now that we know the pattern, it’s easy to see what the numbers show here. The sequence length is <code class="language-plaintext highlighter-rouge">4</code>, hence in the output we have <code class="language-plaintext highlighter-rouge">4</code> rows, and <code class="language-plaintext highlighter-rouge">30</code> (<code class="language-plaintext highlighter-rouge">15*2</code>) is because of the concatenated outputs from the two LSTMs. The hidden and cell states are the same as in the previous case.</p>
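<p>The concatenation can be undone by slicing the last dimension, which also shows where each direction finishes. This is a small sketch with the same sizes as above; <code class="language-plaintext highlighter-rouge">forward_out</code> and <code class="language-plaintext highlighter-rouge">backward_out</code> are names introduced here for illustration.</p>

```python
import torch

torch.manual_seed(0)

hidden_dim = 15
lstm = torch.nn.LSTM(5, hidden_dim, bidirectional=True)

inp = torch.randn(4, 1, 5)
out, (hid, cell) = lstm(inp)

# Split the concatenated output back into its two directions:
# the first hidden_dim values are forward, the rest are backward.
forward_out = out[:, :, :hidden_dim]
backward_out = out[:, :, hidden_dim:]

# The forward LSTM finishes at the last token,
# the backward LSTM finishes at the first token.
forward_ok = torch.allclose(forward_out[-1], hid[0])
backward_ok = torch.allclose(backward_out[0], hid[1])

print("forward final state is out[-1]:", forward_ok)
print("backward final state is out[0]:", backward_ok)
```

<p>This is worth remembering when picking a "context vector" from a bidirectional layer: <code class="language-plaintext highlighter-rouge">out[-1]</code> contains the backward LSTM after it has seen only one token.</p>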

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="c1"># random input is of shape - (4, 6, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="n">cell</span><span class="p">)</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"cell shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">cell</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 6, 5])
output shape	: torch.Size([4, 6, 30])
hidden shape	: torch.Size([2, 6, 15])
cell shape	: torch.Size([2, 6, 15])
</code></pre></div></div>

<p>The number <code class="language-plaintext highlighter-rouge">6</code> here represents the batch size, that is, outputs and hidden states are generated for all <code class="language-plaintext highlighter-rouge">6</code> sentences, as was happening previously. Apart from that, everything is the same as before. The shape summary for the bidirectional LSTM layer is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (max_seq_len, batch_size, input_size)
Out Size : (max_seq_len, batch_size, hidden_size*2)
Hid Size : (2, batch_size, hidden_size)
Cell Size: (2, batch_size, hidden_size)
</code></pre></div></div>

<p>Combining stacked LSTMs with bidirectional LSTMs, we get the following final shapes. Every other shape in the LSTM layer is derived from this generic shape.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (max_seq_len, batch_size, input_size)
Out Size : (max_seq_len, batch_size, hidden_size * num_directions)
Hid Size : (num_layers * num_directions, batch_size, hidden_size)
Cell Size: (num_layers * num_directions, batch_size, hidden_size)
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">num_directions</code> can only be 1 or 2.</p>
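<p>As a quick check of the generic shapes, here is a sketch with a stacked and bidirectional LSTM. The choice of <code class="language-plaintext highlighter-rouge">3</code> levels is arbitrary, and the reshape into separate layer and direction axes relies on PyTorch ordering the first dimension of the hidden state by level first and direction second.</p>

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim = 5, 15
num_layers, num_directions = 3, 2
seq_len, batch_size = 4, 6

lstm = torch.nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
inp = torch.randn(seq_len, batch_size, input_dim)
out, (hid, cell) = lstm(inp)

print("output shape\t:", out.shape)   # torch.Size([4, 6, 30])
print("hidden shape\t:", hid.shape)   # torch.Size([6, 6, 15])
print("cell shape\t:", cell.shape)    # torch.Size([6, 6, 15])

# Separate the level and direction axes of the hidden state.
hid_by_layer = hid.reshape(num_layers, num_directions, batch_size, hidden_dim)
top_forward, top_backward = hid_by_layer[-1]

# `out` only exposes the top level: forward half ends at the last token,
# backward half ends at the first token.
top_forward_ok = torch.allclose(top_forward, out[-1, :, :hidden_dim])
top_backward_ok = torch.allclose(top_backward, out[0, :, hidden_dim:])
print("top level matches output:", top_forward_ok and top_backward_ok)
```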

<h2 id="vi-gru-layer-">VI. GRU Layer <a name="gru"></a></h2>

<p>Another big thing. GRU, short for Gated Recurrent Unit, is another type of recurrent unit, often described as a simpler variant of the LSTM. The post <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a> by Chris Olah also talks about GRUs. You should read that post to get a better understanding.</p>

<p>In terms of inputs and outputs, the one difference between GRUs and LSTMs is that there is no cell state in GRUs. So, for an input, you’ll get the output and the hidden state as return values. Other than that, everything remains the same. You can also make it stacked and bidirectional, just like the LSTM layer. That’s why I didn’t produce a figure for this one.</p>

<p>Just to show how it works in code-</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">input_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">hidden_dim</span> <span class="o">=</span> <span class="mi">15</span>

<span class="n">gru</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">GRU</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
<span class="n">gru</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GRU(5, 15)
</code></pre></div></div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (4, 6, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span><span class="p">,</span> <span class="n">hid</span> <span class="o">=</span> <span class="n">gru</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"hidden shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">hid</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 6, 5])
output shape	: torch.Size([4, 6, 15])
hidden shape	: torch.Size([1, 6, 15])
</code></pre></div></div>

<p>If you have gone through the LSTM shapes then all the shapes above should make sense.</p>

<p>Here’s the shape summary for GRU layer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>In Size  : (max_seq_len, batch_size, input_size)
Out Size : (max_seq_len, batch_size, hidden_size * num_directions)
Hid Size : (num_layers * num_directions, batch_size, hidden_size)
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">num_directions</code> can only be 1 or 2.</p>
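<p>Since the text above only runs the plain GRU, here is a short sketch of the stacked and bidirectional case to confirm the summary. The sizes are the same as before, and <code class="language-plaintext highlighter-rouge">num_layers=2</code> is an arbitrary choice for illustration.</p>

```python
import torch

torch.manual_seed(0)

gru = torch.nn.GRU(5, 15, num_layers=2, bidirectional=True)
inp = torch.randn(4, 6, 5)

# A GRU returns (output, hidden); there is no cell state to unpack.
out, hid = gru(inp)

print("input shape\t:", inp.shape)
print("output shape\t:", out.shape)   # torch.Size([4, 6, 30])
print("hidden shape\t:", hid.shape)   # torch.Size([4, 6, 15])
```

<p>The hidden state’s first dimension is <code class="language-plaintext highlighter-rouge">num_layers * num_directions = 4</code>, exactly as in the LSTM case.</p>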

<h2 id="vii-linear-layer-">VII. Linear Layer <a name="lin"></a></h2>

<p>A linear layer, as a black box, takes an input of some shape and gives an output of the required shape. While training, it learns how to convert the input from one shape to the other. Another way to look at it: a linear layer is just a linear regression that also learns to generate features along with the coefficients. In a way, a linear layer is a <strong>bridge</strong> between any two layers. If there is a shape mismatch, just add a linear layer in between, which makes the output of the previous layer suitable as an input for the next layer. You can also call it a projector, as it projects one vector onto another. I have also read people explaining it as a connector for two layers.</p>
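<p>To make the "linear regression" view concrete, the sketch below computes the same affine map by hand, \(y = Wx + b\), and compares it with PyTorch’s functional form. The tensors <code class="language-plaintext highlighter-rouge">W</code> and <code class="language-plaintext highlighter-rouge">b</code> are random stand-ins for learned parameters, introduced only for illustration.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(5)       # input vector of length 5
W = torch.randn(15, 5)   # weights, shape (out_features, in_features)
b = torch.randn(15)      # bias, one value per output feature

# Each of the 15 outputs is its own linear regression on the 5 inputs.
y_manual = W @ x + b
y_torch = F.linear(x, W, b)

manual_matches = torch.allclose(y_manual, y_torch, atol=1e-5)
print("output shape\t:", y_manual.shape)
print("manual == F.linear:", manual_matches)
```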

<p>Here’s the code-</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">input_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">output_dim</span> <span class="o">=</span> <span class="mi">15</span>

<span class="n">linear_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">)</span>
<span class="n">linear_layer</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Linear(in_features=5, out_features=15, bias=True)
</code></pre></div></div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="c1"># Random input with shape - (5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span> <span class="o">=</span> <span class="n">linear_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([5])
output shape	: torch.Size([15])
</code></pre></div></div>

<p>As you can see, a vector of length <code class="language-plaintext highlighter-rouge">5</code> is converted into a vector of length <code class="language-plaintext highlighter-rouge">15</code>. For inputs with more dimensions, we get the following results:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1"># Random input with shape - (1, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span> <span class="o">=</span> <span class="n">linear_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([1, 1, 5])
output shape	: torch.Size([1, 1, 15])
</code></pre></div></div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1"># Random input with shape - (4, 1, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span> <span class="o">=</span> <span class="n">linear_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 1, 5])
output shape	: torch.Size([4, 1, 15])
</code></pre></div></div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1"># Random input with shape - (4, 6, 5)
</span><span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"input shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">inp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="n">out</span> <span class="o">=</span> <span class="n">linear_layer</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"output shape</span><span class="se">\t</span><span class="s">:"</span><span class="p">,</span> <span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input shape	: torch.Size([4, 6, 5])
output shape	: torch.Size([4, 6, 15])
</code></pre></div></div>

<p>In the above three code segments, we first passed a batch containing a single sentence with one token, then increased the sentence length and finally increased the number of sentences in the batch. In all the cases, only the last dimension changed: each token representation (or embedding) of length <code class="language-plaintext highlighter-rouge">5</code> got transformed into a vector of length <code class="language-plaintext highlighter-rouge">15</code>, while the leading dimensions were left untouched.</p>
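<p>To make the arithmetic behind these shapes concrete, here is a minimal pure-Python sketch of what a linear layer computes: <code class="language-plaintext highlighter-rouge">y = W·x + b</code>, applied independently to every token vector. The helper names (<code class="language-plaintext highlighter-rouge">make_linear</code>, <code class="language-plaintext highlighter-rouge">apply_linear</code>, <code class="language-plaintext highlighter-rouge">shape</code>) and the random initialisation are my own assumptions for illustration; they are not PyTorch’s API or its actual initialisation scheme.</p>

```python
import random


def make_linear(input_dim, output_dim, seed=0):
    """Build a weight matrix (output_dim x input_dim) and a bias vector (output_dim)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(output_dim)]
    bias = [rng.uniform(-1, 1) for _ in range(output_dim)]
    return weight, bias


def apply_linear(x, weight, bias):
    """Apply y = W.x + b to the innermost vectors of a (possibly nested) list."""
    if isinstance(x[0], list):
        # Not a vector yet: recurse so that the leading dimensions are preserved.
        return [apply_linear(item, weight, bias) for item in x]
    # One weighted sum plus bias per output feature.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


weight, bias = make_linear(input_dim=5, output_dim=15)

# (sentence length, batch size, embedding size) = (4, 6, 5)
rng = random.Random(1)
inp = [[[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)] for _ in range(4)]
out = apply_linear(inp, weight, bias)

print("input shape :", shape(inp))
print("output shape:", shape(out))
```

<p>Every vector of length 5 is mapped to a vector of length 15 with the same weights and bias, which is why the leading dimensions pass through unchanged.</p>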

<h2 id="conclusion">Conclusion</h2>

<p>The 7 layers explained above - <strong>Embedding</strong>, <strong>Dropout</strong>, <strong>LSTM</strong>, <strong>Stacked LSTM</strong>, <strong>Bidirectional LSTM</strong>, <strong>GRU</strong>, <strong>Linear</strong> - are the major components used to build a Seq2Seq architecture. There are other smaller components like <code class="language-plaintext highlighter-rouge">softmax</code>, <code class="language-plaintext highlighter-rouge">tanh</code>, etc. which I didn’t talk about. These are generic functions that are used in many traditional machine learning algorithms as well, and they are easy to explain while working with Seq2Seq architectures.</p>

<p>Next up are attentional interfaces, which make a Seq2Seq model highly performant. They are also used in Transformers, which are better than RNN-based Seq2Seq models in many aspects. In fact, this attention mechanism eliminated the use of RNNs in Transformers, making them more efficient and effective. But I am getting ahead of myself here.</p>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="DL" /><summary type="html"><![CDATA[Real world data is almost always in bad shape. You have to clean it properly to make any use of it. Cleaning becomes more important if this is your training data for a machine learning model. And these problems especially become worse if you are dealing with short text. I am talking about the text generated on platforms like Twitter, Facebook, YouTube, Instagram, WhatsApp, Telegram etc. These platforms and many others, also give searching capabilities, which is another source of such short textual data. We make a lot of mistakes. It’s common to make typos or use non-standard (short) forms of various words. It’s also common to have non-standard abbreviations by the interlocutors. It’s also common for multilingual users to use romanized forms of their native language mixed in with the English text (phenomenon called, code switching). I discovered many of these issues when I worked on a Telegram chatbot to do something fun with the conversations happening. 
Here are details on what all I did.]]></summary></entry><entry><title type="html">Understanding WX notation</title><link href="https://trigonaminima.github.io/2019/03/wx_notation/" rel="alternate" type="text/html" title="Understanding WX notation" /><published>2019-03-02T00:00:00+00:00</published><updated>2019-03-02T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2019/03/wx_notation</id><content type="html" xml:base="https://trigonaminima.github.io/2019/03/wx_notation/"><![CDATA[<p>In this post, I’ll discuss the <a href="https://en.wikipedia.org/wiki/WX_notation">WX notation</a>, which is used for computational processing of Indian languages. We’ll work with the Devanagari script, which has 47 primary characters - 14 vowels and 33 consonants. We’ll see how, using WX notation, we can convert Devanagari Unicode characters to Roman ASCII characters. This process of converting text from one script to another is called <a href="https://en.wikipedia.org/wiki/Transliteration">transliteration</a>. So WX notation is a transliteration scheme made specifically for NLP. Note that WX is not the same as the <a href="https://trigonaminima.github.io/2018/06/hinglish-and-transliteration/">informal transliteration</a> used in general conversations: each word has exactly one WX representation.</p>

<p>To understand how it works and why to use it, let’s cover some background topics.</p>

<ol>
  <li><a href="#groundwork">Groundwork</a>
    <ol>
      <li><a href="#dev">Devanagari Script for Hindi</a></li>
      <li><a href="#prefix">Prefix Code</a></li>
      <li><a href="#unicode_ascii">Size: Unicode vs ASCII</a></li>
    </ol>
  </li>
  <li><a href="#why_wx">Why use WX notation</a></li>
  <li><a href="#how_wx">How WX works?</a></li>
  <li><a href="#wx">WX implementation</a></li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">string</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
</code></pre></div></div>

<h2 id="groundwork-">Groundwork <a name="groundwork"></a></h2>

<h3 id="1-devanagari-script-">1. Devanagari Script <a name="dev"></a></h3>

<p>Since WX works on the Devanagari script, it’ll be good to have some understanding of the Devanagari character set - vowels and consonants - and how they combine together to make a word. The Devanagari script has the following characteristics:</p>

<ol>
  <li>Conventions for writing in Devanagari focus on pronunciation.</li>
  <li>There is no concept of letter case, unlike the Roman script.</li>
  <li>A horizontal line runs along the top of full letters (a visual way to identify the Devanagari script).</li>
</ol>

<p>The arrangement of Devanagari letters is called the varnamala (वर्णमाला):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">hin_vowels</span> <span class="o">=</span> <span class="p">[</span><span class="s">"अ"</span><span class="p">,</span> <span class="s">"आ"</span><span class="p">,</span> <span class="s">"इ"</span><span class="p">,</span> <span class="s">"ई"</span><span class="p">,</span> <span class="s">"उ"</span><span class="p">,</span> <span class="s">"ऊ"</span><span class="p">,</span> <span class="s">"ए"</span><span class="p">,</span> <span class="s">"ऐ"</span><span class="p">,</span> <span class="s">"ओ"</span><span class="p">,</span> <span class="s">"औ"</span><span class="p">]</span>
<span class="n">hin_sonorants</span> <span class="o">=</span> <span class="p">[</span><span class="s">"ऋ"</span><span class="p">,</span> <span class="s">"ॠ"</span><span class="p">,</span> <span class="s">"ऌ"</span><span class="p">]</span>
<span class="n">hin_anuswara</span> <span class="o">=</span> <span class="p">[</span><span class="s">"अं"</span><span class="p">]</span>
<span class="n">hin_nukta</span> <span class="o">=</span> <span class="p">[</span><span class="s">"़"</span><span class="p">]</span>
<span class="n">hin_consonants</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"क"</span><span class="p">,</span> <span class="s">"ख"</span><span class="p">,</span> <span class="s">"ग"</span><span class="p">,</span> <span class="s">"घ"</span><span class="p">,</span> <span class="s">"ङ"</span><span class="p">,</span>
    <span class="s">"च"</span><span class="p">,</span> <span class="s">"छ"</span><span class="p">,</span> <span class="s">"ज"</span><span class="p">,</span> <span class="s">"झ"</span><span class="p">,</span> <span class="s">"ञ"</span><span class="p">,</span>
    <span class="s">"ट"</span><span class="p">,</span> <span class="s">"ठ"</span><span class="p">,</span> <span class="s">"ड"</span><span class="p">,</span> <span class="s">"ढ"</span><span class="p">,</span> <span class="s">"ण"</span><span class="p">,</span>
    <span class="s">"त"</span><span class="p">,</span> <span class="s">"थ"</span><span class="p">,</span> <span class="s">"द"</span><span class="p">,</span> <span class="s">"ध"</span><span class="p">,</span> <span class="s">"न"</span><span class="p">,</span>
    <span class="s">"प"</span><span class="p">,</span> <span class="s">"फ"</span><span class="p">,</span> <span class="s">"ब"</span><span class="p">,</span> <span class="s">"भ"</span><span class="p">,</span> <span class="s">"म"</span><span class="p">,</span>
    <span class="s">"य"</span><span class="p">,</span> <span class="s">"र"</span><span class="p">,</span> <span class="s">"ल"</span><span class="p">,</span> <span class="s">"व"</span><span class="p">,</span>
    <span class="s">"श"</span><span class="p">,</span> <span class="s">"ष"</span><span class="p">,</span> <span class="s">"स"</span><span class="p">,</span> <span class="s">"ह"</span>
<span class="p">]</span>
</pre></td></tr></tbody></table></code></pre></figure>
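<p>To see how these characters combine into a word, the sketch below splits the word हिंदी (“Hindi”) into its Unicode code points using Python’s standard <code class="language-plaintext highlighter-rouge">unicodedata</code> module. The choice of word is my own, and it is written with escape sequences so that the exact code points are unambiguous.</p>

```python
import unicodedata

# The word "Hindi" in Devanagari, spelled out code point by code point.
word = "\u0939\u093f\u0902\u0926\u0940"

for char in word:
    # unicodedata.name returns the official Unicode name of a code point.
    print(f"U+{ord(char):04X}", unicodedata.name(char))

print("Code points in the word:", len(word))
```

<p>The two consonants are stored as full letters, while the vowels that follow them are stored as separate dependent vowel signs (matras), so this short word occupies five code points.</p>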

<h3 id="2-prefix-code-">2. Prefix Code <a name="prefix"></a></h3>

<p>An example first. When you set up two-factor authentication on an online account, the form asks for your cellphone number. The number is usually prefixed with a country code, or you are asked to add one. For India, it’s +91. Now, if you look at the <a href="https://en.wikipedia.org/wiki/List_of_country_calling_codes">complete list of country codes</a>, you will not find any other country code starting with +91. A set of codes with this property, where no code is the beginning of any other code, is called a prefix code.</p>

<p>We will take the complete list of country codes, pick a few of them and check whether any other country code starts with the picked code. Let’s see it in action.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">all_country_codes</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">31</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">33</span><span class="p">,</span> <span class="mi">34</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">39</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">41</span><span class="p">,</span> <span class="mi">43</span><span class="p">,</span> <span class="mi">44</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">46</span><span class="p">,</span> <span class="mi">47</span><span class="p">,</span>
    <span class="mi">48</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">51</span><span class="p">,</span> <span class="mi">52</span><span class="p">,</span> <span class="mi">53</span><span class="p">,</span> <span class="mi">54</span><span class="p">,</span> <span class="mi">55</span><span class="p">,</span> <span class="mi">56</span><span class="p">,</span> <span class="mi">57</span><span class="p">,</span> <span class="mi">58</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">61</span><span class="p">,</span> <span class="mi">62</span><span class="p">,</span> <span class="mi">63</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">65</span><span class="p">,</span> <span class="mi">66</span><span class="p">,</span> <span class="mi">81</span><span class="p">,</span>
    <span class="mi">82</span><span class="p">,</span> <span class="mi">84</span><span class="p">,</span> <span class="mi">86</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="mi">91</span><span class="p">,</span> <span class="mi">92</span><span class="p">,</span> <span class="mi">93</span><span class="p">,</span> <span class="mi">94</span><span class="p">,</span> <span class="mi">95</span><span class="p">,</span> <span class="mi">98</span><span class="p">,</span> <span class="mi">211</span><span class="p">,</span> <span class="mi">212</span><span class="p">,</span> <span class="mi">213</span><span class="p">,</span> <span class="mi">216</span><span class="p">,</span> <span class="mi">218</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span>
    <span class="mi">221</span><span class="p">,</span> <span class="mi">222</span><span class="p">,</span> <span class="mi">223</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">225</span><span class="p">,</span> <span class="mi">226</span><span class="p">,</span> <span class="mi">227</span><span class="p">,</span> <span class="mi">228</span><span class="p">,</span> <span class="mi">229</span><span class="p">,</span> <span class="mi">230</span><span class="p">,</span> <span class="mi">231</span><span class="p">,</span> <span class="mi">232</span><span class="p">,</span> <span class="mi">233</span><span class="p">,</span> <span class="mi">234</span><span class="p">,</span>
    <span class="mi">235</span><span class="p">,</span> <span class="mi">236</span><span class="p">,</span> <span class="mi">237</span><span class="p">,</span> <span class="mi">238</span><span class="p">,</span> <span class="mi">239</span><span class="p">,</span> <span class="mi">240</span><span class="p">,</span> <span class="mi">241</span><span class="p">,</span> <span class="mi">242</span><span class="p">,</span> <span class="mi">243</span><span class="p">,</span> <span class="mi">244</span><span class="p">,</span> <span class="mi">245</span><span class="p">,</span> <span class="mi">246</span><span class="p">,</span> <span class="mi">247</span><span class="p">,</span> <span class="mi">248</span><span class="p">,</span>
    <span class="mi">249</span><span class="p">,</span> <span class="mi">250</span><span class="p">,</span> <span class="mi">251</span><span class="p">,</span> <span class="mi">252</span><span class="p">,</span> <span class="mi">253</span><span class="p">,</span> <span class="mi">254</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">258</span><span class="p">,</span> <span class="mi">260</span><span class="p">,</span> <span class="mi">261</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">263</span><span class="p">,</span>
    <span class="mi">264</span><span class="p">,</span> <span class="mi">265</span><span class="p">,</span> <span class="mi">266</span><span class="p">,</span> <span class="mi">267</span><span class="p">,</span> <span class="mi">268</span><span class="p">,</span> <span class="mi">269</span><span class="p">,</span> <span class="mi">290</span><span class="p">,</span> <span class="mi">291</span><span class="p">,</span> <span class="mi">297</span><span class="p">,</span> <span class="mi">298</span><span class="p">,</span> <span class="mi">299</span><span class="p">,</span> <span class="mi">350</span><span class="p">,</span> <span class="mi">351</span><span class="p">,</span> <span class="mi">352</span><span class="p">,</span>
    <span class="mi">353</span><span class="p">,</span> <span class="mi">354</span><span class="p">,</span> <span class="mi">355</span><span class="p">,</span> <span class="mi">356</span><span class="p">,</span> <span class="mi">357</span><span class="p">,</span> <span class="mi">358</span><span class="p">,</span> <span class="mi">359</span><span class="p">,</span> <span class="mi">370</span><span class="p">,</span> <span class="mi">371</span><span class="p">,</span> <span class="mi">372</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">374</span><span class="p">,</span> <span class="mi">375</span><span class="p">,</span> <span class="mi">376</span><span class="p">,</span>
    <span class="mi">377</span><span class="p">,</span> <span class="mi">378</span><span class="p">,</span> <span class="mi">379</span><span class="p">,</span> <span class="mi">380</span><span class="p">,</span> <span class="mi">381</span><span class="p">,</span> <span class="mi">382</span><span class="p">,</span> <span class="mi">383</span><span class="p">,</span> <span class="mi">385</span><span class="p">,</span> <span class="mi">386</span><span class="p">,</span> <span class="mi">387</span><span class="p">,</span> <span class="mi">389</span><span class="p">,</span> <span class="mi">420</span><span class="p">,</span> <span class="mi">421</span><span class="p">,</span> <span class="mi">423</span><span class="p">,</span>
    <span class="mi">500</span><span class="p">,</span> <span class="mi">501</span><span class="p">,</span> <span class="mi">502</span><span class="p">,</span> <span class="mi">503</span><span class="p">,</span> <span class="mi">504</span><span class="p">,</span> <span class="mi">505</span><span class="p">,</span> <span class="mi">506</span><span class="p">,</span> <span class="mi">507</span><span class="p">,</span> <span class="mi">508</span><span class="p">,</span> <span class="mi">509</span><span class="p">,</span> <span class="mi">590</span><span class="p">,</span> <span class="mi">591</span><span class="p">,</span> <span class="mi">592</span><span class="p">,</span> <span class="mi">593</span><span class="p">,</span>
    <span class="mi">594</span><span class="p">,</span> <span class="mi">595</span><span class="p">,</span> <span class="mi">596</span><span class="p">,</span> <span class="mi">597</span><span class="p">,</span> <span class="mi">598</span><span class="p">,</span> <span class="mi">599</span><span class="p">,</span> <span class="mi">670</span><span class="p">,</span> <span class="mi">672</span><span class="p">,</span> <span class="mi">673</span><span class="p">,</span> <span class="mi">674</span><span class="p">,</span> <span class="mi">675</span><span class="p">,</span> <span class="mi">676</span><span class="p">,</span> <span class="mi">677</span><span class="p">,</span> <span class="mi">678</span><span class="p">,</span>
    <span class="mi">679</span><span class="p">,</span> <span class="mi">680</span><span class="p">,</span> <span class="mi">681</span><span class="p">,</span> <span class="mi">682</span><span class="p">,</span> <span class="mi">683</span><span class="p">,</span> <span class="mi">685</span><span class="p">,</span> <span class="mi">686</span><span class="p">,</span> <span class="mi">687</span><span class="p">,</span> <span class="mi">688</span><span class="p">,</span> <span class="mi">689</span><span class="p">,</span> <span class="mi">690</span><span class="p">,</span> <span class="mi">691</span><span class="p">,</span> <span class="mi">692</span><span class="p">,</span> <span class="mi">800</span><span class="p">,</span>
    <span class="mi">808</span><span class="p">,</span> <span class="mi">850</span><span class="p">,</span> <span class="mi">852</span><span class="p">,</span> <span class="mi">853</span><span class="p">,</span> <span class="mi">855</span><span class="p">,</span> <span class="mi">856</span><span class="p">,</span> <span class="mi">870</span><span class="p">,</span> <span class="mi">878</span><span class="p">,</span> <span class="mi">880</span><span class="p">,</span> <span class="mi">881</span><span class="p">,</span> <span class="mi">882</span><span class="p">,</span> <span class="mi">883</span><span class="p">,</span> <span class="mi">886</span><span class="p">,</span> <span class="mi">888</span><span class="p">,</span>
    <span class="mi">960</span><span class="p">,</span> <span class="mi">961</span><span class="p">,</span> <span class="mi">962</span><span class="p">,</span> <span class="mi">963</span><span class="p">,</span> <span class="mi">964</span><span class="p">,</span> <span class="mi">965</span><span class="p">,</span> <span class="mi">966</span><span class="p">,</span> <span class="mi">967</span><span class="p">,</span> <span class="mi">968</span><span class="p">,</span> <span class="mi">970</span><span class="p">,</span> <span class="mi">971</span><span class="p">,</span> <span class="mi">972</span><span class="p">,</span> <span class="mi">973</span><span class="p">,</span> <span class="mi">974</span><span class="p">,</span>
    <span class="mi">975</span><span class="p">,</span> <span class="mi">976</span><span class="p">,</span> <span class="mi">977</span><span class="p">,</span> <span class="mi">979</span><span class="p">,</span> <span class="mi">992</span><span class="p">,</span> <span class="mi">993</span><span class="p">,</span> <span class="mi">994</span><span class="p">,</span> <span class="mi">995</span><span class="p">,</span> <span class="mi">996</span><span class="p">,</span> <span class="mi">998</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">get_codes_starting_with</span><span class="p">(</span><span class="n">prefix</span><span class="p">):</span>
    <span class="s">"""
    Returns all the country codes starting with the `prefix`.
    """</span>
    <span class="n">found_codes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">code</span> <span class="ow">in</span> <span class="n">all_country_codes</span><span class="p">:</span>
        <span class="k">if</span> <span class="nb">str</span><span class="p">(</span><span class="n">code</span><span class="p">).</span><span class="n">startswith</span><span class="p">(</span><span class="n">prefix</span><span class="p">):</span>
            <span class="n">found_codes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">found_codes</span>


<span class="n">check_codes</span> <span class="o">=</span> <span class="p">[</span><span class="s">"91"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="s">"7"</span><span class="p">,</span> <span class="s">"41"</span><span class="p">,</span> <span class="s">"57"</span><span class="p">]</span>
<span class="k">for</span> <span class="n">check_code</span> <span class="ow">in</span> <span class="n">check_codes</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Prefix to check:"</span><span class="p">,</span> <span class="n">check_code</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Found match:"</span><span class="p">,</span> <span class="o">*</span><span class="n">get_codes_starting_with</span><span class="p">(</span><span class="n">check_code</span><span class="p">))</span>
    <span class="k">print</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prefix to check: 91
Found match: 91

Prefix to check: 1
Found match: 1

Prefix to check: 7
Found match: 7

Prefix to check: 41
Found match: 41

Prefix to check: 57
Found match: 57
</code></pre></div></div>

<p>For each case, only the country code itself was found as a match. Prefix codes have a very useful property - given a sequence, you can identify each word uniquely without the need of any marker between words. Let’s take the example of country codes again. We’ll take 10 random country codes, concatenate them together into a single string and then we’ll decode the string into the original 10 components.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">2019</span><span class="p">)</span>
<span class="n">rand_codes</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">all_country_codes</span><span class="p">),</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Random country codes:"</span><span class="p">,</span> <span class="o">*</span><span class="n">rand_codes</span><span class="p">)</span>
<span class="n">rand_codes_combined</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">rand_codes</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Concatenated codes string:"</span><span class="p">,</span> <span class="n">rand_codes_combined</span><span class="p">)</span>

<span class="n">orig_rand_codes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">current_code</span> <span class="o">=</span> <span class="s">""</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">rand_codes_combined</span><span class="p">:</span>
    <span class="n">current_code</span> <span class="o">+=</span> <span class="n">i</span>
    <span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">current_code</span><span class="p">)</span> <span class="ow">in</span> <span class="n">all_country_codes</span><span class="p">:</span>
        <span class="n">orig_rand_codes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">current_code</span><span class="p">)</span>
        <span class="n">current_code</span> <span class="o">=</span> <span class="s">""</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Decoded parts:"</span><span class="p">,</span> <span class="o">*</span><span class="n">orig_rand_codes</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Random country codes: 421 381 994 65 853 421 507 993 382 503
Concatenated codes string: 42138199465853421507993382503
Decoded parts: 421 381 994 65 853 421 507 993 382 503
</code></pre></div></div>

<p>As you can see, decoding the sequence was very easy, and we didn’t need any separator between the codes. This works because no code is a prefix of another one.</p>
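
<p>The decoding loop above only works when the code set is prefix-free. Here is a minimal sketch of how one could check that property before relying on separator-free decoding. The helper names <code class="language-plaintext highlighter-rouge">is_prefix_free</code> and <code class="language-plaintext highlighter-rouge">decode</code>, and the small <code class="language-plaintext highlighter-rouge">sample_codes</code> set, are my own assumptions for illustration and use strings rather than the integers used above.</p>

```python
def is_prefix_free(codes):
    """Return True if no code is a proper prefix of another code."""
    codes = set(codes)
    return not any(
        other != code and other.startswith(code)
        for code in codes
        for other in codes
    )


def decode(encoded, codes):
    """Greedy left-to-right decoding; only safe for a prefix-free code set."""
    parts, current = [], ""
    for symbol in encoded:
        current += symbol
        if current in codes:
            parts.append(current)
            current = ""
    return parts


# Hypothetical subset of codes, written as strings for simplicity.
sample_codes = {"421", "381", "994", "65"}
print("Prefix-free:", is_prefix_free(sample_codes))
print("Decoded parts:", *decode("42138199465", sample_codes))

# A set where "42" is a prefix of "421" cannot be decoded unambiguously.
print("Prefix-free:", is_prefix_free({"42", "421"}))
```

<p>If the check fails, the greedy loop would stop at the shorter code and split the sequence in the wrong place, which is why a separator would then be needed.</p>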

<h3 id="3-size-unicode-vs-ascii-">3. Size: Unicode vs ASCII <a name="unicode_ascii"></a></h3>

<p>You can find a lot of literature on Unicode and ASCII: their uses, differences, and so on. Here I’ll only discuss the size difference between the Devanagari script and the Roman script. Unicode is a superset of ASCII; the code points 0-127 have the same meaning in ASCII as they have in Unicode. Each ASCII character fits in a single 8-bit byte (ASCII itself needs only 7 bits), whereas a Devanagari character won’t fit in a single byte, so an encoding such as UTF-8 needs multiple bytes to represent 1 character.</p>
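
<p>To make the superset relation concrete, here is a small sketch that prints the Unicode code point and the UTF-8 byte count of a few characters. The helper name <code class="language-plaintext highlighter-rouge">utf8_info</code> is an assumption of mine, not something used elsewhere in this post.</p>

```python
def utf8_info(char):
    """Return (Unicode code point, number of UTF-8 bytes) for one character."""
    return ord(char), len(char.encode("utf-8"))


# "\u0915" is the Devanagari letter "ka".
for char in ["a", "z", "\u0915"]:
    code_point, n_bytes = utf8_info(char)
    print(repr(char), hex(code_point), n_bytes)

# For pure ASCII text, the ASCII and UTF-8 encodings are byte-for-byte identical.
print("wx".encode("ascii") == "wx".encode("utf-8"))
```

<p>Code points up to 127 are encoded by UTF-8 in one byte, while the Devanagari block sits much higher in the code space and needs three.</p>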

<p>Let’s look at the actual sizes of all the Roman and Devanagari characters.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">print</span><span class="p">(</span><span class="s">"Roman characters"</span><span class="p">)</span>
<span class="n">roman_chars</span> <span class="o">=</span> <span class="n">string</span><span class="p">.</span><span class="n">ascii_letters</span><span class="p">[:</span><span class="mi">26</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">roman_char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">roman_chars</span><span class="p">):</span>
    <span class="k">print</span><span class="p">((</span><span class="n">roman_char</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">roman_char</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">))),</span> <span class="n">end</span><span class="o">=</span><span class="s">" "</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">()</span>
<span class="k">print</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Devanagari characters"</span><span class="p">)</span>
<span class="n">devanagari_chars</span> <span class="o">=</span> <span class="n">hin_vowels</span> <span class="o">+</span> <span class="n">hin_sonorants</span> <span class="o">+</span> <span class="n">hin_anuswara</span> <span class="o">+</span> <span class="n">hin_consonants</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">devanagari_char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">devanagari_chars</span><span class="p">):</span>
    <span class="k">print</span><span class="p">((</span><span class="n">devanagari_char</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">devanagari_char</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">))),</span> <span class="n">end</span><span class="o">=</span><span class="s">" "</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">()</span>
<span class="k">print</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Roman characters
('a', 1) ('b', 1) ('c', 1) ('d', 1) ('e', 1) ('f', 1) ('g', 1) ('h', 1) ('i', 1)
('j', 1) ('k', 1) ('l', 1) ('m', 1) ('n', 1) ('o', 1) ('p', 1) ('q', 1) ('r', 1)
('s', 1) ('t', 1) ('u', 1) ('v', 1) ('w', 1) ('x', 1) ('y', 1) ('z', 1)

Devanagari characters
('अ', 3) ('आ', 3) ('इ', 3) ('ई', 3) ('उ', 3) ('ऊ', 3) ('ए', 3) ('ऐ', 3) ('ओ', 3)
('औ', 3) ('ऋ', 3) ('ॠ', 3) ('ऌ', 3) ('अं', 6) ('क', 3) ('ख', 3) ('ग', 3) ('घ', 3)
('ङ', 3) ('च', 3) ('छ', 3) ('ज', 3) ('झ', 3) ('ञ', 3) ('ट', 3) ('ठ', 3) ('ड', 3)
('ढ', 3) ('ण', 3) ('त', 3) ('थ', 3) ('द', 3) ('ध', 3) ('न', 3) ('प', 3) ('फ', 3)
('ब', 3) ('भ', 3) ('म', 3) ('य', 3) ('र', 3) ('ल', 3) ('व', 3) ('श', 3) ('ष', 3)
('स', 3) ('ह', 3)
</code></pre></div></div>

<p>So all the Roman characters take 1 byte each, whereas all the Devanagari characters take 3 bytes each when encoded as UTF-8. The one exception, अं, takes 6 bytes because it is really two characters: the vowel अ followed by the anuswara sign. Thus, Devanagari characters (Unicode) are more memory intensive than Roman characters (ASCII), and because of this, working with ASCII characters is more efficient.</p>
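
<p>The 6-byte entry in the output above is worth a closer look. A minimal sketch, using only the standard library, that splits it into its code points (the variable name <code class="language-plaintext highlighter-rouge">anuswara_a</code> is just illustrative):</p>

```python
import unicodedata

# The string that is displayed as a single glyph in the output above:
# the vowel "a" (U+0905) followed by the anuswara sign (U+0902).
anuswara_a = "\u0905\u0902"

print("code points:", len(anuswara_a))
print("UTF-8 bytes:", len(anuswara_a.encode("utf-8")))

for char in anuswara_a:
    # Each code point on its own still takes 3 bytes.
    print(hex(ord(char)), unicodedata.name(char), len(char.encode("utf-8")))
```

<p>So the rule of 3 bytes per Devanagari code point still holds; it is the visible character that is made of two code points.</p>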

<h2 id="why-use-wx-notation-">Why use WX notation? <a name="why_wx"></a></h2>

<p>Since WX was made specifically for NLP, it tries to make many things efficient and easy.</p>

<ul>
  <li>Computational and Memory Efficiency
    <ol>
      <li>In WX, every consonant and every vowel has a single mapping into Roman, which makes it a prefix code. We discussed the advantages of a prefix code in the previous section.</li>
      <li>As we are working with ASCII rather than Unicode, we also get memory efficiency, as the size comparison in the previous section shows.</li>
    </ol>
  </li>
  <li>Readability
    <ol>
      <li>WX allows anyone to read a string in any Indic language without knowing the original script. This helps when analysing the output of the developed system.</li>
    </ol>
  </li>
</ul>
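
<p>To put a rough number on the memory point, here is a small sketch comparing the UTF-8 size of a Hindi word with the size of its WX form. The word pairs are the ones used in the evaluation further down in this post; the helper name <code class="language-plaintext highlighter-rouge">utf8_size</code> is my own assumption.</p>

```python
def utf8_size(text):
    """Number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


# (Hindi word, WX notation) pairs taken from the evaluation in this post.
pairs = [("शहरों", "SaharoM"), ("कोलकाता", "kolakAwA")]

for hindi_word, wx_word in pairs:
    print(
        hindi_word, utf8_size(hindi_word), "bytes vs",
        wx_word, utf8_size(wx_word), "bytes",
    )
```

<p>Note that the WX string has more characters than the Hindi one, because the inherent “a” sound is written out explicitly, yet it still needs fewer bytes.</p>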

<h2 id="how-wx-works-">How WX works? <a name="how_wx"></a></h2>

<p>Now that we have understood the basic concepts related to the Devanagari script and the reasons why WX notation is helpful for us, we’ll get into the workings of WX notation.</p>

<h3 id="hindi-to-wx">Hindi to WX</h3>

<p>At the base of WX notation is the following character mapping. Note that this mapping is not complete. The actual mapping handles various corner cases and includes more characters that are not a part of the actual <em>varnamala</em>. I’ll still show how the conversion is done using the mapping defined below. I’ll take a few Hindi words and compare their true WX notation (determined using this <a href="http://sanskrit.uohyd.ac.in/scl/">online Sanskrit toolkit</a>) with our function’s output.</p>

<p>Here’s our Hindi to ASCII character mapping.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">hin2wx_vowels</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"अ"</span><span class="p">:</span> <span class="s">"a"</span><span class="p">,</span>
    <span class="s">"आ"</span><span class="p">:</span> <span class="s">"A"</span><span class="p">,</span>
    <span class="s">"इ"</span><span class="p">:</span> <span class="s">"i"</span><span class="p">,</span>
    <span class="s">"ई"</span><span class="p">:</span> <span class="s">"I"</span><span class="p">,</span>
    <span class="s">"उ"</span><span class="p">:</span> <span class="s">"u"</span><span class="p">,</span>
    <span class="s">"ऊ"</span><span class="p">:</span> <span class="s">"U"</span><span class="p">,</span>
    <span class="s">"ए"</span><span class="p">:</span> <span class="s">"e"</span><span class="p">,</span>
    <span class="s">"ऐ"</span><span class="p">:</span> <span class="s">"E"</span><span class="p">,</span>
    <span class="s">"ओ"</span><span class="p">:</span> <span class="s">"o"</span><span class="p">,</span>
    <span class="s">"औ"</span><span class="p">:</span> <span class="s">"O"</span><span class="p">,</span>
    <span class="s">"ै"</span><span class="p">:</span> <span class="s">"E"</span><span class="p">,</span>
    <span class="s">"ा"</span><span class="p">:</span> <span class="s">"A"</span><span class="p">,</span>
    <span class="s">"ो"</span><span class="p">:</span> <span class="s">"o"</span><span class="p">,</span>
    <span class="s">"ू"</span><span class="p">:</span> <span class="s">"U"</span><span class="p">,</span>
    <span class="s">"ु"</span><span class="p">:</span> <span class="s">"u"</span><span class="p">,</span>
    <span class="s">"ि"</span><span class="p">:</span> <span class="s">"i"</span><span class="p">,</span>
    <span class="s">"ी"</span><span class="p">:</span> <span class="s">"I"</span><span class="p">,</span>
    <span class="s">"े"</span><span class="p">:</span> <span class="s">"e"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">hin2wx_sonorants</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"ऋ"</span><span class="p">:</span> <span class="s">"q"</span><span class="p">,</span>
    <span class="s">"ॠ"</span><span class="p">:</span> <span class="s">"Q"</span><span class="p">,</span>
    <span class="s">"ऌ"</span><span class="p">:</span> <span class="s">"L"</span>
<span class="p">}</span>
<span class="n">hin2wx_anuswara</span> <span class="o">=</span> <span class="p">{</span><span class="s">"अं"</span><span class="p">:</span> <span class="s">"M"</span><span class="p">,</span> <span class="s">"ं"</span><span class="p">:</span> <span class="s">"M"</span><span class="p">}</span>
<span class="n">hin2wx_consonants</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"क"</span><span class="p">:</span> <span class="s">"k"</span><span class="p">,</span>
    <span class="s">"ख"</span><span class="p">:</span> <span class="s">"K"</span><span class="p">,</span>
    <span class="s">"ग"</span><span class="p">:</span> <span class="s">"g"</span><span class="p">,</span>
    <span class="s">"घ"</span><span class="p">:</span> <span class="s">"G"</span><span class="p">,</span>
    <span class="s">"ङ"</span><span class="p">:</span> <span class="s">"f"</span><span class="p">,</span>
    <span class="s">"च"</span><span class="p">:</span> <span class="s">"c"</span><span class="p">,</span>
    <span class="s">"छ"</span><span class="p">:</span> <span class="s">"C"</span><span class="p">,</span>
    <span class="s">"ज"</span><span class="p">:</span> <span class="s">"j"</span><span class="p">,</span>
    <span class="s">"झ"</span><span class="p">:</span> <span class="s">"J"</span><span class="p">,</span>
    <span class="s">"ञ"</span><span class="p">:</span> <span class="s">"F"</span><span class="p">,</span>
    <span class="s">"ट"</span><span class="p">:</span> <span class="s">"t"</span><span class="p">,</span>
    <span class="s">"ठ"</span><span class="p">:</span> <span class="s">"T"</span><span class="p">,</span>
    <span class="s">"ड"</span><span class="p">:</span> <span class="s">"d"</span><span class="p">,</span>
    <span class="s">"ढ"</span><span class="p">:</span> <span class="s">"D"</span><span class="p">,</span>
    <span class="s">"ण"</span><span class="p">:</span> <span class="s">"N"</span><span class="p">,</span>
    <span class="s">"त"</span><span class="p">:</span> <span class="s">"w"</span><span class="p">,</span>
    <span class="s">"थ"</span><span class="p">:</span> <span class="s">"W"</span><span class="p">,</span>
    <span class="s">"द"</span><span class="p">:</span> <span class="s">"x"</span><span class="p">,</span>
    <span class="s">"ध"</span><span class="p">:</span> <span class="s">"X"</span><span class="p">,</span>
    <span class="s">"न"</span><span class="p">:</span> <span class="s">"n"</span><span class="p">,</span>
    <span class="s">"प"</span><span class="p">:</span> <span class="s">"p"</span><span class="p">,</span>
    <span class="s">"फ"</span><span class="p">:</span> <span class="s">"P"</span><span class="p">,</span>
    <span class="s">"ब"</span><span class="p">:</span> <span class="s">"b"</span><span class="p">,</span>
    <span class="s">"भ"</span><span class="p">:</span> <span class="s">"B"</span><span class="p">,</span>
    <span class="s">"म"</span><span class="p">:</span> <span class="s">"m"</span><span class="p">,</span>
    <span class="s">"य"</span><span class="p">:</span> <span class="s">"y"</span><span class="p">,</span>
    <span class="s">"र"</span><span class="p">:</span> <span class="s">"r"</span><span class="p">,</span>
    <span class="s">"ल"</span><span class="p">:</span> <span class="s">"l"</span><span class="p">,</span>
    <span class="s">"व"</span><span class="p">:</span> <span class="s">"v"</span><span class="p">,</span>
    <span class="s">"श"</span><span class="p">:</span> <span class="s">"S"</span><span class="p">,</span>
    <span class="s">"ष"</span><span class="p">:</span> <span class="s">"R"</span><span class="p">,</span>
    <span class="s">"स"</span><span class="p">:</span> <span class="s">"s"</span><span class="p">,</span>
    <span class="s">"ह"</span><span class="p">:</span> <span class="s">"h"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">hin2wx_all</span> <span class="o">=</span> <span class="p">{</span>
    <span class="o">**</span><span class="n">hin2wx_vowels</span><span class="p">,</span> <span class="o">**</span><span class="n">hin2wx_anuswara</span><span class="p">,</span>
    <span class="o">**</span><span class="n">hin2wx_sonorants</span><span class="p">,</span> <span class="o">**</span><span class="n">hin2wx_consonants</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Now, we’ll define the Hindi to ASCII conversion function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">is_vowel_hin</span><span class="p">(</span><span class="n">char</span><span class="p">):</span>
    <span class="s">"""
    Checks if the character is a vowel.
    """</span>
    <span class="k">if</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">hin2wx_anuswara</span> <span class="ow">or</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">hin2wx_vowels</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span>


<span class="k">def</span> <span class="nf">hin2wx</span><span class="p">(</span><span class="n">hin_string</span><span class="p">):</span>
    <span class="s">"""
    Converts the Hindi string to the WX string.

    This function goes through each character from the hin_string and
    maps it to a corresponding Roman character according to the
    Devanagari to Roman character mapping defined previously.
    """</span>
    <span class="n">wx_string</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">current_char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">hin_string</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
        <span class="c1"># skipping over the character as it's not included
</span>        <span class="c1"># in the mapping
</span>        <span class="k">if</span> <span class="n">current_char</span> <span class="o">==</span> <span class="s">"्"</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="c1"># get the Roman character for the Devanagari character
</span>        <span class="n">wx_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">hin2wx_all</span><span class="p">[</span><span class="n">current_char</span><span class="p">])</span>

        <span class="c1"># Handling of "a" sound after a consonant if the next
</span>        <span class="c1"># character is not "्" which makes the previous character half
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">is_vowel_hin</span><span class="p">(</span><span class="n">current_char</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">hin_string</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="s">"्"</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">is_vowel_hin</span><span class="p">(</span><span class="n">hin_string</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]):</span>
                <span class="n">wx_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">hin2wx_all</span><span class="p">[</span><span class="s">"अ"</span><span class="p">])</span>

    <span class="n">wx_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">hin2wx_all</span><span class="p">[</span><span class="n">hin_string</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]])</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">is_vowel_hin</span><span class="p">(</span><span class="n">hin_string</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
        <span class="n">wx_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">hin2wx_all</span><span class="p">[</span><span class="s">"अ"</span><span class="p">])</span>

    <span class="n">wx_string</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">wx_string</span><span class="p">)</span>

    <span class="c1"># consonant + anuswara should be replaced by
</span>    <span class="c1"># consonant + "a" sound + anuswara
</span>    <span class="n">reg1</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">"([kKgGfcCjJFtTdDNwWxXnpPbBmyrlvSRsh])M"</span><span class="p">)</span>
    <span class="n">wx_string</span> <span class="o">=</span> <span class="n">reg1</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"\g&lt;1&gt;aM"</span><span class="p">,</span> <span class="n">wx_string</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">wx_string</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Let’s evaluate our conversion function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">pairs</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"शहरों"</span><span class="p">,</span> <span class="s">"SaharoM"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"खूबसूरत"</span><span class="p">,</span> <span class="s">"KUbasUrawa"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"बैंगलोर"</span><span class="p">,</span> <span class="s">"bEMgalora"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"कोलकाता"</span><span class="p">,</span> <span class="s">"kolakAwA"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"हैदराबाद"</span><span class="p">,</span> <span class="s">"hExarAbAxa"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"कोझिकोडे"</span><span class="p">,</span> <span class="s">"koJikode"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"सफर"</span><span class="p">,</span> <span class="s">"saPara"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"उसमे"</span><span class="p">,</span> <span class="s">"usame"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"संभावनाओं"</span><span class="p">,</span> <span class="s">"saMBAvanAoM"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"मुंबई"</span><span class="p">,</span> <span class="s">"muMbaI"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"नई"</span><span class="p">,</span> <span class="s">"naI"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"मंगलवार"</span><span class="p">,</span> <span class="s">"maMgalavAra"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"घंटे"</span><span class="p">,</span> <span class="s">"GaMte"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"ट्रंप"</span><span class="p">,</span> <span class="s">"traMpa"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"डोनाल्ड"</span><span class="p">,</span> <span class="s">"donAlda"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"स्टेट"</span><span class="p">,</span> <span class="s">"steta"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"संगठन"</span><span class="p">,</span> <span class="s">"saMgaTana"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"प्रतिबंध"</span><span class="p">,</span> <span class="s">"prawibaMXa"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"एंड"</span><span class="p">,</span> <span class="s">"eMda"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"अंदेशे"</span><span class="p">,</span> <span class="s">"aMxeSe"</span><span class="p">)</span>
<span class="p">]</span>

<span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pairs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">,</span> <span class="s">"Actual WX"</span><span class="p">])</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Our WX"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">hin2wx</span><span class="p">)</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Both WX eq?"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Actual WX"</span><span class="p">]</span> <span class="o">==</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Our WX"</span><span class="p">]</span>
<span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="n">test_df</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="rendered_html">
    <table style="margin-left: 0;">
      <thead>
        <tr>
          <th></th>
          <th>Hindi String</th>
          <th>Actual WX</th>
          <th>Our WX</th>
          <th>Both WX eq?</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>1</th>
          <td>शहरों</td>
          <td>SaharoM</td>
          <td>SaharoM</td>
          <td>True</td>
        </tr>
        <tr>
          <th>2</th>
          <td>खूबसूरत</td>
          <td>KUbasUrawa</td>
          <td>KUbasUrawa</td>
          <td>True</td>
        </tr>
        <tr>
          <th>3</th>
          <td>बैंगलोर</td>
          <td>bEMgalora</td>
          <td>bEMgalora</td>
          <td>True</td>
        </tr>
        <tr>
          <th>4</th>
          <td>कोलकाता</td>
          <td>kolakAwA</td>
          <td>kolakAwA</td>
          <td>True</td>
        </tr>
        <tr>
          <th>5</th>
          <td>हैदराबाद</td>
          <td>hExarAbAxa</td>
          <td>hExarAbAxa</td>
          <td>True</td>
        </tr>
        <tr>
          <th>6</th>
          <td>कोझिकोडे</td>
          <td>koJikode</td>
          <td>koJikode</td>
          <td>True</td>
        </tr>
        <tr>
          <th>7</th>
          <td>सफर</td>
          <td>saPara</td>
          <td>saPara</td>
          <td>True</td>
        </tr>
        <tr>
          <th>8</th>
          <td>उसमे</td>
          <td>usame</td>
          <td>usame</td>
          <td>True</td>
        </tr>
        <tr>
          <th>9</th>
          <td>संभावनाओं</td>
          <td>saMBAvanAoM</td>
          <td>saMBAvanAoM</td>
          <td>True</td>
        </tr>
        <tr>
          <th>10</th>
          <td>मुंबई</td>
          <td>muMbaI</td>
          <td>muMbI</td>
          <td>False</td>
        </tr>
        <tr>
          <th>11</th>
          <td>नई</td>
          <td>naI</td>
          <td>nI</td>
          <td>False</td>
        </tr>
        <tr>
          <th>12</th>
          <td>मंगलवार</td>
          <td>maMgalavAra</td>
          <td>maMgalavAra</td>
          <td>True</td>
        </tr>
        <tr>
          <th>13</th>
          <td>घंटे</td>
          <td>GaMte</td>
          <td>GaMte</td>
          <td>True</td>
        </tr>
        <tr>
          <th>14</th>
          <td>ट्रंप</td>
          <td>traMpa</td>
          <td>traMpa</td>
          <td>True</td>
        </tr>
        <tr>
          <th>15</th>
          <td>डोनाल्ड</td>
          <td>donAlda</td>
          <td>donAlda</td>
          <td>True</td>
        </tr>
        <tr>
          <th>16</th>
          <td>स्टेट</td>
          <td>steta</td>
          <td>steta</td>
          <td>True</td>
        </tr>
        <tr>
          <th>17</th>
          <td>संगठन</td>
          <td>saMgaTana</td>
          <td>saMgaTana</td>
          <td>True</td>
        </tr>
        <tr>
          <th>18</th>
          <td>प्रतिबंध</td>
          <td>prawibaMXa</td>
          <td>prawibaMXa</td>
          <td>True</td>
        </tr>
        <tr>
          <th>19</th>
          <td>एंड</td>
          <td>eMda</td>
          <td>eMda</td>
          <td>True</td>
        </tr>
        <tr>
          <th>20</th>
          <td>अंदेशे</td>
          <td>aMxeSe</td>
          <td>aMxeSe</td>
          <td>True</td>
        </tr>
      </tbody>
    </table>
</div>

<p>As you can see, our conversion function converts most of the cases correctly. I have deliberately left 2 cases (मुंबई and नई) unhandled to show that this function is incomplete. In both, a consonant is directly followed by an independent vowel, so the inherent “a” sound is dropped. Just like I handled the anuswara case, this and other vowel-related cases need to be handled. Further, there are more characters which are not included in the mapping. I only wanted to show how a WX conversion function works based on the provided mapping.</p>
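
<p>One possible way to cover the two failing rows is to keep the inherent “a” after a consonant unless a dependent vowel sign (matra) or the halant “्” follows. The sketch below is not the full WX algorithm; it is a trimmed-down illustration under that single assumption. The names <code class="language-plaintext highlighter-rouge">hin2wx_sketch</code>, <code class="language-plaintext highlighter-rouge">CONSONANTS</code>, <code class="language-plaintext highlighter-rouge">MATRAS</code> and the others are hypothetical, and the mapping contains only the characters needed for the two words.</p>

```python
# Hypothetical, trimmed-down mapping: just enough characters for the
# two failing words. These names are illustrative only.
CONSONANTS = {"न": "n", "म": "m", "ब": "b"}
INDEPENDENT_VOWELS = {"ई": "I"}
MATRAS = {"ु": "u"}   # dependent vowel signs attached to a consonant
ANUSWARA = {"ं": "M"}
VIRAMA = "्"          # halant: suppresses the inherent "a" of a consonant
OTHERS = {**INDEPENDENT_VOWELS, **MATRAS, **ANUSWARA}


def hin2wx_sketch(hin_string):
    """Convert a Hindi string to WX, keeping the inherent "a" of a
    consonant unless a matra or a halant follows it."""
    wx_chars = []
    for i, char in enumerate(hin_string):
        if char == VIRAMA:
            continue
        if char in CONSONANTS:
            wx_chars.append(CONSONANTS[char])
            # Slicing yields "" when the consonant is the last character.
            next_char = hin_string[i + 1:i + 2]
            if next_char not in MATRAS and next_char != VIRAMA:
                wx_chars.append("a")
        else:
            wx_chars.append(OTHERS[char])
    return "".join(wx_chars)


for word in ["नई", "मुंबई"]:
    print(word, hin2wx_sketch(word))
```

<p>With this rule the anuswara case no longer needs a separate regular expression either, because the anuswara is neither a matra nor a halant, so the “a” is kept before it.</p>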

<h3 id="wx-to-hindi">WX to Hindi</h3>

<p>Let’s do the reverse now: converting WX to Hindi. We’ll start by creating our reverse mapping.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">wx2hin_vowels</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"a"</span><span class="p">:</span> <span class="s">"अ"</span><span class="p">,</span>
    <span class="s">"A"</span><span class="p">:</span> <span class="s">"आ"</span><span class="p">,</span>
    <span class="s">"i"</span><span class="p">:</span> <span class="s">"इ"</span><span class="p">,</span>
    <span class="s">"I"</span><span class="p">:</span> <span class="s">"ई"</span><span class="p">,</span>
    <span class="s">"u"</span><span class="p">:</span> <span class="s">"उ"</span><span class="p">,</span>
    <span class="s">"U"</span><span class="p">:</span> <span class="s">"ऊ"</span><span class="p">,</span>
    <span class="s">"e"</span><span class="p">:</span> <span class="s">"ए"</span><span class="p">,</span>
    <span class="s">"E"</span><span class="p">:</span> <span class="s">"ऐ"</span><span class="p">,</span>
    <span class="s">"o"</span><span class="p">:</span> <span class="s">"ओ"</span><span class="p">,</span>
    <span class="s">"O"</span><span class="p">:</span> <span class="s">"औ"</span>
<span class="p">}</span>
<span class="n">wx2hin_vowels_half</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"A"</span><span class="p">:</span> <span class="s">"ा"</span><span class="p">,</span>
    <span class="s">"e"</span><span class="p">:</span> <span class="s">"े"</span><span class="p">,</span>
    <span class="s">"E"</span><span class="p">:</span> <span class="s">"ै"</span><span class="p">,</span>
    <span class="s">"i"</span><span class="p">:</span> <span class="s">"ि"</span><span class="p">,</span>
    <span class="s">"I"</span><span class="p">:</span> <span class="s">"ी"</span><span class="p">,</span>
    <span class="s">"o"</span><span class="p">:</span> <span class="s">"ो"</span><span class="p">,</span>
    <span class="s">"U"</span><span class="p">:</span> <span class="s">"ू"</span><span class="p">,</span>
    <span class="s">"u"</span><span class="p">:</span> <span class="s">"ु"</span>
<span class="p">}</span>
<span class="n">wx2hin_sonorants</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"q"</span><span class="p">:</span> <span class="s">"ऋ"</span><span class="p">,</span>
    <span class="s">"Q"</span><span class="p">:</span> <span class="s">"ॠ"</span><span class="p">,</span>
    <span class="s">"L"</span><span class="p">:</span> <span class="s">"ऌ"</span>
<span class="p">}</span>
<span class="n">wx2hin_anuswara</span> <span class="o">=</span> <span class="p">{</span><span class="s">"M"</span><span class="p">:</span> <span class="s">"अं"</span><span class="p">}</span>
<span class="n">wx2hin_anuswara_half</span> <span class="o">=</span> <span class="p">{</span><span class="s">"M"</span><span class="p">:</span> <span class="s">"ं"</span><span class="p">}</span>
<span class="n">wx2hin_consonants</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"k"</span><span class="p">:</span> <span class="s">"क"</span><span class="p">,</span>
    <span class="s">"K"</span><span class="p">:</span> <span class="s">"ख"</span><span class="p">,</span>
    <span class="s">"g"</span><span class="p">:</span> <span class="s">"ग"</span><span class="p">,</span>
    <span class="s">"G"</span><span class="p">:</span> <span class="s">"घ"</span><span class="p">,</span>
    <span class="s">"f"</span><span class="p">:</span> <span class="s">"ङ"</span><span class="p">,</span>
    <span class="s">"c"</span><span class="p">:</span> <span class="s">"च"</span><span class="p">,</span>
    <span class="s">"C"</span><span class="p">:</span> <span class="s">"छ"</span><span class="p">,</span>
    <span class="s">"j"</span><span class="p">:</span> <span class="s">"ज"</span><span class="p">,</span>
    <span class="s">"J"</span><span class="p">:</span> <span class="s">"झ"</span><span class="p">,</span>
    <span class="s">"F"</span><span class="p">:</span> <span class="s">"ञ"</span><span class="p">,</span>
    <span class="s">"t"</span><span class="p">:</span> <span class="s">"ट"</span><span class="p">,</span>
    <span class="s">"T"</span><span class="p">:</span> <span class="s">"ठ"</span><span class="p">,</span>
    <span class="s">"d"</span><span class="p">:</span> <span class="s">"ड"</span><span class="p">,</span>
    <span class="s">"D"</span><span class="p">:</span> <span class="s">"ढ"</span><span class="p">,</span>
    <span class="s">"N"</span><span class="p">:</span> <span class="s">"ण"</span><span class="p">,</span>
    <span class="s">"w"</span><span class="p">:</span> <span class="s">"त"</span><span class="p">,</span>
    <span class="s">"W"</span><span class="p">:</span> <span class="s">"थ"</span><span class="p">,</span>
    <span class="s">"x"</span><span class="p">:</span> <span class="s">"द"</span><span class="p">,</span>
    <span class="s">"X"</span><span class="p">:</span> <span class="s">"ध"</span><span class="p">,</span>
    <span class="s">"n"</span><span class="p">:</span> <span class="s">"न"</span><span class="p">,</span>
    <span class="s">"p"</span><span class="p">:</span> <span class="s">"प"</span><span class="p">,</span>
    <span class="s">"P"</span><span class="p">:</span> <span class="s">"फ"</span><span class="p">,</span>
    <span class="s">"b"</span><span class="p">:</span> <span class="s">"ब"</span><span class="p">,</span>
    <span class="s">"B"</span><span class="p">:</span> <span class="s">"भ"</span><span class="p">,</span>
    <span class="s">"m"</span><span class="p">:</span> <span class="s">"म"</span><span class="p">,</span>
    <span class="s">"y"</span><span class="p">:</span> <span class="s">"य"</span><span class="p">,</span>
    <span class="s">"r"</span><span class="p">:</span> <span class="s">"र"</span><span class="p">,</span>
    <span class="s">"l"</span><span class="p">:</span> <span class="s">"ल"</span><span class="p">,</span>
    <span class="s">"v"</span><span class="p">:</span> <span class="s">"व"</span><span class="p">,</span>
    <span class="s">"S"</span><span class="p">:</span> <span class="s">"श"</span><span class="p">,</span>
    <span class="s">"R"</span><span class="p">:</span> <span class="s">"ष"</span><span class="p">,</span>
    <span class="s">"s"</span><span class="p">:</span> <span class="s">"स"</span><span class="p">,</span>
    <span class="s">"h"</span><span class="p">:</span> <span class="s">"ह"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">wx2hin_all</span> <span class="o">=</span> <span class="p">{</span>
    <span class="o">**</span><span class="n">wx2hin_vowels</span><span class="p">,</span>
    <span class="o">**</span><span class="n">wx2hin_vowels_half</span><span class="p">,</span>
    <span class="o">**</span><span class="n">wx2hin_sonorants</span><span class="p">,</span>
    <span class="o">**</span><span class="n">wx2hin_anuswara</span><span class="p">,</span>
    <span class="o">**</span><span class="n">wx2hin_anuswara_half</span><span class="p">,</span>
    <span class="o">**</span><span class="n">wx2hin_consonants</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>
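
<p>One detail of the merged table is worth noting: when dictionaries share keys, the one unpacked later wins. The sketch below uses reduced stand-in tables (the names <code class="language-plaintext highlighter-rouge">vowels</code>, <code class="language-plaintext highlighter-rouge">vowels_half</code> and <code class="language-plaintext highlighter-rouge">consonants</code> are shortened for illustration) to show that a merged lookup for a vowel key returns the short form, which is why the merged table is only safe for keys that appear in a single table, such as the consonants.</p>

```python
# Reduced copies of the tables above: just enough entries to show the effect.
vowels = {"e": "ए", "o": "ओ"}
vowels_half = {"e": "े", "o": "ो"}
consonants = {"k": "क"}

# Later unpacked dictionaries override earlier ones on duplicate keys.
merged = {**vowels, **vowels_half, **consonants}

# True: the short (dependent) form replaced the full vowel for the key "e".
print(merged["e"] == vowels_half["e"])
# Keys that exist in only one table, like the consonants, are unaffected.
print(merged["k"] == consonants["k"])
```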

<p>As before, we’ll now define the ASCII to Hindi conversion function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">is_vowel_wx</span><span class="p">(</span><span class="n">char</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">char</span> <span class="ow">in</span> <span class="p">{</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"A"</span><span class="p">,</span> <span class="s">"e"</span><span class="p">,</span> <span class="s">"E"</span><span class="p">,</span> <span class="s">"i"</span><span class="p">,</span> <span class="s">"I"</span><span class="p">,</span> <span class="s">"o"</span><span class="p">,</span> <span class="s">"O"</span><span class="p">,</span> <span class="s">"u"</span><span class="p">,</span> <span class="s">"U"</span><span class="p">,</span> <span class="s">"M"</span><span class="p">}:</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="bp">False</span>


<span class="k">def</span> <span class="nf">wx2hin</span><span class="p">(</span><span class="n">wx_string</span><span class="p">):</span>
    <span class="s">"""
    Converts the WX string to the Hindi string.

    This function goes through each character from the wx_string and
    maps it to a corresponding Devanagari character according to the
    Roman to Devanagari character mapping defined previously.
    """</span>
    <span class="n">wx_string</span> <span class="o">+=</span> <span class="s">" "</span>
    <span class="n">hin_string</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">roman_char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">wx_string</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
        <span class="k">if</span> <span class="n">is_vowel_wx</span><span class="p">(</span><span class="n">roman_char</span><span class="p">):</span>
            <span class="c1"># If current character is "a" and not the first character
</span>            <span class="c1"># then skip
</span>            <span class="k">if</span> <span class="n">roman_char</span> <span class="o">==</span> <span class="s">"a"</span> <span class="ow">and</span> <span class="n">i</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">continue</span>

            <span class="k">if</span> <span class="n">roman_char</span> <span class="o">==</span> <span class="s">"M"</span><span class="p">:</span>
                <span class="n">hin_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">wx2hin_anuswara_half</span><span class="p">[</span><span class="n">roman_char</span><span class="p">])</span>
            <span class="k">elif</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">wx_string</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s">"a"</span><span class="p">:</span>
                <span class="n">hin_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">wx2hin_vowels</span><span class="p">[</span><span class="n">roman_char</span><span class="p">])</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">hin_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">wx2hin_vowels_half</span><span class="p">[</span><span class="n">roman_char</span><span class="p">])</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">hin_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">wx2hin_all</span><span class="p">[</span><span class="n">roman_char</span><span class="p">])</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">is_vowel_wx</span><span class="p">(</span><span class="n">wx_string</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span> <span class="ow">and</span> <span class="n">wx_string</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="s">" "</span><span class="p">:</span>
                <span class="n">hin_string</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"्"</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">hin_string</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>And now, let’s evaluate our reverse conversion function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pairs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">,</span> <span class="s">"Actual WX"</span><span class="p">])</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Our Hin"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Actual WX"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">wx2hin</span><span class="p">)</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Both Hin eq?"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">]</span> <span class="o">==</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Our Hin"</span><span class="p">]</span>
<span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">test_df</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="rendered_html">
    <table style="margin-left: 0;">
  <thead>
    <tr>
      <th></th>
      <th>Hindi String</th>
      <th>Actual WX</th>
      <th>Our Hin</th>
      <th>Both Hin eq?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>शहरों</td>
      <td>SaharoM</td>
      <td>शहरों</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>खूबसूरत</td>
      <td>KUbasUrawa</td>
      <td>खूबसूरत</td>
      <td>True</td>
    </tr>
    <tr>
      <th>3</th>
      <td>बैंगलोर</td>
      <td>bEMgalora</td>
      <td>बैंगलोर</td>
      <td>True</td>
    </tr>
    <tr>
      <th>4</th>
      <td>कोलकाता</td>
      <td>kolakAwA</td>
      <td>कोलकाता</td>
      <td>True</td>
    </tr>
    <tr>
      <th>5</th>
      <td>हैदराबाद</td>
      <td>hExarAbAxa</td>
      <td>हैदराबाद</td>
      <td>True</td>
    </tr>
    <tr>
      <th>6</th>
      <td>कोझिकोडे</td>
      <td>koJikode</td>
      <td>कोझिकोडे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>7</th>
      <td>सफर</td>
      <td>saPara</td>
      <td>सफर</td>
      <td>True</td>
    </tr>
    <tr>
      <th>8</th>
      <td>उसमे</td>
      <td>usame</td>
      <td>उसमे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>9</th>
      <td>संभावनाओं</td>
      <td>saMBAvanAoM</td>
      <td>संभावनाों</td>
      <td>False</td>
    </tr>
    <tr>
      <th>10</th>
      <td>मुंबई</td>
      <td>muMbaI</td>
      <td>मुंबई</td>
      <td>True</td>
    </tr>
    <tr>
      <th>11</th>
      <td>नई</td>
      <td>naI</td>
      <td>नई</td>
      <td>True</td>
    </tr>
    <tr>
      <th>12</th>
      <td>मंगलवार</td>
      <td>maMgalavAra</td>
      <td>मंगलवार</td>
      <td>True</td>
    </tr>
    <tr>
      <th>13</th>
      <td>घंटे</td>
      <td>GaMte</td>
      <td>घंटे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>14</th>
      <td>ट्रंप</td>
      <td>traMpa</td>
      <td>ट्रंप</td>
      <td>True</td>
    </tr>
    <tr>
      <th>15</th>
      <td>डोनाल्ड</td>
      <td>donAlda</td>
      <td>डोनाल्ड</td>
      <td>True</td>
    </tr>
    <tr>
      <th>16</th>
      <td>स्टेट</td>
      <td>steta</td>
      <td>स्टेट</td>
      <td>True</td>
    </tr>
    <tr>
      <th>17</th>
      <td>संगठन</td>
      <td>saMgaTana</td>
      <td>संगठन</td>
      <td>True</td>
    </tr>
    <tr>
      <th>18</th>
      <td>प्रतिबंध</td>
      <td>prawibaMXa</td>
      <td>प्रतिबंध</td>
      <td>True</td>
    </tr>
    <tr>
      <th>19</th>
      <td>एंड</td>
      <td>eMda</td>
      <td>एंड</td>
      <td>True</td>
    </tr>
    <tr>
      <th>20</th>
      <td>अंदेशे</td>
      <td>aMxeSe</td>
      <td>अंदेशे</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>

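<p>Only one case failed, because the function does not choose correctly between the full (independent) and short (dependent) vowel forms. It emits the full form only at the start of the string or right after <code class="language-plaintext highlighter-rouge">a</code>, so in <code class="language-plaintext highlighter-rouge">saMBAvanAoM</code> the <code class="language-plaintext highlighter-rouge">o</code> that follows <code class="language-plaintext highlighter-rouge">A</code> becomes the short form ो instead of ओ. There will be many such cases, so this <code class="language-plaintext highlighter-rouge">wx2hin</code> conversion function is incomplete: it is a toy implementation meant only to show how the conversion works.</p>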
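
<p>One possible way to handle the failing case is to pick the vowel form from the previous character: a vowel right after a consonant takes its short form (and <code class="language-plaintext highlighter-rouge">a</code> is dropped as the inherent vowel), while a vowel anywhere else takes its full form. The sketch below is only an illustration of that rule, not the approach a complete library takes. The name <code class="language-plaintext highlighter-rouge">wx2hin_sketch</code> and the reduced tables are assumptions made for this example, and the tables cover only the characters needed for the demo words.</p>

```python
# Hypothetical, reduced tables: only the characters needed for the demo words.
INDEPENDENT = {"a": "अ", "A": "आ", "i": "इ", "I": "ई", "u": "उ",
               "U": "ऊ", "e": "ए", "E": "ऐ", "o": "ओ", "O": "औ"}
MATRA = {"A": "ा", "i": "ि", "I": "ी", "u": "ु", "U": "ू",
         "e": "े", "E": "ै", "o": "ो", "O": "ौ"}
CONSONANT = {"s": "स", "B": "भ", "v": "व", "n": "न", "m": "म", "b": "ब"}
ANUSVARA = "ं"
HALANT = "्"


def wx2hin_sketch(wx_string):
    """Convert a WX string to Devanagari, choosing vowel forms by context."""
    out = []
    for i, ch in enumerate(wx_string):
        prev = wx_string[i - 1] if i else ""
        nxt = wx_string[i + 1] if i + 1 != len(wx_string) else ""
        if ch == "M":
            out.append(ANUSVARA)
        elif ch in INDEPENDENT:
            if prev in CONSONANT:
                # After a consonant, "a" is the inherent vowel (nothing is
                # written); every other vowel takes its short form.
                if ch != "a":
                    out.append(MATRA[ch])
            else:
                # At the start, or after another vowel or the anusvara,
                # the full form is used.
                out.append(INDEPENDENT[ch])
        else:
            out.append(CONSONANT[ch])
            # Two consonants in a row form a conjunct, joined by a halant.
            if nxt in CONSONANT:
                out.append(HALANT)
    return "".join(out)


for word in ["saMBAvanAoM", "muMbaI", "naI"]:
    print(word, wx2hin_sketch(word))
```

<p>With this rule, the <code class="language-plaintext highlighter-rouge">o</code> in <code class="language-plaintext highlighter-rouge">saMBAvanAoM</code> follows a vowel rather than a consonant, so it is written in its full form. Other gaps, such as sonorants and characters outside the tables, are still not handled.</p>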

<h2 id="wx-implementation-">WX implementation <a name="wx"></a></h2>

<p>The complete implementation of the conversion between Devanagari and WX, in both directions, can be found in the <a href="https://github.com/irshadbhat/indic-wx-converter/">wxconv</a> library. It also handles many other Indic languages. Let’s try it out.</p>

<h3 id="hindi-to-wx-1">Hindi to WX</h3>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="kn">from</span> <span class="nn">wxconv</span> <span class="kn">import</span> <span class="n">WXC</span>

<span class="n">hin2wx</span> <span class="o">=</span> <span class="n">WXC</span><span class="p">(</span><span class="n">order</span><span class="o">=</span><span class="s">'utf2wx'</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s">"hin"</span><span class="p">).</span><span class="n">convert</span>

<span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pairs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">,</span> <span class="s">"Actual WX"</span><span class="p">])</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Our WX"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">hin2wx</span><span class="p">)</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Both WX eq?"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Actual WX"</span><span class="p">]</span> <span class="o">==</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Our WX"</span><span class="p">]</span>
<span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">test_df</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="rendered_html">
    <table style="margin-left: 0;">
  <thead>
    <tr>
      <th></th>
      <th>Hindi String</th>
      <th>Actual WX</th>
      <th>Our WX</th>
      <th>Both WX eq?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>शहरों</td>
      <td>SaharoM</td>
      <td>SaharoM</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>खूबसूरत</td>
      <td>KUbasUrawa</td>
      <td>KUbasUrawa</td>
      <td>True</td>
    </tr>
    <tr>
      <th>3</th>
      <td>बैंगलोर</td>
      <td>bEMgalora</td>
      <td>bEMgalora</td>
      <td>True</td>
    </tr>
    <tr>
      <th>4</th>
      <td>कोलकाता</td>
      <td>kolakAwA</td>
      <td>kolakAwA</td>
      <td>True</td>
    </tr>
    <tr>
      <th>5</th>
      <td>हैदराबाद</td>
      <td>hExarAbAxa</td>
      <td>hExarAbAxa</td>
      <td>True</td>
    </tr>
    <tr>
      <th>6</th>
      <td>कोझिकोडे</td>
      <td>koJikode</td>
      <td>koJikode</td>
      <td>True</td>
    </tr>
    <tr>
      <th>7</th>
      <td>सफर</td>
      <td>saPara</td>
      <td>saPara</td>
      <td>True</td>
    </tr>
    <tr>
      <th>8</th>
      <td>उसमे</td>
      <td>usame</td>
      <td>usame</td>
      <td>True</td>
    </tr>
    <tr>
      <th>9</th>
      <td>संभावनाओं</td>
      <td>saMBAvanAoM</td>
      <td>saMBAvanAoM</td>
      <td>True</td>
    </tr>
    <tr>
      <th>10</th>
      <td>मुंबई</td>
      <td>muMbaI</td>
      <td>muMbaI</td>
      <td>True</td>
    </tr>
    <tr>
      <th>11</th>
      <td>नई</td>
      <td>naI</td>
      <td>naI</td>
      <td>True</td>
    </tr>
    <tr>
      <th>12</th>
      <td>मंगलवार</td>
      <td>maMgalavAra</td>
      <td>maMgalavAra</td>
      <td>True</td>
    </tr>
    <tr>
      <th>13</th>
      <td>घंटे</td>
      <td>GaMte</td>
      <td>GaMte</td>
      <td>True</td>
    </tr>
    <tr>
      <th>14</th>
      <td>ट्रंप</td>
      <td>traMpa</td>
      <td>traMpa</td>
      <td>True</td>
    </tr>
    <tr>
      <th>15</th>
      <td>डोनाल्ड</td>
      <td>donAlda</td>
      <td>donAlda</td>
      <td>True</td>
    </tr>
    <tr>
      <th>16</th>
      <td>स्टेट</td>
      <td>steta</td>
      <td>steta</td>
      <td>True</td>
    </tr>
    <tr>
      <th>17</th>
      <td>संगठन</td>
      <td>saMgaTana</td>
      <td>saMgaTana</td>
      <td>True</td>
    </tr>
    <tr>
      <th>18</th>
      <td>प्रतिबंध</td>
      <td>prawibaMXa</td>
      <td>prawibaMXa</td>
      <td>True</td>
    </tr>
    <tr>
      <th>19</th>
      <td>एंड</td>
      <td>eMda</td>
      <td>eMda</td>
      <td>True</td>
    </tr>
    <tr>
      <th>20</th>
      <td>अंदेशे</td>
      <td>aMxeSe</td>
      <td>aMxeSe</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>

<h3 id="wx-to-hindi-1">WX to Hindi</h3>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="n">wx2hin</span> <span class="o">=</span> <span class="n">WXC</span><span class="p">(</span><span class="n">order</span><span class="o">=</span><span class="s">'wx2utf'</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s">"hin"</span><span class="p">).</span><span class="n">convert</span>
<span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pairs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">,</span> <span class="s">"Actual WX"</span><span class="p">])</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Our Hin"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Actual WX"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">wx2hin</span><span class="p">)</span>
<span class="n">test_df</span><span class="p">[</span><span class="s">"Both Hin eq?"</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Hindi String"</span><span class="p">]</span> <span class="o">==</span> <span class="n">test_df</span><span class="p">[</span><span class="s">"Our Hin"</span><span class="p">]</span>
<span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">test_df</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">test_df</span>
</pre></td></tr></tbody></table></code></pre></figure>

<div class="rendered_html">
    <table style="margin-left: 0;">

  <thead>
    <tr>
      <th></th>
      <th>Hindi String</th>
      <th>Actual WX</th>
      <th>Our Hin</th>
      <th>Both Hin eq?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>शहरों</td>
      <td>SaharoM</td>
      <td>शहरों</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>खूबसूरत</td>
      <td>KUbasUrawa</td>
      <td>खूबसूरत</td>
      <td>True</td>
    </tr>
    <tr>
      <th>3</th>
      <td>बैंगलोर</td>
      <td>bEMgalora</td>
      <td>बैंगलोर</td>
      <td>True</td>
    </tr>
    <tr>
      <th>4</th>
      <td>कोलकाता</td>
      <td>kolakAwA</td>
      <td>कोलकाता</td>
      <td>True</td>
    </tr>
    <tr>
      <th>5</th>
      <td>हैदराबाद</td>
      <td>hExarAbAxa</td>
      <td>हैदराबाद</td>
      <td>True</td>
    </tr>
    <tr>
      <th>6</th>
      <td>कोझिकोडे</td>
      <td>koJikode</td>
      <td>कोझिकोडे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>7</th>
      <td>सफर</td>
      <td>saPara</td>
      <td>सफर</td>
      <td>True</td>
    </tr>
    <tr>
      <th>8</th>
      <td>उसमे</td>
      <td>usame</td>
      <td>उसमे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>9</th>
      <td>संभावनाओं</td>
      <td>saMBAvanAoM</td>
      <td>संभावनाओं</td>
      <td>True</td>
    </tr>
    <tr>
      <th>10</th>
      <td>मुंबई</td>
      <td>muMbaI</td>
      <td>मुंबई</td>
      <td>True</td>
    </tr>
    <tr>
      <th>11</th>
      <td>नई</td>
      <td>naI</td>
      <td>नई</td>
      <td>True</td>
    </tr>
    <tr>
      <th>12</th>
      <td>मंगलवार</td>
      <td>maMgalavAra</td>
      <td>मंगलवार</td>
      <td>True</td>
    </tr>
    <tr>
      <th>13</th>
      <td>घंटे</td>
      <td>GaMte</td>
      <td>घंटे</td>
      <td>True</td>
    </tr>
    <tr>
      <th>14</th>
      <td>ट्रंप</td>
      <td>traMpa</td>
      <td>ट्रंप</td>
      <td>True</td>
    </tr>
    <tr>
      <th>15</th>
      <td>डोनाल्ड</td>
      <td>donAlda</td>
      <td>डोनाल्ड</td>
      <td>True</td>
    </tr>
    <tr>
      <th>16</th>
      <td>स्टेट</td>
      <td>steta</td>
      <td>स्टेट</td>
      <td>True</td>
    </tr>
    <tr>
      <th>17</th>
      <td>संगठन</td>
      <td>saMgaTana</td>
      <td>संगठन</td>
      <td>True</td>
    </tr>
    <tr>
      <th>18</th>
      <td>प्रतिबंध</td>
      <td>prawibaMXa</td>
      <td>प्रतिबंध</td>
      <td>True</td>
    </tr>
    <tr>
      <th>19</th>
      <td>एंड</td>
      <td>eMda</td>
      <td>एंड</td>
      <td>True</td>
    </tr>
    <tr>
      <th>20</th>
      <td>अंदेशे</td>
      <td>aMxeSe</td>
      <td>अंदेशे</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>

<p>As can be seen, every conversion is correct for the cases selected above.</p>

<p>Internally, this library has extensive mappings between Unicode and ISCII (and vice versa), and between ISCII and ASCII (and vice versa). To obtain the WX notation of a Hindi string using these conversion tables, the string is first converted to its ISCII representation and then from ISCII to ASCII.</p>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><category term="Hindi" /><summary type="html"><![CDATA[In this post, I’ll discuss the WX notation, which is used for computational processing of Indian languages. We’ll work with the Devanagari script, which has 47 primary characters - 14 vowels and 33 consonants. We’ll see how, using WX notation, we can convert from Devanagari Unicode characters to Roman ASCII characters. This process of converting between scripts is called transliteration. So WX notation is a transliteration scheme made specifically for NLP. Note that WX is not the same as the informal transliteration used in general conversations. Each word has only a single WX notation.]]></summary></entry><entry><title type="html">GCP with CNTK</title><link href="https://trigonaminima.github.io/2019/01/gcp-cntk/" rel="alternate" type="text/html" title="GCP with CNTK" /><published>2019-01-09T00:00:00+00:00</published><updated>2019-01-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2019/01/gcp-cntk</id><content type="html" xml:base="https://trigonaminima.github.io/2019/01/gcp-cntk/"><![CDATA[<p>If you try to install <a href="https://github.com/Microsoft/CNTK">CNTK</a> on a Linux machine and get it working, you are very lucky. I tried on three machines and was unsuccessful each time. A colleague suggested that I try a CNTK Docker image. Since I also needed a GPU, I decided to do it on <a href="https://cloud.google.com/">GCP</a>. I chose GCP because it gives you $300 worth of free credits valid for 12 months.</p>

<p>I want to give a rundown of the steps I followed. I also created a few scripts to do almost everything from the terminal.</p>

<h2 id="google-cloud-platform">Google Cloud Platform</h2>

<p>There are two ways of working with GCP: the GCP dashboard in your browser and the GCP CLI. I’ll describe the CLI route: once the VM is created you’ll be working in your terminal anyway, so you might as well do everything from there. That said, it is still helpful to look through the dashboard.</p>

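<p>Before following along, create a project using the GCP dashboard in your browser. This project will contain your VM instances. Go through the <a href="https://cloud.google.com/resource-manager/docs/creating-managing-projects">Creating and Managing Projects</a> page to create your first project.</p>

<p>Next, install the Cloud SDK, which provides the <code class="language-plaintext highlighter-rouge">gcloud</code> CLI, on your Debian/Ubuntu machine:</p>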

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Create environment variable for correct distribution</span>
<span class="nb">export </span><span class="nv">CLOUD_SDK_REPO</span><span class="o">=</span><span class="s2">"cloud-sdk-</span><span class="si">$(</span>lsb_release <span class="nt">-c</span> <span class="nt">-s</span><span class="si">)</span><span class="s2">"</span>

<span class="c"># Add the Cloud SDK distribution URI as a package source</span>
<span class="nb">echo</span> <span class="s2">"deb http://packages.cloud.google.com/apt </span><span class="nv">$CLOUD_SDK_REPO</span><span class="s2"> main"</span> | <span class="nb">sudo tee</span> <span class="se">\</span>
    <span class="nt">-a</span> /etc/apt/sources.list.d/google-cloud-sdk.list

<span class="c"># Import the Google Cloud Platform public key</span>
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | <span class="nb">sudo </span>apt-key add -

<span class="c"># Update the package list and install the Cloud SDK</span>
<span class="nb">sudo </span>apt update <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>apt-get <span class="nb">install </span>google-cloud-sdk</code></pre></figure>

<p>You might face an error with the first instruction if your distro is not a standard Debian/Ubuntu release. For example, I work on Elementary OS, which is based on Ubuntu, but <code class="language-plaintext highlighter-rouge">lsb_release -c -s</code> returned the Elementary OS release name rather than the Ubuntu one. So I found out which Ubuntu release my OS is based on and manually set the <code class="language-plaintext highlighter-rouge">CLOUD_SDK_REPO</code> variable using that name.</p>
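
<p>The manual lookup described above can be sketched in a few lines. This assumes that your distro’s <code class="language-plaintext highlighter-rouge">/etc/os-release</code> file exposes an <code class="language-plaintext highlighter-rouge">UBUNTU_CODENAME</code> entry, which many Ubuntu derivatives do but which is not guaranteed. The helper names <code class="language-plaintext highlighter-rouge">parse_os_release</code> and <code class="language-plaintext highlighter-rouge">cloud_sdk_repo</code> are made up for this example, and the sample text is illustrative rather than copied from a real machine.</p>

```python
def parse_os_release(text):
    """Parse KEY=value lines (the os-release format) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def cloud_sdk_repo(os_release_text):
    """Build the CLOUD_SDK_REPO value, preferring the Ubuntu base codename."""
    info = parse_os_release(os_release_text)
    # Assumption: derivatives list their Ubuntu base as UBUNTU_CODENAME.
    codename = info.get("UBUNTU_CODENAME") or info.get("VERSION_CODENAME")
    if not codename:
        raise ValueError("no codename found in os-release data")
    return "cloud-sdk-" + codename


# Illustrative sample, not copied from a real machine.
sample = '''NAME="elementary OS"
VERSION_CODENAME=juno
UBUNTU_CODENAME=bionic
'''
print(cloud_sdk_repo(sample))  # prints cloud-sdk-bionic
```

<p>On a real machine you would pass in the contents of <code class="language-plaintext highlighter-rouge">/etc/os-release</code> instead of the sample text, and then export the printed value as <code class="language-plaintext highlighter-rouge">CLOUD_SDK_REPO</code>.</p>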

<p>To set up everything on your machine, run the following command.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">gcloud init</code></pre></figure>

<p>Follow the instructions in the terminal. It’ll ask you to link the Google account associated with your GCP account, and then to choose defaults such as the project and the region/zone. Select any zone you want. If you don’t have a preference, select <code class="language-plaintext highlighter-rouge">us-west1-b</code>. This will give you all the GPU options available; some zones don’t have all of them.</p>

<p>This quick-start <a href="https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu">GCP documentation</a> shows what happens once you run the command. It also points to the relevant sections if you need to take care of other things, such as proxies.</p>

<p>To create and start a new VM instance, you first need to verify your payment method. During verification, GCP makes a test transaction (which they say is reverted after a few days; I haven’t received the refund yet, though). This verification just ensures that you are legit. This <a href="https://cloud.google.com/billing/docs/how-to/verify-bank">GCP documentation page</a> will help if you have any other questions.</p>

<p>Now we will create a VM using <code class="language-plaintext highlighter-rouge">gcloud</code> from our terminal.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
</pre></td><td class="code"><pre><span class="nb">echo</span> <span class="s2">"Current Google Cloud Projects with the linked Google account-"</span>
gcloud projects list

<span class="nb">export </span><span class="nv">PROJECT_NAME</span><span class="o">=</span><span class="s2">"gcp-project"</span>
<span class="nb">echo</span> <span class="s2">"Selected project - "</span><span class="nv">$PROJECT_NAME</span>

<span class="c"># export INSTANCE_NAME="my-fastai-instance"</span>
<span class="nb">export </span><span class="nv">INSTANCE_NAME</span><span class="o">=</span><span class="s2">"cntk-docker-vm"</span>

<span class="nb">export </span><span class="nv">ZONE</span><span class="o">=</span><span class="s2">"us-west2-b"</span> <span class="c"># budget: "us-west1-b"</span>

<span class="c"># budget: 'type=nvidia-tesla-k80'</span>
<span class="nb">export </span><span class="nv">INSTANCE_TYPE</span><span class="o">=</span><span class="s2">"n1-highmem-8"</span> <span class="c"># budget: "n1-highmem-4"</span>

<span class="c"># run "gcloud compute images list" to get a list of images and their families</span>
<span class="c"># or "pytorch-1-0-cpu-experimental" for non-GPU instances</span>
<span class="nb">export </span><span class="nv">IMAGE_FAMILY</span><span class="o">=</span><span class="s2">"pytorch-1-0-cu92-experimental"</span>

<span class="nb">export </span><span class="nv">HDD</span><span class="o">=</span><span class="s2">"500GB"</span>

gcloud compute instances create <span class="nv">$INSTANCE_NAME</span> <span class="se">\</span>
        <span class="nt">--zone</span><span class="o">=</span><span class="nv">$ZONE</span> <span class="se">\</span>
        <span class="nt">--image-family</span><span class="o">=</span><span class="nv">$IMAGE_FAMILY</span> <span class="se">\</span>
        <span class="nt">--image-project</span><span class="o">=</span>deeplearning-platform-release <span class="se">\</span>
        <span class="nt">--maintenance-policy</span><span class="o">=</span>TERMINATE <span class="se">\</span>
        <span class="nt">--accelerator</span><span class="o">=</span><span class="s2">"type=nvidia-tesla-p4,count=1"</span> <span class="se">\</span>
        <span class="nt">--machine-type</span><span class="o">=</span><span class="nv">$INSTANCE_TYPE</span> <span class="se">\</span>
        <span class="nt">--boot-disk-size</span><span class="o">=</span><span class="nv">$HDD</span> <span class="se">\</span>
        <span class="nt">--metadata</span><span class="o">=</span><span class="s2">"install-nvidia-driver=True"</span> <span class="se">\</span>
        <span class="c"># --preemptible</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Line 2 lists all the active projects you have made. Set the name of the project you want in line 4. If you have a specific zone in mind, set it in line 10; otherwise leave it as is. In line 13 you set the machine type (pick the smaller one if your budget is low); the GPU type is set through <code class="language-plaintext highlighter-rouge">--accelerator</code> in line 26. There are bigger machine types and GPUs as well. In line 19, set the disk size you want. And that’s it. All the variables are set, and they will be used to create a VM for you. You can put the above code segment into a separate script to automate this step.</p>

<p>There’s also one setting (line 30, commented out above) for your VM which is important to know about. The option <code class="language-plaintext highlighter-rouge">preemptible</code> means that your machine will be stopped after 24 hours of continuous running. It can also be preempted (stopped) with 30 seconds’ notice at any time due to high load. This option suits beginners because preemptible instances are cheaper and it prevents extra charges if you forget to stop the instance after using it. The <a href="https://cloud.google.com/compute/docs/instances/preemptible">Preemptible VM Instances</a> page will give you further details.</p>

<p>After the successful VM creation you can check its status displayed here - <a href="https://console.cloud.google.com/compute/">https://console.cloud.google.com/compute/</a>. A green tick will be there to indicate the successful creation and running of the instance. You might get an error about reaching your limit of GPUs or quota. To solve this, you’ll need to request the GPU quota from the quotas menu under IAM &amp; Admin. This SE question will help you out - <a href="https://serverfault.com/q/887256">https://serverfault.com/q/887256</a>. You should have the billing verification done before making a request for the quota increase otherwise it’ll be rejected.</p>

<p>Once you have a green tick in the dashboard against your VM, you’ll be able to SSH into it.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">PROJECT_NAME</span><span class="o">=</span><span class="s2">"gcp-project"</span>
<span class="nb">echo</span> <span class="s2">"Selected project - "</span><span class="nv">$PROJECT_NAME</span>
gcloud config <span class="nb">set </span>project <span class="nv">$PROJECT_NAME</span>

<span class="nb">export </span><span class="nv">INSTANCE_NAME</span><span class="o">=</span><span class="s2">"cntk-docker-vm"</span>
<span class="nb">export </span><span class="nv">ZONE</span><span class="o">=</span><span class="s2">"us-west2-b"</span> <span class="c"># budget: "us-west1-b"</span>

<span class="nb">echo</span> <span class="s2">"Logging in..."</span>
gcloud compute ssh <span class="nt">--zone</span><span class="o">=</span><span class="nv">$ZONE</span> jupyter@<span class="nv">$INSTANCE_NAME</span></code></pre></figure>

<p>The first time, it’ll generate your SSH key and prompt you for a passphrase for it. On subsequent runs, it’ll just ask for that passphrase to log you in.</p>

<p>To start/stop the VM from CLI just follow the given commands.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">PROJECT_NAME</span><span class="o">=</span><span class="s2">"gcp-project"</span>
<span class="nb">echo</span> <span class="s2">"Selected project - "</span><span class="nv">$PROJECT_NAME</span>
gcloud config <span class="nb">set </span>project <span class="nv">$PROJECT_NAME</span>

<span class="nb">export </span><span class="nv">INSTANCE_NAME</span><span class="o">=</span><span class="s2">"cntk-docker-vm"</span>
<span class="nb">export </span><span class="nv">ZONE</span><span class="o">=</span><span class="s2">"us-west2-b"</span> <span class="c"># budget: "us-west1-b"</span>

<span class="c"># Start the VM before you begin working</span>
<span class="nb">echo</span> <span class="s2">"Starting..."</span>
gcloud compute instances start <span class="nt">--zone</span><span class="o">=</span><span class="nv">$ZONE</span> <span class="nv">$INSTANCE_NAME</span>

<span class="c"># Stop the VM once you are done (run this separately, not right after starting)</span>
<span class="nb">echo</span> <span class="s2">"Stopping..."</span>
gcloud compute instances stop <span class="nt">--zone</span><span class="o">=</span><span class="nv">$ZONE</span> <span class="nv">$INSTANCE_NAME</span></code></pre></figure>

<h2 id="cntk-docker">CNTK Docker</h2>

<p>Docker is like lightweight virtual machine software. It runs containers. A container is a running instance of an application image, which has all the libraries, tools and configuration needed to run the application. This saves the developer’s time by setting up the development environment and its dependencies automatically.</p>

<p>At first, I tried the CNTK Docker image made by Microsoft. I tried installing the NVIDIA Docker image with Python 3.5. Sadly, this gave me an error similar to the one I got when I tried installing CNTK on my own laptop. What’s the point of providing a Docker image if you still get errors while or after installing it, the same errors you’d have faced installing it yourself?</p>

<p>I wasted a few hours on this because I thought I was making a mistake, as it was my first time working with Docker. I deleted my VM instance, created a new one and followed the same steps, but hit the same issue again. Then my colleague gave me a link to this DL Docker image - <a href="https://github.com/ufoym/deepo">Deepo</a>. I installed the GPU version and it worked on the first try. I thank the Deepo guy(s) for providing the Docker images.</p>

<p>Instruction to pull the docker image-</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker pull ufoym/deepo</code></pre></figure>

<p>The following code segment launches the GPU Docker image once it has been pulled. It can also be put into a script to launch the container in one go.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Check that the GPU is visible from inside a Deepo container.</span>
nvidia-docker run <span class="nt">--rm</span> ufoym/deepo nvidia-smi

<span class="c"># Run the docker image + sharing the data b/w VM and the container</span>
<span class="c"># This will mount the ~/working_dir directory at /working_dir in the Docker container</span>
nvidia-docker run <span class="nt">-it</span> <span class="nt">-v</span> ~/working_dir:/working_dir ufoym/deepo bash</code></pre></figure>

<p>The above instructions and more (Jupyter, CPU version of the image, customization) are listed in the readme of the <a href="https://github.com/ufoym/deepo">Deepo repo</a>.</p>]]></content><author><name>Shivam Rana</name></author><category term="DL" /><summary type="html"><![CDATA[Try to install CNTK on a Linux machine, if you get it working you are very lucky. I tried it on three - I was unsuccessful each time. A colleague suggested me to try a CNTK docker image. Since I also needed a GPU, I decided to do it on GCP. Why I chose GCP is because it gives you $300 worth of free credits valid for 12 months.]]></summary></entry><entry><title type="html">(Mis)adventures of Building a Chat Bot</title><link href="https://trigonaminima.github.io/2018/10/chatbot/" rel="alternate" type="text/html" title="(Mis)adventures of Building a Chat Bot" /><published>2018-10-06T00:00:00+00:00</published><updated>2018-10-06T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/10/chatbot</id><content type="html" xml:base="https://trigonaminima.github.io/2018/10/chatbot/"><![CDATA[<p>In a Telegram group of three, a fourth member was added - GB - a Telegram bot. GB is the most useless, unavailing and noisy bot there is, but it’s fun. I started working on it because I have always been interested in building chat bots, but never had the chance. Secondly, and this is rather ambitious thinking, I want to build an NLP system able to understand and act on natural language. Lastly, and this was more of a realization once I started my work on GB, it is an opportunity to build NLP tools and techniques for Hindi+English (or <a href="https://en.Wikipedia.org/wiki/Hinglish">Hinglish</a>). In the process, I’ll be working on both basic and some advanced NLP techniques. And, I’ll try to implement everything from scratch.</p>

<p><em>Note: Since our colloquial language in chats is transliterated Hindi + English, GB’s commands are also <a href="https://en.Wikipedia.org/wiki/Desi">desi</a> (read - <a href="/2018/06/hinglish-and-transliteration/">Hinglish and Transliteration</a>). GB’s code is also laden with desi variable, function, class, and file names. I’ll explain what the terms mean whenever I use them.</em></p>

<p>I have been working on and off on this bot for the past 3-4 months now (243 commits). I have the basic structure ready. This post discusses everything GB has so far.</p>

<p>The echo function is the “Hello World” code for bots. Whatever anyone says, it’ll repeat that verbatim. This was the first commit on GB. The night before, we were discussing what commands should be present. The Telegram bot API provides a command interface for bots - a command is a word starting with “/” - and it’s easy to set those up. After echo testing, I added some of the commands discussed - <code class="language-plaintext highlighter-rouge">/random</code>, <code class="language-plaintext highlighter-rouge">/yaaddila</code>, <code class="language-plaintext highlighter-rouge">/yaadkar</code>, <code class="language-plaintext highlighter-rouge">/bhulja</code>, <code class="language-plaintext highlighter-rouge">/gaali</code>, <code class="language-plaintext highlighter-rouge">/ashleellaundakaun</code>.</p>

<h2 id="commands">Commands</h2>

<h3 id="random">Random</h3>

<p>All the group members have known each other since college. Each of us has been part of some embarrassing situation where one said or did something silly and the others made fun of him. This leg pulling still continues. We have prepared a list of such statements, and whenever we use <code class="language-plaintext highlighter-rouge">/random</code> (now called <code class="language-plaintext highlighter-rouge">/r</code>), GB presents a random one, reminding us of the incident. We often use it to roast each other.</p>

<h3 id="yaaddila-yaadkar-bhulja">Yaaddila, Yaadkar, Bhulja</h3>

<p>These, IMO, are the most practical commands in GB. Sadly, they are rarely used now. Their meanings are - remind me, keep this in mind, forget about it - respectively. Whenever there are some important bits that we don’t want to remember, but occasionally feel the need to use, we add them using <code class="language-plaintext highlighter-rouge">/yaadkar</code>. Information like our server IP, our blog links, etc. is currently added here. Using <code class="language-plaintext highlighter-rouge">/yaaddila</code>, we can retrieve that info by giving the key. And, if the information is not needed anymore, <code class="language-plaintext highlighter-rouge">/bhulja</code> is used. All the key-value pairs are saved in a JSON file. The commands were later shortened to - <code class="language-plaintext highlighter-rouge">/yd</code>, <code class="language-plaintext highlighter-rouge">/yk</code>, <code class="language-plaintext highlighter-rouge">/bj</code>.</p>
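
<p>A minimal sketch of how such a JSON-backed key-value store could look. The file name <code class="language-plaintext highlighter-rouge">yaad.json</code> and the function bodies are assumptions made for illustration, not GB’s actual code.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import json
from pathlib import Path

STORE = Path("yaad.json")  # hypothetical file name


def _load():
    # Return the saved pairs, or an empty dict if nothing is saved yet.
    if STORE.exists():
        return json.loads(STORE.read_text(encoding="utf-8"))
    return {}


def _save(pairs):
    STORE.write_text(json.dumps(pairs, ensure_ascii=False, indent=2), encoding="utf-8")


def yaadkar(key, value):
    """Remember: store value under key."""
    pairs = _load()
    pairs[key] = value
    _save(pairs)


def yaaddila(key):
    """Remind: return the stored value, or None if the key is unknown."""
    return _load().get(key)


def bhulja(key):
    """Forget: delete the key and report whether it existed."""
    pairs = _load()
    existed = pairs.pop(key, None) is not None
    _save(pairs)
    return existed
</code></pre></figure>

<p>Reading and rewriting the whole file on every call is wasteful, but for a handful of keys in a small group chat it keeps the code trivial and the data human-readable.</p>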

<h3 id="gaali">Gaali</h3>

<p>This is one of the most entertaining commands. Even though it just adds noise in the group, its usage is generally chucklesome. Gaali in Hindi means a cuss word. When you command GB using <code class="language-plaintext highlighter-rouge">/gaali</code> it selects a random expletive. If you use <code class="language-plaintext highlighter-rouge">/gaali &lt;username&gt;</code>, GB cusses at that user. This is the basic usage. The command was later changed to <code class="language-plaintext highlighter-rouge">/g</code>.</p>

<p>To give GB a Swearing-101, I first created a list of the usual dirty phrases that my friends use. It got repetitive after some time, so I found a bigger, open-source list - <a href="https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words">Our List of Dirty, Naughty, Obscene, and Otherwise Bad Words</a>. I took Hindi and English ones. It was not repetitive anymore, but there were many words in that list which are generally bad, but do not make sense in our context. Consequently, sometimes the random gaali that GB replied with, did not make sense at all. At times it was funny - the dumbness of the bot, not the usage of the wrong gaali - but usually, it denied us the hilarity that would have come if the <em>right</em> abuse had come. So, we culled the list; added some more forms of the hindi abuses. Now the results were better. My friends occasionally bring/invent new phrases, so this list keeps getting longer.</p>

<p>There is also one easter egg in /g. If someone tries to use /g to abuse GB itself (using <code class="language-plaintext highlighter-rouge">/g GB</code>), GB retaliates by abusing that user instead.</p>

<p><em>Unsolved Problem 1</em>: There is still one hiccup with the current list. This list contains various forms of abuses - singular, plural, masculine, feminine, &lt;word&gt;-ing, and other forms - in both Hindi and English. Hence, when the command is given, which expletive gets selected in what context is random. Sometimes it fits, sometimes it makes no sense. Although it’s not a priority now, I think this should be part of the <em>intelligence</em> of the bot.</p>

<p><em>Unsolved Problem 2</em>: Some words can be both expletives and non-expletives. Currently, every use is counted as a bad word, which is wrong. For example, the name of the author of <a href="https://en.Wikipedia.org/wiki/Do_Androids_Dream_of_Electric_Sheep%3F">Do Androids Dream of Electric Sheep?</a> - <a href="https://en.Wikipedia.org/wiki/Philip_K._Dick">Philip K. Dick</a> - contains the word dick, which we all know is a bad word. In the current scenario, using the name will increase the count, but it should not be counted.</p>

<h3 id="ashleellaundakaun-ashleel-laundia-kaun">Ashleellaundakaun (Ashleel laund(i)a kaun)</h3>

<p><em>Who is the dirtiest guy?</em> That is what <em>ashleel laund(i)a kaun</em> means. GB constantly monitors the chats and maintains a count of expletives used by each user which we can retrieve using <code class="language-plaintext highlighter-rouge">/ashleellaundakaun</code> (later renamed to <code class="language-plaintext highlighter-rouge">/a</code>). It’ll also use @mention to mark the topper so that a notification goes to that user specifically.</p>

<p>This part was somewhat challenging, mostly because of a lot of edge cases. The basic implementation was just an exact string match of the reply in the vocabulary. I’ll list the edge cases and their solutions-</p>

<ol>
  <li><strong>Limited vocab</strong>, so not every cuss word was being counted - inserted the vocabulary of dirty words from the <a href="https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words">GitHub repo</a> to make a master list. Initially this list and the list used in the /gaali command were different; later, they were combined.</li>
  <li><strong>Profanities could be mid-sentence</strong> - split the reply into words and then check each one in the list.</li>
  <li><strong>Multiple word abuses</strong> - prepare bi-grams and tri-grams from the reply and then check each one in the master list.</li>
  <li><strong>Single word abuses split into multiple words</strong> - join the words of each n-gram (n goes from 1 to the number of words in the reply) into a single word and then check it against the master list. Eg. “mother bugger” to “motherbugger”.</li>
  <li><strong>Making a cuss word unnecessarily long</strong> by repeating the characters - reduce the repeating characters to a single character. Eg. “shiiiiiiitt” to “shit”.</li>
  <li><strong>Using accented characters in an abuse</strong> - normalize the characters to ASCII characters. Eg. “shít” to “shit”.</li>
  <li><strong>Using Devanagari (raw Hindi) instead of transliterated Hindi</strong>. Gboard provides a keyboard for Indic languages which lets you type transliterated Hindi and gives you the Devanagari version. (Did I mention that my friends are good at finding workarounds?) - Here we made an elementary transliteration engine (a bit better than elementary). Detect whether the word in the reply is written in Devanagari; if yes, transliterate it to get the top 10 transliterations and check if any one of those is in our master list.</li>
  <li><strong>Wrongly spelled abuse</strong> - while building the spelling corrector (discussed in the next section), added the rule that if a corrected word is in the master list, the count for that user is increased.</li>
</ol>
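
<p>A rough sketch of how a few of these checks (accents, repeated characters, n-grams glued into single words) could be combined. The two-entry master list and all function names are made up for illustration, and the Devanagari and spelling-correction steps are left out. Note the assumption that the master list is squeezed the same way as the message, otherwise an entry with a legitimate double letter would never match.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import re
import unicodedata


def strip_accents(text):
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def squeeze(text):
    # Collapse every run of a repeated character into one character.
    return re.sub(r"(.)\1+", r"\1", text)


# Tiny stand-in for the master list; both sides are squeezed before comparing.
MASTER_LIST = {"shit", "mother bugger"}
SQUEEZED_MASTER = {squeeze(entry.replace(" ", "")) for entry in MASTER_LIST}


def ngrams_glued(message):
    # Yield every run of consecutive words, glued into one squeezed string.
    words = re.findall(r"\w+", strip_accents(message.lower()))
    for n in range(1, len(words) + 1):
        for start in range(len(words) - n + 1):
            yield squeeze("".join(words[start:start + n]))


def count_profanities(message):
    # Number of distinct master-list entries found in the message.
    return len(SQUEEZED_MASTER.intersection(ngrams_glued(message)))
</code></pre></figure>

<p>Squeezing is lossy: it makes words that differ only by a doubled letter indistinguishable, so a real list would probably need exceptions for such pairs.</p>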

<p>After this much work on this one command, one friend exclaimed -</p>

<blockquote>
  <p>This must be the most exhaustive gaali detector for hinglish ever made.</p>
</blockquote>

<p>While working on the transliteration bit, I got to know a few things about the world of transliteration and specifically, about Hindi transliteration.</p>

<ul>
  <li>There is no single transliteration. Colloquially, there are multiple transliteration versions in use.</li>
  <li>There are standard transliteration systems which try to give unambiguous transliterations, but it’s still challenging because some sounds are not present in English, so they have to use dots or a mix of capital and small letters.</li>
  <li>Transliteration can be between any two language scripts. Transliteration to English is called Romanization. So here we are talking about Devanagari Romanization.</li>
  <li>There is not much available - both research and data - on the transliteration of Hindi into English. There are one or two labs doing some research on the Hindi language, but I am not sure if transliteration is a part of their research.</li>
  <li>Google has a deprecated (but still functional) API which returns the top n transliterations (or all, if the parameter is given).</li>
  <li>There are some ways of training neural network models to get a transliteration engine, but they need labeled data to train, which is hard to come by. However, in a <a href="https://developer.amazon.com/blogs/alexa/post/ec66406c-094c-4dbc-8e9f-01050b27d43d/automatic-transliteration-can-help-alexa-find-data-across-language-barriers">recent research by Amazon Alexa researchers</a> working on <em>named-entity transliteration</em>, the training data was found in Wikipedia data. They took advantage of the fact that a person’s wiki page usually contains their name in multiple languages. Thus, they have a mapping between 2 scripts. They didn’t do it on Hindi though.</li>
</ul>

<p>We’ll come back to transliteration.</p>

<h2 id="spelling-correction">Spelling Correction</h2>

<p>We all know what spelling correction is. It’s usual to make a few typos in a conversation. So, I had the <em>brilliant idea</em> of building one from scratch. Since it’d be for Hinglish and not just for English, I thought it’d be fun. Let me tell you how fun it was.</p>

<p>I only knew about the dictionary based spelling correctors. You’ll have a dictionary of possible words, against which, you’ll check a given word. If it’s there then good otherwise you have just caught a wrong spelling. Now you have to correct it. Correction can be done using a string similarity metric. I don’t like this approach though. So, I googled how to build one. The first result was a <a href="https://norvig.com/spell-correct.html">toy Spelling Corrector</a> written by <a href="https://en.Wikipedia.org/wiki/Peter_Norvig">Peter Norvig</a>. I had seen this post before, but somehow, I didn’t remember the details (no wonder I felt the need for /yaaddila command in GB). The best part of the post was that he had presented it in just 36 lines of code achieving around 70-75% accuracy. I integrated it within GB.</p>
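
<p>For reference, here is a condensed sketch in the spirit of that toy corrector. It is my paraphrase of the idea, not the exact 36 lines, and the one-sentence corpus is only a stand-in for the real text files.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    # Every string one delete, transpose, replace or insert away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if right[1:]]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(candidates, counts):
    return {w for w in candidates if w in counts}


def correct(word, counts):
    # Prefer the word itself, then known words one edit away, then two edits away.
    options = (known([word], counts) or known(edits1(word), counts)
               or known(edits2(word), counts) or {word})
    return max(options, key=lambda w: counts[w])


counts = Counter(words("the receipt was lost and the address was wrong"))
print(correct("recipt", counts))  # receipt
</code></pre></figure>

<p>The whole language model is just the word counts, which is why the quality of the text files matters so much in what follows.</p>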

<p>For GB’s correction engine, I used the biggest file Dr. Norvig had used in his toy corrector. This was just the English data - around 6.2 MB in size. To correct Hindi spellings, I needed transliterated Hindi data. I had a lot of Hinglish data lying around from my <a href="/2018/04/chatting-up-2/">FB/WP chats</a>. Additionally, I asked my friends to give me some more. There was around 4 MB of Hinglish data in total. So I had 2 language text sources - <code class="language-plaintext highlighter-rouge">english.txt</code> and <code class="language-plaintext highlighter-rouge">hindi.txt</code>. And, I made the system live. lol.</p>

<h3 id="flaws">Flaws</h3>

<p>I was very fortunate that my friends didn’t kick me and GB out of the group. How ironic: the thing that is supposed to correct your mistakes was making more mistakes than you. On some of the obvious cases it worked as intended, but on the others, it failed miserably. There was an incident when one of my friends got so annoyed that he cussed at GB, and GB in return corrected the cuss-word into something ridiculous. With daily usage, we started seeing the flaws in the correction engine.</p>

<ol>
  <li>
    <p><strong>Small Vocabulary</strong>: We were lacking in both English &amp; Hindi data. For example, our discussions in the group largely revolve around computer science and technology, and GB constantly faltered on such CS related words. Examples -</p>

    <table>
      <tbody>
        <tr>
          <td><em>hosting</em></td>
          <td>-&gt;</td>
          <td><em>costing</em></td>
        </tr>
        <tr>
          <td><em>browser</em></td>
          <td>-&gt;</td>
          <td><em>brother</em></td>
        </tr>
        <tr>
          <td><em>corrector</em></td>
          <td>-&gt;</td>
          <td><em>correct</em></td>
        </tr>
        <tr>
          <td><em>dedup</em></td>
          <td>-&gt;</td>
          <td><em>deep</em></td>
        </tr>
        <tr>
          <td><em>dict</em></td>
          <td>-&gt;</td>
          <td><em>duct</em></td>
        </tr>
        <tr>
          <td><em>concat</em></td>
          <td>-&gt;</td>
          <td><em>coat</em></td>
        </tr>
        <tr>
          <td><em>linux</em></td>
          <td>-&gt;</td>
          <td><em>line</em></td>
        </tr>
        <tr>
          <td><em>notification</em></td>
          <td>-&gt;</td>
          <td><em>ratification</em></td>
        </tr>
      </tbody>
    </table>

    <p>So I added a few Wikipedia pages on CS topics - Data Structures and algorithms, Machine learning related pages, etc. I also added a few wiki pages having information about India - Indian sub-continent, Indian governmental bodies, Indian dishes, etc.</p>

    <p>Since this vocabulary enhancement was going to be a recurring activity, I picked Wikipedia as my base source. Whenever GB made some English mistake, I looked for that word’s page, or a page containing that word, and added it to the base file. This had a few advantages:</p>

    <ul>
      <li>I got different forms of the same word;</li>
      <li>I also got other new words;</li>
      <li>I got sentences where this word is in use (this will help in n-grams based approach discussed later)</li>
    </ul>

    <p>For Hindi, I tried to find some blog post having that word and added that to the <code class="language-plaintext highlighter-rouge">hindi.txt</code>. Otherwise, it went to a new file <code class="language-plaintext highlighter-rouge">newvocab.txt</code>.</p>

    <p>Manually adding the vocabulary was a drag. Find the wiki page, copy the content to <code class="language-plaintext highlighter-rouge">english.txt</code>, recalculate the word frequencies and then restart the bot. So I made a command - <code class="language-plaintext highlighter-rouge">/new</code>, which when given a keyword, searches for the keyword on Wikipedia; if found, it adds the vocab to the base file and then recalculates and reloads the frequency counts for the bot. If the word is not found on wiki, it adds the word to <code class="language-plaintext highlighter-rouge">newvocab.txt</code> and reloads the freq counts. This made things easy for me.</p>

    <p>To take it a step further, now, whenever GB gives a correction for something, it also gives 3 buttons along with it. With the help of this feedback, I am also able to collect the corrector evaluation data.</p>

    <ul>
      <li><em>Thumbs up</em> means the correction is right;</li>
      <li><em>Plus</em> means the correction is wrong because the word is not in its vocabulary, and adds the word to the <code class="language-plaintext highlighter-rouge">newvocab.txt</code>;</li>
      <li><em>Thumbs down</em> means the correction is completely wrong and the original word is also wrong, so it should not be added to the vocab either.</li>
    </ul>

    <p><img src="https://trigonaminima.github.io/assets/2018-10/correction_feedback.jpeg" alt="correction_feedback" /></p>
  </li>
  <li>
    <p><strong>Nonsensical Corrections</strong>: The <code class="language-plaintext highlighter-rouge">english.txt</code> has some Sherlock Holmes stories and some Shakespeare plays too, so there were many names and some classical English words. In <code class="language-plaintext highlighter-rouge">hindi.txt</code>, there was a long tail of words with frequency 1, many of which were wrong spellings or rarely used Hindi words. All these infrequently used words, names and wrong words got matched against the words (both new and wrongly spelled) used during the conversation and gave a lot of nonsensical corrections. For example-</p>

    <ul>
      <li><em>paagal</em> -&gt; <em>paagl</em> (Hindi for crazy; paagal is correct; paagl is nonsense)</li>
      <li><em>tminima</em> -&gt; <em>minims</em> (tminima is my username on Telegram; don’t know what minims is)</li>
      <li><em>raspbot</em> -&gt; <em>rasbt</em> (no idea what rasbt is)</li>
      <li><em>thissss</em> -&gt; <em>issss</em> (should have been corrected to <em>this</em>; issss is nonsense)</li>
      <li><em>fneoe</em> -&gt; <em>fnese</em> (both words are nonsense)</li>
    </ul>

    <p>The solution to this was filtering out the frequency 1 words from the vocabulary. This worked better than I had expected.</p>
  </li>
  <li>
    <p><strong>Sanitization and Tokenization</strong>: Due to a very basic text sanitization (stripping of white spaces and special characters) and tokenization (simple split on spaces and other punctuation) there were many unwanted corrections.</p>

    <ol>
      <li>
        <p>It corrected parts of URLs. Eg. in <em>https://github.com/dwyl/english-words</em> it corrected <em>dwyl</em> to <em>dwy</em>. This was a nonsensical correction of a URL. Incredible. Added a regex to remove URLs from the text. Similarly, regex based filtering was done for emails, reddit subs, twitter handles and hashtags.</p>
      </li>
      <li>
        <p>It corrected strings like <em>hahahahahah</em> to the closest match. Added regexes for such common strings - looooool, dayyyyyyum, etc - to replace them with a standard version. In this case, it replaces <em>hahahahahah</em> with <em>haha</em>.</p>
      </li>
      <li>
        <p>Corrections for the numbers and words starting or ending with numbers. Example, it corrected <em>3C2</em> (Permutation and Combination notation for choosing 2 out of 3) to <em>32</em>. Removed all numbers and words with numbers in them from the vocabulary.</p>
      </li>
    </ol>
  </li>
  <li>
    <p><strong>Morphology</strong>: Due to the absence of different forms of some words, they got corrected to the form that was present in the language model. Eg -</p>

    <table>
      <tbody>
        <tr>
          <td><em>standardised</em></td>
          <td>-&gt;</td>
          <td><em>standardized</em></td>
        </tr>
        <tr>
          <td><em>phonetics</em></td>
          <td>-&gt;</td>
          <td><em>phonetic</em></td>
        </tr>
        <tr>
          <td><em>conversions</em></td>
          <td>-&gt;</td>
          <td><em>conversion</em></td>
        </tr>
        <tr>
          <td><em>generated</em></td>
          <td>-&gt;</td>
          <td><em>generate</em></td>
        </tr>
      </tbody>
    </table>

    <p>For now, I have handled this in a very hacky way. For each pair of original word and correction, I remove the suffix (from a list of suffixes) from both and check whether the results are the same; if they are, I reject the correction. This way of handling morphology is very fragile. I had to handle the possessive/non-possessive forms differently. I’d like to work on this more in future.</p>
  </li>
</ol>
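
<p>A small sketch of how some of the fixes above could fit together: URL removal, laughter normalization, dropping tokens with digits, the frequency-1 filter and the suffix check. The regexes, the suffix list and the function names are illustrative guesses, not the ones GB uses.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import re
from collections import Counter

URL = re.compile(r"https?://\S+")
LAUGH = re.compile(r"\b(?:ha){2,}h?\b")
SUFFIXES = ("ised", "ized", "ing", "ed", "es", "d", "s")  # illustrative only


def sanitize(text):
    # Lower-case, drop URLs, shorten laughter, and keep purely alphabetic tokens.
    text = URL.sub(" ", text.lower())
    text = LAUGH.sub("haha", text)
    return [tok for tok in re.findall(r"\w+", text) if tok.isalpha()]


def build_vocab(tokens):
    # Keep only words seen more than once; the frequency-1 tail is mostly noise.
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n != 1}


def stems(word):
    # The word itself plus every form left after removing one known suffix.
    return {word} | {word[:-len(s)] for s in SUFFIXES if word.endswith(s) and word != s}


def same_stem(original, correction):
    # True when both words share a stem, i.e. the correction is just another form.
    return not stems(original).isdisjoint(stems(correction))
</code></pre></figure>

<p>With this suffix list, <code class="language-plaintext highlighter-rouge">same_stem("generated", "generate")</code> is true, so that correction would be rejected, while <em>hosting</em> and <em>costing</em> share no stem. Short suffixes such as “d” and “s” will also make unrelated pairs look alike, which is one reason the approach is fragile.</p>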

<p><em>Unsolved Problem 3</em>: It is not able to correct a word which exists in the vocabulary, but is used incorrectly. Eg. in the sentence <em>you know shell be fine</em>, every word is spelled correctly, but the usage of shell is wrong. It should be <em>you know she’ll be fine</em>. In another example, <em>the ONLY issue is, ki word cound ka indication kaise hoga!</em>, “cound” is a wrong word and doesn’t exist in the vocabulary, but GB corrected it to <em>could</em> instead of <em>count</em>. This indicates that context is important for us. Dr. Norvig places a lot of trust in the context based approach. He <a href="http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb">discusses</a> an n-gram based approach to use context. I am yet to implement it, but it should solve this problem.</p>
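
<p>To make the idea concrete, here is a deliberately simplified sketch that ranks candidate corrections by how often they follow the previous word. It is not Dr. Norvig’s method (which uses proper probabilities and smoothing), and the toy sentences and function names are invented for the example.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from collections import Counter


def train(sentences):
    # Count single words and adjacent word pairs in the training sentences.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pick(previous, options, unigrams, bigrams):
    # Rank candidates by how often they follow the previous word, then by overall frequency.
    return max(options, key=lambda w: (bigrams[(previous, w)], unigrams[w]))


unigrams, bigrams = train([
    "we could go now",
    "they could not come",
    "it could work",
    "the word count is wrong",
    "check the word count again",
])
print(pick("word", ["could", "count"], unigrams, bigrams))  # count
</code></pre></figure>

<p>In this toy data <em>could</em> is the more frequent word overall, so a frequency-only model picks it, while the bigram counts prefer <em>count</em> after <em>word</em>.</p>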

<p><em>Unsolved Problem 4</em>: The error model in the spelling corrector is pretty basic - select the most probable of the valid words at edit distance 1; if there are no valid words, move to the words at edit distance 2; if there are no such words either, do not correct it. First preference is thus given to the words at edit distance 1 even if the correct spelling is among the words at edit distance 2. Every correction GB gives comes from this error model. For example, have a look at the following cases-</p>

<ul>
  <li>
    <p>Wrong spelling: reciet
  Actual spelling: receipt
  GB’s spelling: recite</p>
  </li>
  <li>
    <p>Wrong spelling: adres
  Actual spelling: address
  GB’s spelling: acres</p>
  </li>
  <li>
    <p>Wrong spelling: yeau
  Actual spelling: yeah
  GB’s spelling: year</p>
  </li>
  <li>
    <p>Wrong spelling: usje
  Actual spelling: uske (“uske” means “his”)
  GB’s spelling: use</p>
  </li>
</ul>
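
<p>The behaviour in these cases follows directly from the distance-first rule. The sketch below is written in the style of Norvig’s well-known corrector, with a tiny made-up vocabulary; the counts in <code class="language-plaintext highlighter-rouge">WORDS</code> are assumptions and GB’s real code may differ.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Tiny stand-in vocabulary; the counts are made up for illustration.
WORDS = Counter({"recite": 5, "receipt": 3, "acres": 4, "address": 6})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if right[1:]]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words present in the vocabulary."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Prefer known words at distance 0, then 1, then 2; else leave unchanged."""
    edits2 = (e2 for e1 in edits1(word) for e2 in edits1(e1))
    candidates = known([word]) or known(edits1(word)) or known(edits2) or [word]
    return max(candidates, key=lambda w: WORDS[w])


print(correct("reciet"), correct("adres"))
</code></pre></div></div>

<p>Here <em>reciet</em> becomes <em>recite</em> and <em>adres</em> becomes <em>acres</em>, because a distance 1 match always wins before any distance 2 candidate is even considered.</p>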

<p>In the case of <em>adres</em>, the error model can be updated so that the two edits of “d” to “dd” and “s” to “ss” together have a higher probability than the single edit of “d” to “c”.</p>

<p>The error model can also be updated by taking the layout of the QWERTY keyboard into account. In the case of <em>yeau</em>, “h” is closer to “u” than “r” is to “u”, so typing “u” in place of “h” should have a higher probability. Similarly, in <em>usje</em>, “k” is adjacent to “j”, so replacing “j” with “k” should be given a higher probability than deleting “j”. I’ll look into this once I have implemented the n-gram based approach.</p>
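
<p>One way to express keyboard closeness is a plain distance between key positions. This is a rough sketch, not something GB does today: the half-key row shift is only an approximation of a real keyboard’s stagger, and the names <code class="language-plaintext highlighter-rouge">key_distance</code> and <code class="language-plaintext highlighter-rouge">closest_intended_key</code> are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Approximate QWERTY geometry: each row is assumed to sit half a key to the
# right of the row above it (a simplification of the real stagger).
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
POSITION = {
    key: (row_index, column + 0.5 * row_index)
    for row_index, row in enumerate(ROWS)
    for column, key in enumerate(row)
}


def key_distance(a, b):
    """Euclidean distance between two letter keys, measured in key widths."""
    (row_a, col_a), (row_b, col_b) = POSITION[a], POSITION[b]
    return ((row_a - row_b) ** 2 + (col_a - col_b) ** 2) ** 0.5


def closest_intended_key(typed, options):
    """Among the possible intended letters, return the one nearest to `typed`."""
    return min(options, key=lambda key: key_distance(typed, key))


print(closest_intended_key("u", "hr"))
</code></pre></div></div>

<p>For the typed “u” in <em>yeau</em> this prefers “h” over “r”, which favours <em>yeah</em> over <em>year</em>. The distance could then be turned into a substitution probability inside the error model.</p>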

<p><em>Unsolved Problem 5</em>: It’s not able to handle contractions properly. Basically, I want to correct <em>cant</em> to <em>can’t</em>, <em>hes</em> to <em>he’s</em> and so on. The tokenizer extracts these words from the base data, but the error model doesn’t give the right results. I haven’t investigated the results in depth yet, but I am hoping n-grams will solve it. Another way: the list of contractions is fixed, so each of them can be matched and corrected individually.</p>
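
<p>The fixed-list idea is simple enough to sketch. The mapping below is a small hand-written sample, not a complete list, and <code class="language-plaintext highlighter-rouge">fix_contractions</code> is a hypothetical helper. Ambiguous forms such as <em>shell</em> or <em>well</em> are left out on purpose, since they are valid words and need context.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A small, hand-written sample; a real list would be longer.
CONTRACTIONS = {
    "cant": "can't",
    "hes": "he's",
    "dont": "don't",
    "shes": "she's",
    "im": "I'm",
}


def fix_contractions(sentence):
    """Replace each token found in the lookup table; leave the rest untouched.

    Matching is case-insensitive, so the original capitalisation of a
    replaced token is not preserved.
    """
    return " ".join(
        CONTRACTIONS.get(token.lower(), token) for token in sentence.split()
    )


print(fix_contractions("he cant come but hes trying"))
</code></pre></div></div>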

<h3 id="implementation-challenge">Implementation Challenge</h3>

<p>My workflow while working on GB is to code and test it on my personal laptop and then push it to the server for deployment. Now there’s a major difference between the two machines: my laptop has 8 GB RAM, 12 GB swap and 4 cores, while the server has 512 MB RAM, 216 MB swap and 2 cores. I am not getting into the reasons, but I had to make do with this.</p>

<p>Initially, for the spelling correction, the <code class="language-plaintext highlighter-rouge">english.txt</code> and <code class="language-plaintext highlighter-rouge">hindi.txt</code> were processed every time the bot started: frequencies were calculated and then it was ready to make corrections. As the vocabulary grew, this process got slower and consumed more memory. I was okay with this as it was done only once during the bot startup. One day, the data grew to the point that the kernel killed the process during launch. That day I learned that the Linux kernel kills a process when the machine runs out of memory for it.</p>

<p>I checked the data files; they were hardly 30-40 MB in total. Loading them should not make the server run out of resources. The problem was that I was keeping all the loaded data in the global scope. So I defined a counter preparation function which returns only the prepared counter, so that all the loaded data is freed automatically when the function ends. Then I made the process iterative, so that it reads, processes and updates the counter one line at a time. This made things better.</p>
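
<p>A line-at-a-time counter build could look something like this. It is a sketch under assumptions: the function name <code class="language-plaintext highlighter-rouge">build_counter</code> and the simple tokenizer regex are invented for illustration and are not GB’s real tokenizer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
from collections import Counter


def build_counter(paths):
    """Count words one line at a time so no file is ever held in memory whole.

    Only the returned Counter survives; every line goes out of scope as soon
    as it has been counted.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Assumed tokenizer: lowercase runs of letters and apostrophes.
                counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts
</code></pre></div></div>

<p>It would be called once at startup, for example as <code class="language-plaintext highlighter-rouge">build_counter(["english.txt", "hindi.txt"])</code>.</p>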

<p>To make the process more efficient, I pickled the prepared counter, and whenever the bot restarted it just loaded that. This eliminated the data processing on every bot startup, and it worked very smoothly. It only had one flaw: whenever the vocabulary was updated, I had to restart the bot for the new vocabulary to come into effect.</p>
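
<p>The caching step can be sketched with the standard <code class="language-plaintext highlighter-rouge">pickle</code> module. The helper name <code class="language-plaintext highlighter-rouge">load_or_build</code> and its arguments are assumptions made for this example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import pickle
from collections import Counter


def load_or_build(pickle_path, build):
    """Load a cached counter if one exists; otherwise build it and cache it.

    `build` is any zero-argument function returning a Counter.
    """
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    counts = build()
    with open(pickle_path, "wb") as handle:
        pickle.dump(counts, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return counts
</code></pre></div></div>

<p>Only pickles the bot has written itself should be loaded this way, since unpickling untrusted data is unsafe.</p>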

<p>For updated vocabulary to take effect without restarting the bot, I made a class and added vocabulary update methods. At a time, there’s only a single object of this class in the memory. So whenever a new Wikipedia page was added, it added the text to the <code class="language-plaintext highlighter-rouge">english.txt</code> and also updated the counter currently in the memory and updated the saved pickle. So, all the updates are iterative now, improving the vocabulary and bringing it into effect at the same time without restarting the bot.</p>
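
<p>In outline, such a class might look like the sketch below. The class name <code class="language-plaintext highlighter-rouge">Vocabulary</code>, the method <code class="language-plaintext highlighter-rouge">add_text</code> and the tokenizer regex are all hypothetical; the point is only that one call touches the text file, the in-memory counter and the pickle together.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import pickle
import re
from collections import Counter


class Vocabulary:
    """Keeps the corpus file, the in-memory counter and the pickle in sync."""

    def __init__(self, text_path, pickle_path):
        self.text_path = text_path
        self.pickle_path = pickle_path
        self.counts = Counter()
        # Reuse the saved counter when there is one.
        if os.path.exists(pickle_path):
            with open(pickle_path, "rb") as handle:
                self.counts = pickle.load(handle)

    def add_text(self, text):
        """Append new text to the corpus and make it usable immediately."""
        with open(self.text_path, "a", encoding="utf-8") as handle:
            handle.write(text + "\n")
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        self.counts.update(re.findall(r"[a-z']+", text.lower()))
        with open(self.pickle_path, "wb") as handle:
            pickle.dump(self.counts, handle)
</code></pre></div></div>

<p>Creating the object once at startup and sharing it gives the single in-memory instance described above.</p>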

<p><em>Unsolved Problem 6</em>: Even now, it loads the whole prepared counter into memory. As its size increases, loading will take more time. I haven’t yet thought about how to make it use less memory. It’ll take a while to reach that size though, so this issue has taken a back seat.</p>

<h2 id="data">Data</h2>

<p>Let’s talk about data. If you read the previous section on spelling correction, then you know we need more data, and particularly more Hindi (or Hinglish) data, so that we have a good Hindi vocabulary along with a good English vocabulary. Most of the wrong Hindi corrections happen because the word is absent from GB’s vocabulary.</p>

<p>Towards this vocabulary building goal, here is the list of sources I have included so far:</p>

<ul>
  <li>
    <p>All the articles of this blog - <a href="http://memsahibinindia.com/">memsahibinindia</a>. It’s mostly written in English, but the author talks a lot about India and describes the dishes and places of India. She also uses Hindi words at times.</p>
  </li>
  <li>
    <p>All the articles of this blog - <a href="https://blogs.transparent.com/hindi/">blogs.transparent.com/hindi</a>. This blog aims at explaining Hindi words in English by using them in a narrative. It first lists the English translation of a word, followed by the word in Devanagari and, within brackets, its transliterated version. Since the target audience is Hindi learners, it tries to introduce many new Hindi words, in effect giving us more Hindi vocabulary.</p>
  </li>
  <li>
    <p>Hindi song titles and lyrics. I googled for dumps of Hindi song titles and lyrics. I found a few, but they were all transliterated using some transliteration standard which was not the same as our usual transliteration, so word tokenization would have given some unwanted words. After filtering on this basis, I was left with only one dump, which I added to the <code class="language-plaintext highlighter-rouge">hindi.txt</code>.</p>
  </li>
  <li>
    <p>Scraping the comic headlines and the panel text from <a href="http://www.amul.com/m/amul-hits">Amul comic panels</a>. Amul is one of the biggest milk suppliers in India, and almost daily they release a comic in the newspaper where they take a headline or some major event and turn it into a pun using their products. A lot of it is in Hinglish. Recently, I stumbled upon the page where they have all these captions and comics digitized. I scraped all the puns and ingested them into GB’s vocabulary.</p>
  </li>
</ul>

<p><em>Unsolved Problem 7</em>: Twitter is another source of Hinglish data, but to sift through all the tweets to look for Hinglish ones is drudgery. I thought of an approach where I get all the tweets of all the Indian comedians active on Twitter. They occasionally tweet in Hindi. Other than that, I also observed that when some new movie comes, many users tweet in Hindi using that movie’s hashtag. Getting all the tweets under those hashtags is another way of getting the data. I haven’t started working on this yet.</p>

<p><em>Unsolved Problem 8</em>: I didn’t get an exhaustive list of all the Hindi songs - the titles as well as the lyrics. There are some websites having this data. I need to code a scraper to acquire all this information. Since there are A LOT of songs to scrape, this scraper will take some time to code.</p>

<p><em>Unsolved Problem 9</em>: Let’s come back to transliteration. A good Hindi to English transliteration engine can also help us acquire more data. It’ll give us more Hindi data for spelling correction (and other NLP things). This is not very high on the priority list. A few ways of attacking the problem:</p>

<ul>
  <li>Build a similar dataset like the Alexa researchers from Wikipedia for the names and build a transliteration model on it.</li>
  <li>Instead of just using the person names, also take other concepts like fruits, vegetables, places and other item/topic pages, and then train some translation or sequence-to-sequence model on it.</li>
  <li>Create a list of words written in Devanagari from Hindi Wikipedia or other sources from the web and then for each word, using the Google’s deprecated API, get all the transliterations for that word. Then create some kind of scoring function to trim that list of transliterations and then transliterate the Hindi literature using these base transliterations.</li>
</ul>

<p><em>Unsolved Problem 10</em>: Scoring mechanism to reduce the number of transliterations. Google’s API or our own code or any other model, all would (and probably should) give a set of possible transliterations for a Hindi word. So we’ll need some scoring metric to only keep the most plausible ones among the possible transliterations, or to reduce the list to a top 1 or 2 suggestions. One approach, I have for this, is to build a Markov Chain from the current Hindi data and then determine the path score of each possible transliteration and then select the top 1 or 2. I have no idea how good of a scoring metric this is.</p>
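
<p>As a rough illustration of the Markov Chain idea, the sketch below trains a character-level chain on a handful of romanized words and ranks candidates by their average log-probability per transition. Everything here is an assumption: the toy word list stands in for the real Hindi data, add-one smoothing is just one simple choice, and whether this is a good metric is still an open question.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from collections import Counter

# Toy romanized Hindi words standing in for hindi.txt (an assumption).
TRAINING_WORDS = ["kaise", "kaisa", "hoga", "hogi", "kaam", "naam", "baaki", "puch"]

transitions = Counter()  # counts of (current character, next character)
outgoing = Counter()     # how often each character is followed by anything
for word in TRAINING_WORDS:
    padded = "^" + word + "$"  # ^ marks the start, $ marks the end
    for current, following in zip(padded, padded[1:]):
        transitions[(current, following)] += 1
        outgoing[current] += 1

ALPHABET_SIZE = 27  # 26 letters plus the end marker, used for smoothing


def path_score(word):
    """Average log-probability per transition, with add-one smoothing."""
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((transitions[pair] + 1) / (outgoing[pair[0]] + ALPHABET_SIZE))
        for pair in pairs
    )
    return total / len(pairs)


def top_transliterations(candidates, n=2):
    """Keep the n candidates whose character paths look most familiar."""
    return sorted(candidates, key=path_score, reverse=True)[:n]


print(top_transliterations(["kqxm", "kaem", "kaam"], n=1))
</code></pre></div></div>

<p>Averaging over the number of transitions keeps longer candidates from being penalised just for their length.</p>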

<h2 id="logging">Logging</h2>

<p>Up until August, there was no logging in GB, only text/JSON files maintaining the basic data required. Then I integrated SQLite with GB, designed the table structure and created the <code class="language-plaintext highlighter-rouge">CHATS</code> table. The schema has a lot of fields:</p>

<ul>
  <li>Message sent (string);</li>
  <li>Person who sent the message (string);</li>
  <li>Group name or the individual name if it’s a personal chat (string);</li>
  <li>Was it a command (0 or 1);</li>
  <li>Was this a reply from the bot (0 or 1);</li>
  <li>Was this message quoting another message (0 or 1);</li>
  <li>What’s the quoted message (string);</li>
  <li>Number of links in the message (integer);</li>
  <li>What are the links (string);</li>
  <li>How many gaaliya (abuses) were there (integer);</li>
  <li>What gaaliya were mentioned in the message (string);</li>
  <li>How many corrections did GB suggest (integer);</li>
  <li>What were those corrections (string);</li>
  <li>Vocabulary counts whenever <code class="language-plaintext highlighter-rouge">/new</code> is used - total words, new words added, total unique words (integers);</li>
  <li>Code switching positions - not being used currently (string);</li>
  <li>Raw json of the message sent by Telegram API (json string).</li>
</ul>

<p>Obviously, other fields will be added as new features will be introduced. For ex., I am not logging the feedback sent using the buttons on the spelling corrections made by the GB.</p>
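
<p>The schema listed above could be declared along these lines with Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module. The column names are guesses made for this sketch, and the <code class="language-plaintext highlighter-rouge">id</code> and <code class="language-plaintext highlighter-rouge">sent_at</code> columns are additions that the list does not mention.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

# Column names are hypothetical; id and sent_at are extra bookkeeping columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS CHATS (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sent_at TEXT DEFAULT CURRENT_TIMESTAMP,
    message TEXT,
    sender TEXT,
    chat_name TEXT,
    is_command INTEGER DEFAULT 0,
    is_bot_reply INTEGER DEFAULT 0,
    is_quote INTEGER DEFAULT 0,
    quoted_message TEXT,
    link_count INTEGER DEFAULT 0,
    links TEXT,
    gaali_count INTEGER DEFAULT 0,
    gaaliya TEXT,
    correction_count INTEGER DEFAULT 0,
    corrections TEXT,
    total_words INTEGER,
    new_words INTEGER,
    unique_words INTEGER,
    code_switch_positions TEXT,
    raw_json TEXT
)
"""


def open_log(path=":memory:"):
    """Open (or create) the chat log database and make sure CHATS exists."""
    connection = sqlite3.connect(path)
    connection.execute(SCHEMA)
    return connection
</code></pre></div></div>

<p>SQLite has no separate boolean type, which is why the yes/no fields are stored as 0 or 1 integers.</p>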

<p>Additionally, <code class="language-plaintext highlighter-rouge">CHATS</code> table is not even in the <a href="https://en.wikipedia.org/wiki/Database_normalization">first normal form</a> (or 1NF). The gaaliya, links, and corrections fields have pipe separated strings. To get the individual elements of the list, I have to split by pipe in code. Thus the atomicity property is not fulfilled. As the size will increase, I might think about normalizing it. Currently it’s not needed.</p>
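
<p>The splitting done in code is short. A minimal sketch, with hypothetical helper names, of turning a pipe separated field back into a list and tallying items per sender:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter


def split_field(value):
    """Turn a pipe separated column value into a list, ignoring empty entries."""
    return [item for item in (value or "").split("|") if item]


def tally(rows):
    """Count individual items across many (sender, pipe_string) rows."""
    counts = Counter()
    for sender, value in rows:
        counts.update((sender, item) for item in split_field(value))
    return counts


print(split_field("example.com/a|example.com/b"))
</code></pre></div></div>

<p>A normalized design would instead store one row per link, gaali or correction in a separate table that points back to the message.</p>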

<h2 id="weekly-stats">Weekly Stats</h2>

<p>I have added a few stats which are displayed weekly. There is one for each day. Logging of the chats made things straightforward to query for these stats.</p>

<h3 id="gaaliya">Gaaliya</h3>

<p>Every Monday, it lists the abuses each person gave during the last week. It also tags, using the @mention, the one with the highest count.</p>
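
<p>A weekly stat like this boils down to one grouped query over the log. The sketch below assumes hypothetical column names (<code class="language-plaintext highlighter-rouge">sender</code>, <code class="language-plaintext highlighter-rouge">gaali_count</code>, <code class="language-plaintext highlighter-rouge">sent_at</code>) on the <code class="language-plaintext highlighter-rouge">CHATS</code> table; the real names and the report format may differ.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

# Assumed columns: sender (text), gaali_count (integer), sent_at (UTC timestamp text).
QUERY = """
SELECT sender, SUM(gaali_count) AS total
FROM CHATS
WHERE julianday('now') - julianday(sent_at) BETWEEN 0 AND 7
GROUP BY sender
ORDER BY total DESC
"""


def weekly_gaali_counts(connection):
    """Return (sender, total) rows for the last seven days, highest first."""
    return connection.execute(QUERY).fetchall()


def weekly_report(connection):
    """Format the counts and tag the sender with the highest total."""
    rows = weekly_gaali_counts(connection)
    if not rows:
        return "No gaaliya this week."
    lines = [f"{sender}: {total}" for sender, total in rows]
    lines.append(f"Top of the list: @{rows[0][0]}")
    return "\n".join(lines)
</code></pre></div></div>

<p>The other weekly stats (links, corrections, commands, messages) would follow the same pattern with a different column in the <code class="language-plaintext highlighter-rouge">SUM</code> or a <code class="language-plaintext highlighter-rouge">COUNT</code>.</p>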

<p>When this gaali counting started, we were all very active in profanity (of course, me being the least active). Now, two of the three members of the group (2 friends and I) have almost completely stopped using such language in the group. We now use other hints and innuendos which everyone gets. For example, whenever we want to use some bad word, we use <code class="language-plaintext highlighter-rouge">/g</code> in its place, and everyone gets it. This usage is really amazing to me: a man-made construct is used naturally in our conversations and everyone understands the meaning associated with it. Anyway, the third guy, who still cusses a lot (although less than before), is more or less consistent with his abuse counts. I’d like to believe that this weekly notification helped us decrease our use of such language. :D</p>

<h3 id="wordcloud">Wordcloud</h3>

<p>Every Tuesday, we get a wordcloud from the messages of the last 4 weeks. This required a bit of work. I had to build a Hinglish stopword list to remove unwanted words from the wordcloud.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-10/wordcloud.jpeg" alt="wordcloud" /></p>

<h3 id="links">Links</h3>

<p>This lists, every Wednesday, the number of links shared by each person during the last week. I don’t know how this is useful. I just implemented it.</p>

<h3 id="corrections">Corrections</h3>

<p>This lists, every Thursday, the number of corrections GB made for each person in the last week. This might help in measuring whether GB helped reduce the mistakes in the group. Or, putting it another way, how effective GB is in helping us achieve a certain efficiency or improvement. Not that we care about this quantification, but I am just giving an example of what can be done.</p>

<h3 id="commands-1">Commands</h3>

<p>Every Friday, we get the number of commands given by each person to GB during the last week. According to these weekly stats, the use of commands has declined. Rarely does anyone interact with GB using commands. Mostly <code class="language-plaintext highlighter-rouge">/g</code> is used now.</p>

<p>I think, using commands ought to be like this. Commands are sort of prohibitive. You need to learn what each command does. You have to be specific with the arguments while using the command. It is close to the <em>texting culture</em> but does not feel like an integral part of the culture. For example, for wrong corrections, GB has both, <code class="language-plaintext highlighter-rouge">/new</code> command and the feedback buttons. But even amongst the three of us, only I used the <code class="language-plaintext highlighter-rouge">/new</code>. Whereas, the feedback buttons have somewhat blended with the normal flow of the conversation and we all use it whenever they pop up.</p>

<h3 id="quotes">Quotes</h3>

<p>Every Saturday, it lists how many times each person quoted another. You can see how it is done currently.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-10/quoting.png" alt="quotes" /></p>

<p>When I had thought about this, I was imagining a graph, like <a href="https://bl.ocks.org/mbostock/4062045">D3 force graph</a>. I am yet to work on this part.</p>

<h3 id="messages">Messages</h3>

<p>Every Sunday, we get the user wise counts of messages we have sent on the group in the last week. GB is also included in the list of users.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I have the basic infrastructure ready for the bot. I have a database which logs everything being spoken on the group. I have built a fairly decent spelling correction. There are many other improvements in the pipeline to be done. There are some interesting problems to solve. And, I think, this is just scratching the surface.</p>

<p>There are a lot more advanced things which I’ll be working on now: Sentiment Analysis, Entity Recognition, NLP, Deep Learning, Knowledge Graph and other interesting stuff. I made a cursory attempt at training a word representation model (Word2vec) on the Hindi data. It didn’t work that well. I also tried an LSTM to make the bot say something on its own. It also failed. I will give these another go after getting a better understanding.</p>

<p>While working on this project, I also delved into a more philosophical topic. <em>How can these chat bots be used effectively?</em> We started off with GB to get some useless features to make the group <em>fun</em>. But now, whenever I think of making it useful to us, I can’t think of anything amazing to implement. Sure, there is that command interface which we can use to make it do anything in our power to code. We can also ask it to monitor some processes running on a machine. We can make it track some events - conferences, hackathons, etc. We can also ask it to control some IoT devices. But are these functionalities making things more effective for us, or are they just replacements for the old/traditional ways? Having this bot available on our smartphones helps. But are we using this enhanced accessibility to our advantage?</p>

<p>I am also conflicted about how a bot should function: a command based interface where everything is predefined, or a more blended interface where it just works whenever it’s needed without being called explicitly? The latter seems more natural, but it strips control away from the users. The former seems prohibitive. Maybe having both - a command interface and a non-command interface - will feel better and blend better. I have mostly worked on the command interface till now. Going forward, I’ll experiment with some non-command based functionality.</p>

<p>I believe that an intelligent system can properly be made with the combination of Natural Language Processing (NLP) and Reinforcement Learning (RL). I have zero idea of how RL works, but I think, if I am able to improve a model using the feedback it receives, then I am employing RL. In normal RL systems there’s an objective function as a goal here we don’t have any such goal except working satisfactorily to get a positive feedback from the user. I’ll also focus on having such systems in place and iteratively making the models better in GB.</p>]]></content><author><name>Shivam Rana</name></author><category term="NLP" /><summary type="html"><![CDATA[In a Telegram group of three, a fourth member was added - GB - a Telegram bot. GB is the most useless, unavailing and noisy bot there is, but it’s fun. I started working on it because, I have always been interested in building chat bots, but never had the chance. Secondly, and this is a rather ambitious thinking, I want to build a NLP system able to understand and act on the natural language. Lastly, this was more of a realization when I started my work on GB, opportunity to build NLP tools and techniques for Hindi+English (or Hinglish). In the process, I’ll be working on both basic NLP and some advanced NLP techniques. And, I’ll try to implement everything from scratch.]]></summary></entry><entry><title type="html">New Job, New City</title><link href="https://trigonaminima.github.io/2018/09/new-job-new-city/" rel="alternate" type="text/html" title="New Job, New City" /><published>2018-09-16T00:00:00+00:00</published><updated>2018-09-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/09/new-job-new-city</id><content type="html" xml:base="https://trigonaminima.github.io/2018/09/new-job-new-city/"><![CDATA[<p>In my adult life, I have never lived outside Delhi for more than a month. I came here at the start of 6th Standard (or Grade). 
During my 12th Standard, I used to daydream about going to a college outside Delhi. I would live in a hostel and enjoy the broke but independent life of a hosteler. Little did I know that I would be doing my undergrad from a college in Delhi itself, as a day scholar. Then came the placement season. This time again, I thought I’d get to go outside my home city. And this time I wouldn’t be (too) broke either. Yet again, I got a job in Noida, very close to Delhi. Now, after working for 2 years in Delhi, I have finally moved to a different city - Bangalore - also known as <em>City of Gardens</em> and <em>City of Lakes</em> and, in the tech industry, the <em>Silicon Valley of India</em>. I have joined as a Data Scientist in <a href="https://en.wikipedia.org/wiki/Jio">Reliance Jio</a>.</p>

<p>I reached Bangalore in the evening of July 29th and stayed with a friend for the night. I had to join the very next day. Till the night of the 29th, I had no worries about this move. I was happy about this new independence (not that there were any restrictions at my home). I had this plan in my mind: move to the hotel the next day; I have 15 days to look for a permanent place, and 15 days are more than enough; it’ll be a breeze. The friend I was staying with also said that flat searching is easy (he had done it a month before). Install <a href="https://www.nestaway.com/">Nestaway</a> and you’ll be sorted (spoiler alert: Nestaway didn’t help).</p>

<p>Next day (30th Jul), normal on-boarding process followed. During the lunch, I met a product manager who also joined the same day. She was here in Bangalore for the past 12-13 years. Now talking to her, I realized I hardly have any idea where or how should I get a flat. What area to look for. Which are the good areas. Broker or no-broker. Sharing or non-sharing. What is “normal” house rent. House maintenance. What are typical advanced deposits. Fully furnished or semi or none. She had set the ball rolling.</p>

<p>After lunch, a teammate showed me the suitable areas around the office on Google Maps: the big no-no areas and the places which would be commute friendly. My hotel was very close to my office, around 1 km away. On my way to the hotel, I had two action points in my mind. First, try to find something within 1-2 km of the office. Second, if nothing is there, then move to the areas around the metro line. Since the office is pretty close to the metro station, I won’t be stuck in the Bangalore traffic during my daily commutes.</p>

<p>On 31st July, during the on-boarding presentation, I was shortlisting suitable flats on the 3-4 apps I had installed the day before. During lunch, another person who had joined with me suggested the Flat and Flatmates groups on Facebook. I had completely missed them till then. During the second half of the training, I looked up 2-3 such groups for Bangalore and sent joining requests for them. Training got over early and I went to check out some flats near my hotel. I needed to start somewhere, so I decided to check some flats from the apps.</p>

<p>This “living on my own” feeling being all new to me, made me feel excited. I was worried about not getting anything yet, but I was still excited. Anyway, I started exploring the nearby areas. I had 2 flats on my list that day. On my way to one of these, I decided to check the costs of renting one of the flats from the area I was passing from. Now this area is one of the expensive areas of Bangalore. So I knew that the costs will be high, but I still went, you know, in the name of “locality exploration”. Anyway, I went and asked the society guards, the costs of renting out a room here. The guy said, they don’t rent out rooms here. I was like, okay, how much for a whole flat? He said, ₹1,20,000. I, shocked now, confirmed whether this was the deposit amount. He denied and replied it was the monthly rent. I again confirmed if this was the annual rent. He again said no and added it was monthly. I noped the fuck out of that place and never tried it again. The first property’s owner asked me to come in the next morning. I don’t know what was up with the second property. I went to the location then called the owner, but the owner disclaimed the property.</p>

<p>That night, I was accepted to one of the flatmates group on FB and I started gleaning through the posts on my phone. I had formed a workflow - check the house details, if they are satisfactory, check its distance from the office. Then I woke up the next day with the phone still in my hand.</p>

<p>This day (1st August) was also a training day. Before going to the office, I went to the house which I couldn’t see the previous night. The locality and the house were alright, but the rent was a bust. The actual rent was double the amount listed on the app. I couldn’t afford that, even though this house was so close to the office. So, I went to the training disappointed and now really worried. During the training, I was frantically reading posts on the group and checking the distance of each location from the office. After 1-2 hours, I finally found one. It was very near the hotel I was staying at. I talked to one of the guys staying there and scheduled a meeting for later in the evening. We were done for the day in the first half itself. I slept for some time in my hotel room. From early evening, my flat hunting started.</p>

<p>I first walked to another flat shortlisted through the app. This locality was shit. Road leading to the shady street on which the house was located, was covered with muddy puddles (it had rained an hour before). The whole area was stinking. I saw the house from inside and it was worse than the outside. It was freshly painted still it didn’t feel like it was. The whole place gave me an unsettling feeling. I discussed the rent details with the owner, but it was too high for that stinking place. I gave the owner some BS about thinking it over and left. After this, I headed to a PG. I originally wanted a single room with 1(+) flatmates, but thinking, if this PG was satisfactory, then I’ll think about opting for it. Or else, I’ll keep it as a fallback option, if I am unable to get a flat till my 15 day limit at the hotel. So I went there. Now, I have never seen a PG, but I have heard descriptions from friends. So, I was neutral about PGs, but when I saw this one, I was completely grossed out. Here also, there was some foul odor which I couldn’t bear for long. Since, I wanted a single room, owner gave a very high price for it. It was equivalent to the flat I saw which was very near to the office. And this PG wasn’t even worth paying that much. I gave him my offer (basically, ₹2-3K lower than what he was ready to offer). He never called. :p</p>

<p>Later in the evening, I went to the flat which I had found through the Flatmates group. It was within walking distance from the office. The locality was also good. Flat was alright. Flatmates also seemed good. What was lacking was the furnishings. On the post, it was mentioned that the flat was furnished, but it was not. So, if I was going to move there, I’ll need to arrange for a bed, cupboard, table and a mattress. Other than that, everything was there. Monthly costs were going to be slightly higher than my budget, but it was okay. I was prepared for that. This was the first option which I wanted to opt for. I now finally had one good option.</p>

<p>Next day, Thursday, 2nd August, after work, I started reviewing other shortlisted options on the apps. On my way to one of the flats, I saw a church - Sacred Heart Church. After spending 5 minutes in the church, I went to the location of the flat. Damn that locality was filthy. I reached a place where there were 2 temples very close to each other. The flat was located behind one of the temples. This area was disgusting. I didn’t even bother to look at the house from inside. In hindsight, no wonder the rent was that low. After leaving the area, being overconfident, thinking that I can find my way to the hotel, I got lost somewhere near the hotel. In one such street, I saw a mosque. I was kind of tempted to visit it, but I was hungry, so I went to the hotel instead. Later in the hotel, while recollecting the experience of my “Bangalore Tour”, I realized that, today in one go, I saw a church, a temple and a mosque, all in the same area. This was fascinating to me.</p>

<p>On Friday, I was still checking the Flatmates group to get a better option. Out of two prospective flats, one was in Indiranagar and another in Koramangala. I pinged both the guys on Messenger and then continued with my search. I went to see the Koramangala flat during the evening. The flat and the flatmate were both good, but the commute time was a deal breaker. The guys in the Indiranagar flat were not home that night, so I went to see that flat the next day. This flat was good. It was fully furnished - beds, almirah, chairs. The room had plenty of space. The flat also had a terrace. This place was close to the Metro station, so my office commute by metro would be very short. This place was very close to the ideal. The rent was also within my range. So, I finalized this one. I shifted to this place the next weekend.</p>

<p>And, that’s it. This was my experience with flat hunting in Bangalore. One thing I observed during my search was that the Flatmates group had better options than all the apps. I tried 3-4 apps, but none gave me a satisfactory option, whereas the Flatmates group had a score of 3 out of 3. I found three houses and all three were good choices. I was actually wishing that Facebook provided some way of filtering the posts on the group.</p>

<p>One question also popped into my mind, Is there no good solution to find a satisfactory house to rent? Is this problem that hard? There are so many players in this segment still there was no good solution. And this is not just my experience. A few of my friends also found their flats in the similar manner.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[In my adult life, I have never lived outside Delhi for more than a month. I came here at the start of 6th Standard (or Grade). During my 12 Standard, I used to daydream about going to a college outside Delhi. I’ll live in a hostel and enjoy the broke but independent life of a hosteler. Little did I know that I’ll be doing my undergrad from a college in Delhi itself, as a day scholar. Then came the placement season. This time again, I thought I’ll get to go outside my home city. And, this time I wont be (too) broke either. Yet again, I got a job in Noida, very close to Delhi. Now, after working for 2 years in Delhi, I have finally moved to a different city - Bangalore - also known as City of Gardens and City of Lakes and in tech industry, the Silicon Valley of India. I have joined as a Data Scientist in Reliance Jio.]]></summary></entry><entry><title type="html">Hinglish and Transliteration</title><link href="https://trigonaminima.github.io/2018/06/hinglish-and-transliteration/" rel="alternate" type="text/html" title="Hinglish and Transliteration" /><published>2018-06-16T00:00:00+00:00</published><updated>2018-06-16T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/06/hinglish-and-transliteration</id><content type="html" xml:base="https://trigonaminima.github.io/2018/06/hinglish-and-transliteration/"><![CDATA[<p>My friends and I recently created a group on Telegram, and so, we thought we should make a bot for fun. With all the useless functionalities in the bot there is also spelling correction. 
I’ll discuss the spelling corrector in a separate post, but right now I want to talk about Hinglish and transliteration. We have mixed uses of Hindi and English in our chats. Sometimes, our whole sentences are in English. Sometimes, it’s in Hindi. Sometimes, it’s a mixture of both - English and Hindi. The difference was not exactly clear when I started working on this, so let’s discuss how they differ.</p>

<h3 id="what-is-hinglish">What is Hinglish?</h3>

<p><a href="https://en.wikipedia.org/wiki/Hinglish">Hinglish</a> is the <em>dialect</em> (if we can call it that) where the users mix Hindi and English vocabulary to create sentences. We have direct usage of words from both the languages. We also have direct translations of words from Hindi to English and vice versa. The following are replies from our group chats.</p>

<blockquote>
  <p>Hindi ke liye it makes no sense.</p>
</blockquote>

<p>Pure English: For Hindi, it makes no sense.</p>

<blockquote>
  <p>Bht kam words basically. Mostly names? koi normal convo ka word hai?</p>
</blockquote>

<p>Pure English: Very few words basically. Mostly names? Is there any word which we use in normal conversation.</p>

<blockquote>
  <p>I thought you meant ki main kaamchor hun.</p>
</blockquote>

<p>Pure English: I thought you meant that I am a slacker.</p>

<h3 id="what-is-transliteration">What is Transliteration?</h3>

<p><a href="https://en.wikipedia.org/wiki/Transliteration">Transliteration</a> is the mapping of the script (alphabets or characters) from one language to another. It’s equivalent to spelling a word from a language into another language. A few variations may arise, but overall it remains unambiguous. There is also a process called <a href="https://en.wikipedia.org/wiki/Transcription_(linguistics)">Transcription</a> where spoken language is mapped onto written symbols. Example transliterations from our chat-</p>

<blockquote>
  <p>Baakiyo/Baakiyon</p>
</blockquote>

<p>Pure English: Remaining<br />
Pure Hindi: बाकियों</p>

<blockquote>
  <p>Aajaunga/Ajaunga</p>
</blockquote>

<p>Pure English: I will come<br />
Pure Hindi: आ जाऊंगा</p>

<blockquote>
  <p>Puch</p>
</blockquote>

<p>Pure English: Ask<br />
Pure Hindi: पूछ</p>

<p>I think, in our usage (when spelling Hindi in English), we probably use a mixture of both transliteration and transcription. A linguist would be a better person to answer this, though.</p>
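<p>The variant spellings above (Baakiyo/Baakiyon, Aajaunga/Ajaunga) are exactly what makes spelling correction for Romanized Hindi tricky. As a rough illustration, and not the approach the bot actually uses, here is a minimal Python sketch that reduces such variants to a common key. The function name <code class="language-plaintext highlighter-rouge">romanagari_key</code> and both normalization rules are hypothetical, made up for this example.</p>

```python
import re


def romanagari_key(word: str) -> str:
    """Hypothetical helper: reduce a Romanized Hindi word to a rough matching key."""
    key = word.lower()
    # Collapse runs of the same letter, so "aa" and "a" compare equal.
    key = re.sub(r"(.)\1+", r"\1", key)
    # Drop a word-final "n", which often only marks nasalization ("baakiyon").
    key = re.sub(r"n$", "", key)
    return key


variants = ["Baakiyo", "Baakiyon", "Aajaunga", "Ajaunga"]
for word in variants:
    print(word, "->", romanagari_key(word))
# Baakiyo -> bakiyo
# Baakiyon -> bakiyo
# Aajaunga -> ajaunga
# Ajaunga -> ajaunga
```

<p>The rules are deliberately crude: dropping a final “n” would also turn “main” into “mai”, so a real corrector would need something more careful.</p>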

<p><a href="https://en.wikipedia.org/wiki/Devanagari_transliteration">Hindi transliteration</a> is also called Devanagari Romanization or, as slang, <em>Romanagari</em>. There are even standards for the transliteration of Devanagari to Roman script, which I found really cool. Until I read the wiki, I didn’t realize how big a field of study transliteration is.</p>

<p><br /></p>

<p>So, in Hinglish, we use a mixture of English and <em>transliterated</em> Hindi to communicate. And all of this digression occurred when I was trying to find some data for <em>Romanagarized</em> Hindi to be used for the spelling corrector.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[My friends and I recently created a group on Telegram, and so, we thought we should make a bot for fun. With all the useless functionalities in the bot there is also spelling correction. I’ll discuss the spelling corrector in a separate post, but right now I want to talk about Hinglish and transliteration. We have mixed uses of Hindi and English in our chats. Sometimes, our whole sentences are in English. Sometimes, it’s in Hindi. Sometimes, it’s a mixture of both - English and Hindi. The difference was not exactly clear when I started working on this, so let’s discuss how they differ.]]></summary></entry><entry><title type="html">Poverty and Competition</title><link href="https://trigonaminima.github.io/2018/06/poverty-competition/" rel="alternate" type="text/html" title="Poverty and Competition" /><published>2018-06-03T00:00:00+00:00</published><updated>2018-06-03T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/06/poverty-competition</id><content type="html" xml:base="https://trigonaminima.github.io/2018/06/poverty-competition/"><![CDATA[<ul id="markdown-toc">
  <li><a href="#the-jargon" id="markdown-toc-the-jargon">The Jargon</a></li>
  <li><a href="#the-research" id="markdown-toc-the-research">The Research</a>    <ul>
      <li><a href="#barriers-to-the-competition" id="markdown-toc-barriers-to-the-competition">Barriers to the Competition</a></li>
      <li><a href="#how-competition-helps" id="markdown-toc-how-competition-helps">How Competition Helps</a></li>
      <li><a href="#real-life-examples" id="markdown-toc-real-life-examples">Real life examples</a></li>
      <li><a href="#competition-assessment-framework-caf" id="markdown-toc-competition-assessment-framework-caf">Competition Assessment Framework (CAF)</a></li>
      <li><a href="#paper-closing" id="markdown-toc-paper-closing">Paper Closing</a></li>
    </ul>
  </li>
  <li><a href="#indian-context" id="markdown-toc-indian-context">Indian Context</a>    <ul>
      <li><a href="#introduction-to-the-act" id="markdown-toc-introduction-to-the-act">Introduction to the Act</a></li>
      <li><a href="#anti-competitive-agreements" id="markdown-toc-anti-competitive-agreements">Anti-competitive agreements</a></li>
      <li><a href="#abuse-of-dominant-position" id="markdown-toc-abuse-of-dominant-position">Abuse of dominant position</a></li>
      <li><a href="#combinations" id="markdown-toc-combinations">Combinations</a></li>
    </ul>
  </li>
  <li><a href="#measuring-the-effects" id="markdown-toc-measuring-the-effects">Measuring the Effects</a></li>
  <li><a href="#interesting-directions" id="markdown-toc-interesting-directions">Interesting Directions</a>    <ul>
      <li><a href="#glossary" id="markdown-toc-glossary">Glossary</a></li>
    </ul>
  </li>
</ul>

<p><br /></p>

<p>Having an interest in using data science for social good requires a background in domains outside my formal knowledge. I try to read about the problems that surround us and, in some cases, even control our lives. In many cases we are not even aware of them, don’t understand them, or think they do not affect us. Recently, one of my friends showed me an offline course in this exact direction. It has a long list of readings to get acquainted with the general themes where social good is needed and where data science can help. Here is the course page - <a href="http://act4d.iitd.ernet.in/act4d/index.php?option=com_content&amp;view=article&amp;id=36&amp;Itemid=44">SIL 802, Winter 2017: Data Science for Development</a>.</p>

<p>I will read this list of papers and try to summarize them here. The first in the series is the paper titled - <strong>Why is Competition Important for Growth and Poverty Reduction</strong> [<a href="https://www.oecd.org/investment/globalforum/40315399.pdf">PDF</a>].</p>

<h2 id="the-jargon">The Jargon</h2>

<p>Let’s discuss some commonly used words in the paper.</p>

<ol>
  <li>
    <p><strong>Competition</strong>: By competition, we mean business competition: the competition between firms to surpass each other in terms of profits, services, products, rates, and customers. There are many factors by which you can measure competitiveness, which we’ll discuss later.</p>
  </li>
  <li>
    <p><strong>Competition Law</strong>: Competition is so important for an economy and its markets that Competition Laws exist to keep competition fair and encouraging. They prevent the proliferation of harmful practices - unfair trade practices, anti-competitive behavior, etc.</p>
  </li>
  <li>
    <p><strong>Poverty</strong>: Here, in this post, poverty is contextual. With respect to big companies, early-stage start-ups are poor. With respect to the tax-paying population (mostly urban), non-tax payers (mostly rural) are poor. Then there is that unfortunate category of the population who can’t even afford basic human needs - food to feed themselves, clothes to protect themselves, and a house to sleep peacefully in at night. They are extremely poor.</p>
  </li>
  <li>
    <p><strong>Competition Policy</strong>: The government policy that governs competition in markets. A good competition policy makes the markets competitive and fair for all. A sound competition policy achieves better markets, better investor confidence, and also equal opportunities for SMEs (Small and Medium-sized Enterprises).</p>
  </li>
</ol>

<h2 id="the-research">The Research</h2>

<p>This paper talks about the effect of competition on poverty and growth in an economy, the barriers to competition, and how competition policy is connected with growth and poverty. The paper also presents a toolkit - <em>Competition Assessment Framework</em> - to tackle a challenging problem: identifying where competition is weak, and how to foster more effective competition to encourage economic growth and reduce poverty. This toolkit is meant to help policy makers in developing countries address this problem.</p>

<p>I summarize the points under two sections - barriers and how competition is helpful.</p>

<h3 id="barriers-to-the-competition">Barriers to the Competition</h3>

<p>Many factors influence the level of competition, and they are especially prevalent in developing countries.</p>

<ul>
  <li>Inappropriate government policies;</li>
  <li>Unnecessary market entry and exit barriers;</li>
  <li>Anti-competitive practices by the big firms;</li>
  <li>Markets dominated by big firms with close ties to government;</li>
  <li>Powerful entities blocking necessary reforms;</li>
  <li>Bid-rigging for government provided infrastructure and services;</li>
  <li>Lack of awareness on the part of the government about the ways competition is being harmed;</li>
  <li>Uncertainty, on the government’s part, about where barriers to competition exist.</li>
</ul>

<p>This list is not in any way exhaustive. All these factors diminish opportunities for growth and innovation.</p>

<h3 id="how-competition-helps">How Competition Helps</h3>

<ul>
  <li>Competition facilitates greater <strong>equality</strong> of opportunity by breaking down the barriers to fair competition;</li>
  <li>Effective competition creates more space for the entrepreneurs and SMEs to grow by <strong>reducing opportunities for corruption</strong>;</li>
  <li>Competitive public procurement <strong>increases the effectiveness of expenditure on publicly provided services</strong>, such as education and infrastructure;</li>
  <li>Competition gives companies <strong>continuous incentives</strong> to make their production and distribution more efficient, to adopt better technology, and to innovate;</li>
  <li>Effective competition <strong>enhances a country’s competitiveness</strong> (the ability of its firms to compete in export markets, or against imports in its home market).
Research has found that the existence of a competitive environment in domestic markets is one of the most significant factors promoting the international integration of nations’ industries;</li>
  <li>Effective markets that enable choice, encourage innovation and provide goods and services at the lowest possible prices lead to <strong>improving living standards of the poor</strong>;</li>
  <li><strong>Small entrepreneurs (including farmers)</strong> benefit if entry and exit barriers are low and if they can buy and sell at fair prices;</li>
  <li>Effective competition prevents practices like bid-rigging and helps the government provide more and better infrastructure and services from the allocated budget. Recipients of government-funded services, usually low-income families, gain further from this.</li>
</ul>

<h3 id="real-life-examples">Real life examples</h3>

<p>The paper discusses the competition in African and Asian countries. It points out the pain points, citing examples from various countries.</p>

<ul>
  <li>The 2005 Report of the Commission for Africa (CfA) pointed out that, in Africa, the “lack of competition in services, such as sea and air transport raise(s) costs significantly”, and suggested that, reforms such as maritime deregulation leading to a satisfactory level of competition “could reduce freight costs by 25-50 percent.”</li>
  <li>A database on media allegations of anti-competitive behaviour in Sub-Saharan Africa for the ten years to December 2004 revealed a wide range of competition concerns in the region. Allegations were frequently reported for everyday commodities such as sugar and flour. Other practices identified included those affecting the prices of inputs needed by manufacturers, and practices hurting farmers as buyers of inputs (e.g. fertilizers and animal feed) and as sellers of outputs (such as cotton, tea, coffee and tobacco).</li>
  <li>Vested interests can play a role in restricting competition. For example, when plans were underway in Egypt to develop a competition law, opposition to the plan was allegedly organized by a leading MP who owned a dominant steel mill.</li>
  <li>The strength of competition depends both on the conduct of firms and “the external environment in which they compete, the state of infrastructure, legal framework and the effectiveness of the financial system.” Barriers to competition are often the result of government regulations, and private sector firms frequently find the regulatory burden a major disincentive to doing business. Moreover, cumbersome regulation for starting a business is associated with lower productivity and higher levels of corruption. Such barriers tend to hit small firms the hardest.</li>
</ul>

<p>These are just a few of the many examples the authors cite. Others cover the price and quality of transport services, privatization of state-owned enterprises, and telecommunications in Zambia.</p>

<p>There are a few examples of the positive effects of competition as well. For instance, increasing competition among cell phone service providers in Africa benefited small farmers, who were able to make better decisions about where to market their produce.</p>

<h3 id="competition-assessment-framework-caf">Competition Assessment Framework (CAF)</h3>

<p>Each economy is different, and the factors affecting competition need to be dug out. The paper presents the framework to equip policy makers with operational tools to identify and assess the nature and impact of competition barriers.</p>

<p>The CAF suggests that valuable insights can be obtained on how competition policy can be applied in the interests of economic growth and poverty reduction, by assessing the state of competition in key sectors of the economy, and, where competition is weak, identifying the causes and possible remedies.</p>

<p>The CAF asks questions in 8 steps. A conclusion is drawn for each step, and together these help decide what action needs to be taken. The 8 steps are:</p>

<ol>
  <li>
    <p>How to select sectors and markets for assessment</p>

    <p>A sector should be important to the economy or to consumers, and there should be a plausible possibility of competition problems in it.</p>

    <p>The questions address the sector’s role in the economy, its importance for the consumers, evidence of concern about prices or availability of the sector’s products, the record of the sector’s past performance, entry barriers and the level of market concentration in the sector. Some major sectors are <strong>agriculture, construction, distribution, energy, finance, manufacturing, telecommunications and transport.</strong></p>
  </li>
  <li>
    <p>Identify the relevant markets and the competitors</p>

    <p>This section includes a set of questions designed to identify the relevant market or markets in the sector. A sector could contain a number of separate “markets” in the economic sense, and, as the state of competition might vary considerably between them, each must be considered separately.</p>
  </li>
  <li>
    <p>Examine the market structure</p>

    <p>The questions in this section outline how to assess the level of concentration in the market, that is, the market shares of the major participants. While high concentration does not necessarily indicate high market power, it is often a significant factor in controlling market behaviour.</p>
  </li>
  <li>
    <p>Look for barriers to entry</p>

    <p>This step seeks to establish whether there are any significant barriers to entry. The questions examine <strong>natural barriers, strategic barriers, regulatory and policy barriers and gender-based barriers</strong>.</p>
  </li>
  <li>
    <p>Ascertain if government policies or institutions limit competition</p>

    <p>This section reviews the legislation, policies and institutions of governments at all levels (national, state and local) that might adversely impact the level of competition in markets. These include licensing restrictions, FDI restrictions and trade barriers. Other points under review are whether state-owned enterprises receive any preferences that might restrict competition by the private sector, whether procurement practices harm competition by not being fair and transparent, the regulation of markets, trade policy and industrial policy, etc.</p>
  </li>
  <li>
    <p>Consider vested interests</p>

    <p>Vested interests may be personal, corporate or institutional. In many situations there will be stakeholders who either oppose or favour the increase of competition in a market. The questions in this section seek to identify the objectives, power and influence of these stakeholders.</p>
  </li>
  <li>
    <p><a id="point7"></a>Look for signs of anti-competitive conduct by firms</p>

    <p>Major points under focus are - <strong>abuse of dominance</strong> <sup>[<a href="#dominance">1</a>]</sup> (exploiting consumers and/or excluding competitors), <strong>collusion among competitors</strong> (cartels<sup>[<a href="#cartel">2</a>]</sup>; domestic cartels are more prevalent), and the possible impact of <strong>mergers and acquisitions</strong>. Mergers can harm competition if the purpose of the acquisition was only to eliminate the competitor as a separate organization.</p>
  </li>
  <li>
    <p>Draw conclusions</p>

    <p>Review the conclusions made from the previous steps and form an overall view on the current state of the competition in the country/region.</p>
  </li>
</ol>
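<p>Step 3 above talks about assessing concentration from the market shares of the major participants. The summary does not say which measure to use; one widely used option is the Herfindahl-Hirschman Index (HHI), the sum of the squared market shares. The sketch below is only an illustration under that assumption: the function name and the two example markets are hypothetical and do not come from the paper.</p>

```python
def herfindahl_hirschman_index(shares_percent):
    """Sum of squared market shares, with shares given in percent (0-100)."""
    total = sum(shares_percent)
    if abs(total - 100) > 1e-6:
        raise ValueError(f"market shares should add up to 100, got {total}")
    return sum(share ** 2 for share in shares_percent)


# Hypothetical markets, for illustration only.
concentrated_market = [70, 20, 10]   # one dominant firm
fragmented_market = [10] * 10        # ten equally sized firms

print(herfindahl_hirschman_index(concentrated_market))  # 5400
print(herfindahl_hirschman_index(fragmented_market))    # 1000
print(herfindahl_hirschman_index([100]))                # 10000, a pure monopoly
```

<p>A higher value means a more concentrated market. As the step itself cautions, high concentration alone does not prove high market power.</p>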

<h3 id="paper-closing">Paper Closing</h3>

<p>Everyone should understand the beneficial impact of effective competition and of competition policy on an economy. Competition can help encourage both domestic investment and FDI, because it bolsters investor confidence by setting a consistent framework within which the business sector operates. More effective competition reduces opportunities for corruption and rent seeking, and creates more space for entrepreneurs and SMEs (Small and Medium Enterprises).</p>

<p>Just having a good law is not enough. The introduction of a competition law needs appropriate steps:</p>

<ul>
  <li>Supporting policies;</li>
  <li>Effective enforcement;</li>
  <li>Governments must recognize adequately the impact of other legislation and regulations on competition;</li>
  <li>An open media and an informed judiciary;</li>
  <li>Politicians must be committed to wanting to make markets work well;</li>
  <li>Help build the technical capacity needed.</li>
</ul>

<p>To be fully effective, a competition policy must be supported by a <strong>culture of competition</strong>, where the objectives of competition are widely understood and form a natural part of the background to the decisions made by the government, firms and consumers. Civil society, and a vigorous consumer movement in particular, can play a constructive and valuable role in the development of a culture of competition.</p>

<h2 id="indian-context">Indian Context</h2>

<p>This is a short primer on the history of Competition Act/Law in India. This introduction is mostly derived from the <a href="https://en.wikipedia.org/wiki/The_Competition_Act,_2002">wiki page</a> and <a href="https://www.linkedin.com/pulse/20140821065102-73187306-competition-law-in-india-an-overview">this overview</a> of the Act. If someone wants to get the whole deal (61 pages), they can go to <a href="http://www.cci.gov.in/competition-act">Competition Commission of India website</a> [<a href="http://www.cci.gov.in/sites/default/files/cci_pdf/competitionact2012.pdf">PDF</a>].</p>

<h3 id="introduction-to-the-act">Introduction to the Act</h3>

<p>The earliest version of competition law was set up 2 decades after Indian independence. In 1969, the <strong>Monopolies and Restrictive Trade Practices Act</strong> (MRTP Act) was enacted after its introduction in the Parliament in 1967.</p>

<p>In 1999, three decades after the enactment of the act, the then Finance Minister, <a href="https://en.wikipedia.org/wiki/Yashwant_Sinha">Yashwant Sinha</a>, said in his budget speech:</p>

<blockquote>
  <p>The MRTP Act has become obsolete in certain areas in the light of international economic developments relating to competition laws. We need to shift our focus from curbing monopolies to promoting competition. Government has decided to appoint a Committee to examine this range of issues and propose a modern Competition Law suitable for our conditions.</p>
</blockquote>

<p>Following this, a draft Competition Law was prepared by the end of 2000, and it was passed in December 2002. The MRTP Act was repealed in 2009.</p>

<p><a href="https://en.wikipedia.org/wiki/Competition_Commission_of_India">Competition Commission of India</a> (CCI) is the statutory body that enforces The Competition Act, 2002, throughout India. It is its duty to eliminate practices having negative effects on competition in India. The CCI can be approached to report any unfair competition practices. <a id="cci_outside"></a>The Commission also has the power to inquire into events taking place outside India that have an adverse effect on competition in India<sup>[<a href="#cross_border">3</a>]</sup>. I found this bit really interesting: having authority over acts done outside India (there might be other laws with such power, but I have not studied them, so I don’t know). There have been some major cases (BCCI, Google, etc.) of violations of the Law (<a href="https://en.wikipedia.org/wiki/Competition_Commission_of_India#Notable_cases">listed here</a>).</p>

<p>There are three major elements in the Competition Act, 2002:</p>

<ol>
  <li>Anti-competitive agreements</li>
  <li>Abuse of dominant position</li>
  <li>Combinations</li>
</ol>

<h3 id="anti-competitive-agreements">Anti-competitive agreements</h3>

<p>Agreements between entities (enterprises, persons, cartels, etc.) with respect to production, supply, distribution, storage, acquisition or control of goods, or provision of services, which cause (or are likely to cause) an <strong>appreciable adverse effect</strong> on competition in India, are anti-competitive in nature. Such agreements are deemed void by the Act.</p>

<p>An agreement can happen by any arrangement - written, oral, formal, informal, or any other concerted action. Depending on the participating entities, there are two kinds of agreements:</p>

<ol>
  <li>Horizontal Agreements: Agreements between rivals or competitors. For example, cartelization.</li>
  <li>Vertical Agreements: Agreements between independent enterprises at different levels of the supply chain. For example, between producers and suppliers or between producers and distributors.</li>
</ol>

<p><a id="adverse"></a>Some examples of agreements that can adversely affect competition are those that:</p>

<ul>
  <li>Determine prices (sales or purchase), directly or indirectly;</li>
  <li>Limit or control production, supply, markets, technical development, investment or provision of services;</li>
  <li>Share the market or source of production or provision of services by allocation of, inter-alia<sup>[<a href="#interalia">4</a>]</sup>, geographical area of market, nature of goods or number of customers or any other similar way;</li>
  <li>Directly or indirectly result in bid rigging or collusive bidding.</li>
</ul>

<h3 id="abuse-of-dominant-position">Abuse of dominant position</h3>

<p>The Act prohibits any enterprise or group from abusing its dominant position (market power that lets it operate independently of its competitors, or affect its competitors, consumers or the relevant market in its favour). Dominance in itself is not looked down upon unless it’s abused. A dominant position is abused when the enterprise or group:</p>

<ul>
  <li>Imposes, directly or indirectly, unfair or discriminatory conditions in purchase or sale of goods;</li>
  <li>Limits or restricts production of goods or provision of services or technical or scientific development relating to goods or services;</li>
  <li>Creates a hindrance to the entry of new operators;</li>
  <li>etc.</li>
</ul>

<p>Dominance is assessed with respect to the relevant market, which the CCI determines by considering the <strong>product market</strong> or the <strong>relevant geographical market</strong>. There are some guiding points in the Act to determine the relevant product market and relevant geographical market.</p>

<ol>
  <li>
    <p>Relevant Product Market</p>

    <ul>
      <li>Physical characteristics or end-use of goods;</li>
      <li>The price of goods or services;</li>
      <li>Consumer preferences;</li>
      <li>Exclusion of in-house production;</li>
      <li>The existence of specialized producers;</li>
      <li>And the classification of industrial products.</li>
    </ul>
  </li>
  <li>
    <p>Relevant Geographical Market</p>

    <ul>
      <li>Regulatory barriers;</li>
      <li>Local specification requirements;</li>
      <li>National procurement policies;</li>
      <li>Adequate distribution facilities;</li>
      <li>Transport costs;</li>
      <li>Language;</li>
      <li>Consumer preferences;</li>
      <li>And need for secure or regular supplies or rapid after-sales services.</li>
    </ul>
  </li>
</ol>

<p>Even with the above points, it cannot be easy to determine the relevant market and, consequently, the dominant position.</p>

<h3 id="combinations">Combinations</h3>

<p><a id="combination"></a>Combinations is a term used collectively for acquisition<sup>[<a href="#acquisition">5</a>]</sup> of control, shares, voting rights and assets, and mergers<sup>[<a href="#merger">6</a>]</sup> and amalgamations<sup>[<a href="#amalgamation">7</a>]</sup>. The Act prohibits entities from entering into a combination which causes (or is likely to cause) an <strong>appreciable adverse effect</strong> on competition in India. Such combinations shall be void. Of course, the adverse effect is assessed in the relevant market determined by the CCI, as explained in the previous section.</p>

<h2 id="measuring-the-effects">Measuring the Effects</h2>

<p>So far, we have looked at the paper, in which the authors discuss the effect of competition on poverty and growth and propose a framework for forming an effective competition policy to improve the state of competition in a country. With respect to India, we broadly discussed how the Indian Competition Law is structured and what it covers.</p>

<p>After reading all this material, I wanted to see if better competition really means less poverty. I wanted some kind of way to measure it. I wanted to see it in the data. I wanted to profile India on competition and poverty.</p>

<p>According to the UN’s <a href="http://hdr.undp.org/en/composite/trends">Human Development Reports</a>, India’s Human Development Index (HDI) increased consistently from 0.428 in 1990 to 0.624 in 2015. India was placed in the medium bucket. A country scores a higher HDI when the lifespan is higher, the education level is higher, and the GDP per capita is higher. This indicates that development did happen, and this can be further drilled down. But, due to time constraints, I’ll end it here.</p>
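<p>To put the two HDI figures in perspective, here is a small Python sketch that turns them into an absolute change, a relative change and an average yearly growth rate. It only uses the 1990 and 2015 values quoted above; the variable names are made up, and the compound-growth formula is one reasonable way to average the change, not something the reports prescribe.</p>

```python
hdi_1990 = 0.428
hdi_2015 = 0.624
years = 2015 - 1990

absolute_change = hdi_2015 - hdi_1990
relative_change = absolute_change / hdi_1990
# Compound average growth per year over the 25-year window.
annual_growth = (hdi_2015 / hdi_1990) ** (1 / years) - 1

print(f"Absolute change: {absolute_change:.3f}")        # 0.196
print(f"Relative change: {relative_change:.1%}")        # 45.8%
print(f"Average annual growth: {annual_growth:.2%}")    # roughly 1.5% a year
```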

<p>I’ll need to get the census data across various verticals, prepare a common data model, find out a method to measure the level of poverty and competition, find a relationship between them both and then do hypothesis testing to see if the effects are not random. This is not a weekend project.</p>
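<p>To make the last step concrete, here is a minimal sketch of one possible hypothesis test, a permutation test on the correlation between a competition score and a poverty rate. Everything in it is an assumption for illustration: the function names, the choice of test, and above all the numbers, which are placeholders and not real measurements.</p>

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Share of random shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Placeholder values only, NOT real data: one entry per hypothetical region.
competition_score = [0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90]
poverty_rate = [0.48, 0.45, 0.40, 0.33, 0.35, 0.25, 0.22, 0.18]

r = pearson(competition_score, poverty_rate)
p = permutation_p_value(competition_score, poverty_rate)
print(f"correlation = {r:.2f}, permutation p-value = {p:.4f}")
```

<p>A small p-value would only say that the association is unlikely to be random. It would not show that competition causes the drop in poverty.</p>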

<h2 id="interesting-directions">Interesting Directions</h2>

<ul>
  <li>
    <p>While reading the material and looking up the terms on Google, I read about <a href="https://en.wikipedia.org/wiki/Planned_obsolescence">Planned Obsolescence</a>. Planned Obsolescence happens when a product is designed with an artificially limited useful life, so it’ll become useless after a certain period of time. The light bulb is a famous example of this concept on the internet: bulbs are produced to die sooner than their achievable lifetime. Then there are printer ink cartridges, where we need to buy a new one instead of getting the old one refilled. There are many such examples. I wonder if this can come under abuse of dominant position. Does planned obsolescence affect competition? If yes, how?</p>
  </li>
  <li>
    <p><a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibrium</a> is a game theory concept in which a stable outcome of a game is one where no player has an incentive to deviate from their chosen strategy after considering the opponents’ strategies. In other words, an individual can receive no incremental benefit from changing actions, assuming the other players keep their strategies constant. A game may have multiple Nash Equilibria, or none at all in pure strategies. How are Nash Equilibrium and competition connected? Can it be said that, if there is a Nash Equilibrium in the market, then the competition will be fair and effective? Or, more importantly, is it even that simple to apply Nash Equilibrium this way?</p>
  </li>
</ul>

<p>Note: The interpretation of the Law is my own (or derived from other articles on the web). I am not an authority on the subject.</p>

<p><br /></p>

<hr />

<p><br /></p>

<h3 id="glossary">Glossary</h3>

<p><a id="dominance"></a>1: Dominance is possible where a firm has strong market power that results from a high market share combined with barriers to entry. <a href="#point7">⤴</a></p>

<p><a id="cartel"></a>2: Now that’s a term I have only heard in movies, and that too in the context of drug suppliers. A Cartel is a group of producers (and hence, they have the power) whose goal is to increase their collective profits. They usually engage in controlling selling prices. The <a href="https://en.wikipedia.org/wiki/Cartel">short wiki page</a> is insightful. <a href="#point7">⤴</a></p>

<p><a id="cross_border"></a>3: To deal with cross border issues, the Commission is empowered to enter into any Memorandum of Understanding or arrangement with any foreign agency of any foreign country with the prior approval of the Central Government. <a href="#cci_outside">⤴</a></p>

<p><a id="interalia"></a>4: Inter-alia - A Latin term for “Among other things”. This is a term used in legal proceedings to provide one example out of many. <a href="#adverse">⤴</a></p>

<p><a id="acquisition"></a>5: When one entity purchases the business of another entity, it is known as Acquisition. The acquiring company is generally bigger in size than the acquired company. <a href="#combination">⤴</a></p>

<p><a id="merger"></a>6: A merger is the voluntary fusion of two or more companies to form a new company. Generally, the sizes of the participating companies are similar. <a href="#combination">⤴</a></p>

<p><a id="amalgamation"></a>7: Amalgamation is the combination of two or more companies into a new entity. An amalgamation is distinct from a merger because neither of the combining companies survives as a legal entity. Rather, a completely new entity is formed to house the combined assets and liabilities of the combining companies. <a href="#combination">⤴</a></p>]]></content><author><name>Shivam Rana</name></author><category term="Publication" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Chatting Up - Part II</title><link href="https://trigonaminima.github.io/2018/04/chatting-up-2/" rel="alternate" type="text/html" title="Chatting Up - Part II" /><published>2018-04-22T00:00:00+00:00</published><updated>2018-04-22T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/04/chatting-up-2</id><content type="html" xml:base="https://trigonaminima.github.io/2018/04/chatting-up-2/"><![CDATA[<p>In this post, I have analyzed my WhatsApp and Facebook chatting data. I have been emailing myself WhatsApp chats regularly, and Facebook chats were a part of the data dump which you can initiate from your account settings. Then I wrote some code to parse the data out and create one consolidated csv of chats.</p>

<p>The first part is <a href="/2016/06/chatting-up/">here</a>. It was largely derived from <a href="http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/">this amazing post</a> by Stephen Wolfram, creator of Wolfram Alpha and Mathematica. This present post is all me. I have been at it for 2-3 weeks now, mostly late nights after work.</p>

<p>I have Facebook records since 2009, but in the case of WhatsApp, they only start from 2014 (when I realized I should start saving these chats). Out of <u>2,242 days</u> between 2009 and 2018, I was active on only <u>1,935 days</u>. That’s <u>86% activity</u>. During these 1,935 days, the total number of texts sent or received is <u>299,764</u> (119,042 sent and 180,722 received). These ~300K replies contain a total of <u>1,815,134 words</u> (872,940 written by me, 942,194 written by my friends). To give you something to compare this number with - War and Peace has around 561,304 words, the Lord of the Rings series has around 828,045 words, and the Harry Potter series has around 1,084,625 words. I alone have written material equivalent to LOTR in quantity, let alone what my friends and I can do together. Jokes aside, while doing this, I realized we write A LOT in our daily lives without even thinking about it much. I am just analyzing chats here, but there are emails, blog posts, tweets, FB posts, SMS, code, and whatnot.</p>
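<p>The headline numbers above are easy to cross-check. Here is a small Python sketch that recomputes the activity percentage and the totals, and compares the word counts with the three book series. The figures are the ones quoted in the paragraph; the variable names are made up for this example.</p>

```python
total_days = 2242
active_days = 1935
sent, received = 119_042, 180_722
my_words, friends_words = 872_940, 942_194

activity = active_days / total_days
total_texts = sent + received
total_words = my_words + friends_words

print(f"Activity: {activity:.0%}")          # 86%
print(f"Total texts: {total_texts:,}")      # 299,764
print(f"Total words: {total_words:,}")      # 1,815,134

# Approximate word counts quoted in the post.
books = {
    "War and Peace": 561_304,
    "Lord of the Rings": 828_045,
    "Harry Potter": 1_084_625,
}
for title, words in books.items():
    print(f"{title}: mine {my_words / words:.2f}x, combined {total_words / words:.2f}x")
```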

<p>WhatsApp conversations happened only on my Android (no WhatsApp Web or emulators). In the case of Facebook, I switched between the Android Messenger app and the Facebook website during college. After graduation, the majority of the chats were through Messenger. Out of the 299,764 texts, <u>12.4% were on WhatsApp</u> and <u>87.6% through Facebook</u>.</p>

<p>Assuming that reading/writing all the texts within a minute took me that whole minute, I spent <u>102,511 minutes</u> (or 1,709 hours, or 71 days, or 2.4 months) of my life just chatting. Naturally, the actual time will be less than this, but this is the closest estimate we can get. During each minute of these 2.4 months of continuous conversations, <u>for every text I sent, I received two replies back</u>. Now, this 1:2 ratio suggests that I am not as active as the other person, but later in the post, I’ll give evidence which will suggest otherwise.</p>
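
<p>The estimate above can be sketched roughly as follows. This is a minimal illustration and not the code from the notebook: the sample timestamps, their format, and the helper names <code class="language-plaintext highlighter-rouge">active_minutes</code> and <code class="language-plaintext highlighter-rouge">describe</code> are assumptions, and a month is taken as 30 days.</p>

```python
from datetime import datetime

# Hypothetical sample of message timestamps; the real data is the consolidated CSV.
timestamps = [
    "2016-05-01 23:10:05",
    "2016-05-01 23:10:40",  # same minute as the previous text
    "2016-05-01 23:11:02",
    "2016-05-02 00:45:30",
]


def active_minutes(stamps):
    """Count distinct calendar minutes that contain at least one text."""
    minutes = set()
    for stamp in stamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        minutes.add(moment.replace(second=0))
    return len(minutes)


def describe(minutes):
    """Convert a minute count into hours, days and (assumed) 30-day months."""
    hours = minutes / 60
    days = hours / 24
    months = days / 30
    return round(hours), round(days), round(months, 1)


print(active_minutes(timestamps))  # 3 distinct minutes in the sample
print(describe(102511))            # (1709, 71, 2.4), the figures quoted in the post
```

<p>Counting distinct minutes rather than texts is what makes this an upper bound: a minute with one short text weighs the same as a minute with ten.</p>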

<p>Here’s the daily footprint of my chats.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/daily_timestamp.jpeg" alt="daily_timestamp" /></p>

<p>There are 3 highlighted regions - College Life and Professional Life (2 different jobs). Most of the college segment was discussed in the first part. Let’s discuss the end of college and the job life.</p>

<p>If you look towards the <u>end of the college (May 2016)</u>, there’s a huge surge in conversations. There were many reasons - final year college project discussions with my project partner, long chats with some friends, farewell, last day, saying goodbyes. This was also the time when I played some puerile truth-dare games in group chats. I participated in six groups during May 2016 (three in April and two in June). Another interesting thing is how, between the time college ended and the job started, the last reply of the night shifted from 4:00 AM to 2:00 AM.</p>

<p>It’s obvious that I was more active during college than I am now, but largely, the <u>active hours remain late night</u>. Throughout my professional life, there are gaps (mostly half a month long) during which there were no conversations at night - those were my “late night at work” days. Overall, my conversations during the day (relative to nights) have increased post college. This is apparent from my <u>day-to-night activity ratio bumping from 0.22 in 2016 to 0.53 in 2017</u>. My chats at night during professional life hardly go beyond 2:00 AM, owing to having to reach the office on time the next morning. Similarly, during that time, my chats during the day mostly start after 10:00 AM. This, again, is because I chat during the commute or after reaching the office. If you look carefully, the conversations are intermittent; that’s because I send a reply, then do some work, and then reply again if the other person has responded.</p>
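
<p>A day-to-night ratio like the one above could be computed along these lines. This is only a sketch: the post does not say where day ends and night begins, so the 08:00-19:59 window, the sample timestamps, and the function name <code class="language-plaintext highlighter-rouge">day_night_ratio</code> are all assumptions.</p>

```python
from datetime import datetime

# Assumption: "day" covers 08:00-19:59 and everything else counts as night.
DAY_HOURS = range(8, 20)


def day_night_ratio(stamps):
    """Return texts sent during the day divided by texts sent at night."""
    day = night = 0
    for stamp in stamps:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour
        if hour in DAY_HOURS:
            day += 1
        else:
            night += 1
    return day / night if night else float("inf")


# Hypothetical timestamps, not taken from the real chats.
sample = [
    "2017-03-01 10:15:00",
    "2017-03-01 23:40:00",
    "2017-03-02 00:05:00",
    "2017-03-02 14:30:00",
    "2017-03-02 22:10:00",
    "2017-03-03 01:20:00",
]
print(day_night_ratio(sample))  # 2 day texts / 4 night texts = 0.5
```

<p>Running it once per year would give one ratio per year to compare, the way 2016 and 2017 are compared above.</p>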

<p><img src="https://trigonaminima.github.io/assets/2018-04/hour_weekday.jpeg" alt="hour_weekday" /></p>

<p>The 1st plot shows the hourly count of texts. The part between 5:00 AM and 9:00 AM is pretty much zero. This plot specifically shows that I was <u>highly active from 8:00 PM to 2:00 AM</u>. There’s a rapid increase from 5:00 PM till midnight and then a rapid decrease after midnight. In the 2nd plot, my conversations are higher during the weekend, with a maximum on Sunday. This was expected.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/yearly.jpeg" alt="yearly" /></p>

<p>The years, <u>2015 and 2016 were the most active years</u> in terms of total traffic. Comparing 2015 and 2016, my friends talked more during 2015 than in 2016, whereas, I talked more during 2016 than in 2015.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/monthly.jpeg" alt="monthly" /></p>

<p>This plot shows some features which were not visible in the yearly plot. Firstly, there’s a <u>dip in the first month of each year</u>, except 2015. This is usually the winter break; internship or traveling are the likely reasons for these lows. There’s a huge peak during 2016 first half - the last semester of the college. The reasons for the peak were discussed previously.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/yearly_hourly.jpeg" alt="yearly_hourly" /></p>

<p>The above plot shows the hourly distribution of conversations for each year from 2012 onwards. Let’s start from 2012 (bottom left).</p>

<p>In <u>2012</u>, most of the conversations were after joining the college in July’12. The slope starts increasing after 3:00 PM. That’s because the college, for most days, got over around 1:00 PM or 2:00 PM. Most of these conversations are with Duffer (name masked for privacy :p). The majority of conversations were over before midnight. In <u>2013 to 2016</u>, however, things got flat during the day, even around 3:00 PM. The high traffic shifted to 8:00 PM and the majority of conversations shifted to later during the night (2:00 AM or 3:00 AM was the usual). The year <u>2016</u> is specifically high in text count during late night. Now comes the year <u>2017</u> (post college days; professional life). The overall traffic lessened during 2017 and a new peak formed between 10:00 AM and 3:00 PM. These were the short conversations with working friends or new work friends during the day. If you look carefully, my conversations during nights have decreased in 2017, although night is still the peak time. Things look different during 2018. Though the data covers only 3 months, the conversations were clearly low. Nighttime is not the peak anymore. And things have gone higher during the day. I wonder how the pattern will look at the end of 2018.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/cumulative.jpeg" alt="cumulative" /></p>

<p>The above two figures show count wise and size wise cumulative plots of the replies. The pattern is almost the same in both the figures, but still there’s a difference. The 1st plot shows that my friends sent me way more messages than I had sent them back. I am <em>almost</em> guilty of being a lazy ass from this plot. But! But! And, here is my proof, that I was not as lazy as it seems. The second figure shows the cumulative plot of reply sizes. If you look at it, the line of incoming (orange) and outgoing (green) are very close. The small gap which is still there, is because of the group chats, otherwise, the lines mostly overlap (I checked this). The conclusion is, <u>even if I reply less, I make up for it by writing long replies</u>.</p>
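
<p>The difference between the two cumulative curves can be reproduced on toy data. In this minimal sketch, the messages, the <code class="language-plaintext highlighter-rouge">in</code>/<code class="language-plaintext highlighter-rouge">out</code> labels, and the helper <code class="language-plaintext highlighter-rouge">cumulative</code> are made up for illustration, and reply size is assumed to mean character count.</p>

```python
from itertools import accumulate

# Hypothetical rows: (direction, text); "in" = received, "out" = sent.
messages = [
    ("in", "hey"),
    ("in", "you there?"),
    ("out", "yes, sorry, was stuck in a meeting all afternoon"),
    ("in", "ok"),
    ("out", "let us catch up tonight after dinner"),
]


def one(text):
    """Measure used for the count-wise curve: every text weighs 1."""
    return 1


def cumulative(rows, direction, measure):
    """Running total of measure(text) for one direction, one step per row."""
    steps = [measure(text) if who == direction else 0 for who, text in rows]
    return list(accumulate(steps))


# Count-wise: more texts come in than go out.
print(cumulative(messages, "in", one)[-1], cumulative(messages, "out", one)[-1])
# Size-wise (characters): the outgoing total catches up and overtakes.
print(cumulative(messages, "in", len)[-1], cumulative(messages, "out", len)[-1])
```

<p>Plotting the two running totals per direction gives the count-wise and size-wise figures; fewer but longer replies close the gap in the second one.</p>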

<p>Other than this, there are two major bumps - <u>mid 2013 (May'13)</u> and <u>mid 2016 (May'16)</u>. Mid 2013 is a month before the summer break and almost a year after I joined the college. Mid 2016 was the end of college and I have discussed it quite a lot during the previous few plots.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/monthly_avg_friends.jpeg" alt="monthly_avg_friends" /></p>

<p>The above plot is my monthly average of the friends I talked to. Observations prior to 2016 were discussed in the part 1 post, so here, I’ll talk about the period during and after 2016. You can see that, on average, I talked to six friends during May’16, the highest in 2016. This was the last month of college. Interestingly, six was also the highest during May’13. There were a total of 30 and 31 friends I talked to during May’13 and May’16, respectively. Out of eleven common names, only two were outside the college group.</p>

<p>Let’s discuss some friends wise analyses. Names are masked for privacy reasons. But the mentioned individuals will identify themselves from the masked names. So win-win, I guess.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_unique_days.jpeg" alt="top_n_unique_days" /></p>

<p>The above plot shows the top 20 friends arranged by the count of days on which we conversed. You can see, only three individuals cross the one year line and <u>only one person goes beyond two years</u>. Duffer tops many lists along with this one. I have talked to him quite a lot.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_conversations.jpeg" alt="top_n_conversations" /></p>

<p>This plot tries to quantify how many new conversations I had. A <u>new conversation</u> is taken as any message which was sent more than 8 hours after the previous text. There would be <em>very few</em> conversations which were continued after an 8 hr gap. However, there might be many new ones which started after a gap shorter than 8 hrs, but if I reduce the threshold, then I take the risk of getting more false positives. So, I stopped at 8 hrs as my threshold. One way to see if it worked is by comparing this plot and the previous one. Both have the same y-scale, that is, the number of days talked and the number of new conversations are approximately the same. This makes sense. If I talked to someone last night and I am barely active during the day, then the next night’s conversation will almost always come after an 8 hr gap. This gives us two new conversations on two different days.</p>

<p>One surprising thing present in the plot is, with most of the friends, my tendency of starting a conversation is quite low, with a minimum being 5%. This is probably the first learning from this analysis. I should initiate the conversations more. It is a relief, though, that I participate equally once the conversation starts.</p>
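
<p>The 8-hour rule and the share of conversations I start can be sketched as below. The chat rows, the timestamp format, and the function name <code class="language-plaintext highlighter-rouge">conversation_starters</code> are assumptions for illustration; only the 8-hour threshold comes from the analysis itself.</p>

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=8)  # threshold from the post

# Hypothetical chat with one friend: (timestamp, sender), sorted by time.
chat = [
    ("2016-05-01 22:00", "friend"),
    ("2016-05-01 22:01", "me"),
    ("2016-05-01 23:30", "friend"),
    ("2016-05-02 21:45", "me"),      # more than 8 hours later: new conversation
    ("2016-05-02 21:50", "friend"),
    ("2016-05-03 22:15", "friend"),  # new conversation again
]


def conversation_starters(rows, gap=GAP):
    """Return the sender of every text that opens a new conversation."""
    starters = []
    previous = None
    for stamp, sender in rows:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # The very first text, or one after a long silence, starts a conversation.
        if previous is None or moment - previous > gap:
            starters.append(sender)
        previous = moment
    return starters


starters = conversation_starters(chat)
my_share = starters.count("me") / len(starters)
print(len(starters), round(my_share, 2))  # 3 conversations, I started 0.33 of them
```

<p>Lowering <code class="language-plaintext highlighter-rouge">GAP</code> splits long evenings into several conversations, which is the false-positive risk mentioned above.</p>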

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_texts.jpeg" alt="top_n_texts" /></p>

<p>The above two figures again show the difference between my reply count and reply size when divided by top friends. Duffer is at the top. Many friends swap places between the plots, but more or less, they are the same. I am again saved by the 2nd plot.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_mean_text_length.jpeg" alt="top_n_mean_text_length" /></p>

<p>Here, for the top 20 friends by reply count, I have plotted the average reply size. The bars are arranged by the friend’s mean reply length. Yes! Here my average is better for almost every friend. So, this is the final evidence that I am not lousy at replying. My replies are lengthier and hence my reply count is low.</p>

<p>Here, Lawyer tops the list. Duffer went down to the 10th spot.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_yearly_length.jpeg" alt="top_n_yearly_length" /></p>

<p>This kind of plot is called a parallel plot. For the selected friends, I have plotted reply counts for each year. I wanted to see <u>with whom I was most active during each particular year</u>. Duffer was at the top from 2012 to 2015. In 2016, Lawyer beat everyone for the top spot. And in 2017 and 2018, AK came out on top.</p>

<p>Duffer and I used to talk a lot, but then we were kinda <em>out of topics</em> after three years, and we both got into different things. In 2016, Lawyer and I were in a relationship, so naturally, she topped the list. I didn’t talk much to others during that time, except AK. AK and I both got into Opera Solutions, so we had a lot of conversations post college, leading him to top the list in 2017. Duffer also made a comeback in 2017 by being 2nd.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_yearly_uniq_days.jpeg" alt="top_n_yearly_uniq_days" /></p>

<p>This is a parallel plot of the number of unique days per year on which I talked to each person. Duffer tops from 2012 to 2015, same as the last plot. Lawyer tops in 2016. In 2017 and 2018 though, Duffer replaces AK at the top. So, overall, Duffer tops in every year except 2016. I hope these analyses don’t look like I want to show where Duffer came first. This next plot should take care of that.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_yearly_avg_len_uniq_days.jpeg" alt="top_n_yearly_avg_len_uniq_days" /></p>

<p>This plot is interesting. With two specific friends - Lawyer and AK - the average text length per day is higher than with others, except in 2012. That’s probably because, with both Lawyer and AK, I have had long discussions about many interesting things. And they usually involved a to-and-fro of some long arguments. Most notable is the year 2014, where with Lawyer, the average length reached more than 7,000 characters. As can be gathered from the last plot, during the period prior to 2016, Lawyer and I talked on very few occasions and this probably inflated the average.</p>

<p>Another interesting thing is, even though, <u>I have talked with Duffer a lot, our average length is low every year</u>. This could indicate that we haven’t conversed using long sentences. This could also be, which is more likely, that because we have talked a lot (100+ separate days during most of the years), the average was brought down by a large number of small texts. We must have talked with long sentences on some days, but on majority of days our texts would be short and hence the average would also be low. Although, I suspect, this average would be close to the actual average.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_normalized.jpeg" alt="top_n_normalized" /></p>

<p>This is a parallel plot where each friend is ranked on various measures. There were far too many measures, so to avoid crowding the plot, I have limited it to 10 metrics. You can see how each friend does across the metrics. A parallel plot was the most relevant plot I could think of for this kind of relative ranking across multiple measures. Also, each metric was normalized to bring everything onto the same scale for the plot. So, if my reply rate and my friend’s reply rate (5th and 6th vertical lines in the plot) are not on the same level, that doesn’t mean we have a gap in our rates. Each one is just normalized according to the other values in its own column.</p>
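
<p>To see why two normalized columns cannot be compared directly, here is a minimal sketch. The post does not say which normalization was used, so min-max scaling per column is an assumption, and the metric names and numbers are invented.</p>

```python
# Hypothetical metrics per friend; names and numbers are made up.
metrics = {
    "Duffer": {"texts": 90000, "my_reply_rate": 0.8, "friend_reply_rate": 2.1},
    "AK":     {"texts": 40000, "my_reply_rate": 1.6, "friend_reply_rate": 1.5},
    "Lawyer": {"texts": 25000, "my_reply_rate": 1.2, "friend_reply_rate": 0.9},
}


def min_max_by_column(table):
    """Rescale every metric to the 0-1 range using only that metric's own values."""
    columns = next(iter(table.values())).keys()
    scaled = {name: {} for name in table}
    for column in columns:
        values = [row[column] for row in table.values()]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid dividing by zero for constant columns
        for name, row in table.items():
            scaled[name][column] = (row[column] - low) / span
    return scaled


scaled = min_max_by_column(metrics)
# AK's raw rates are close (1.6 vs 1.5), yet they land far apart after scaling
# because each column is stretched by its own minimum and maximum.
print(scaled["AK"])
```

<p>In this toy table AK’s two reply rates are almost equal, but one sits at the top of its axis and the other near the middle, which is exactly the effect described above.</p>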

<p>Let’s take the example of Duffer (dark blue line) to walk through the plot. We have high message counts. Messenger is our common mode of conversation. My reply rate is lowest for him (this is biased - since we have talked a lot, there will be more instances of delay between replies). His reply rate is high though. So, he is a more active responder than me. The average delay between our replies is okayish. Our total replies and mean text length per unique day are at the center (meaning they are closer to the average of all the values for that metric).</p>

<p>We can also get other interesting bits from the above plot. For Lawyer, the major mode of conversation is WhatsApp. My reply rate per minute is highest for PG. The average delay between replies is lowest for AK; we usually have latency-free conversations with each other. Total unique texts per day are the lowest for Vin2.</p>

<p>Below is the cumulative text count plot for the top 10 friends.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_cumulative.jpeg" alt="top_n_cumulative" /></p>

<p>Enough has been said about Duffer. Let’s discuss others. With AK, things started in 2015. I can’t seem to remember how it started, but it was on-off till 2016. From 2016 we picked up the pace. Then we also went to the same company, so things have been pretty active, such that he reached the 2nd place based on the text count.</p>

<p>Things plateaued from mid 2016 to early 2017 for Duffer, Nicks and PG, because they were preparing for an exam during that time. With Lawyer, the only bump is during 2016 when we were dating. Other than that, the plot is flat. With AMK, consistent chats happened till 2017, after which we haven’t talked.</p>

<p><img src="https://trigonaminima.github.io/assets/2018-04/top_n_heatmap.jpeg" alt="top_n_heatmap" /></p>

<p>This is the heatmap for the frequency of chats each day through the time scope. Note that, I have limited the colorbar till the frequency of 500. There are some instances where the frequency is higher than that.</p>

<p>All the previously discussed frequency patterns can be seen here easily. Duffer is the most green throughout. There are a few dark green lines, but mostly the frequency is towards the slightly dark green. AK’s green lines are towards the end, and there are many dark green fringes explaining how he reached the 2nd spot. Nicks’ is also green for the most part, but it is mostly light green and slightly dark green, because we usually just catch up and then end the conversation. Lawyer has a dark green segment only during 4-5 months of 2016. (I just realized that each of these horizontal bars looks like a sound pressure wave. I wonder what each friend would sound like.)</p>

<p><br /></p>

<p>One thing that I thought would be interesting was fitting a Poisson distribution to the data. The Poisson distribution describes the number of events occurring in a given time period. So, if your data follows that distribution, then you can find the probability of the event occurring a given number of times in the next period. My data was the number of new conversations I have within a day. So, I could find the probability of having <code class="language-plaintext highlighter-rouge">n</code> new conversations the next day. Sadly, the data didn’t follow the distribution. The goodness-of-fit test, which is used to check whether data follows a distribution, gave a very, very small p-value, rejecting the null hypothesis (the null hypothesis was that the data follows the distribution). :(</p>
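
<p>For the curious, a goodness-of-fit check of this kind can be sketched with the standard library alone. The post does not say which test was run, so the Pearson chi-square statistic here is an assumption, the daily counts are invented, and the function names are hypothetical. The sketch stops at the statistic and the degrees of freedom; turning them into a p-value needs a chi-square table or a statistics library.</p>

```python
import math
from collections import Counter

# Hypothetical daily counts of new conversations (not the real data).
daily_counts = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1, 4, 0, 1, 2, 1, 0, 1, 3, 2, 1]


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson variable with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def chi_square_vs_poisson(counts):
    """Pearson chi-square statistic of observed counts against a fitted Poisson.

    Assumes at least one non-zero count, so the fitted rate is positive.
    """
    n = len(counts)
    rate = sum(counts) / n          # maximum-likelihood estimate of the rate
    observed = Counter(counts)
    top = max(counts)
    statistic = 0.0
    for k in range(top):
        expected = n * poisson_pmf(k, rate)
        statistic += (observed[k] - expected) ** 2 / expected
    # Last bin collects "top or more" so the expected counts sum to n.
    tail = n * (1 - sum(poisson_pmf(k, rate) for k in range(top)))
    statistic += (observed[top] - tail) ** 2 / tail
    dof = (top + 1) - 1 - 1         # bins minus one, minus one fitted parameter
    return rate, statistic, dof


rate, statistic, dof = chi_square_vs_poisson(daily_counts)
print(rate, round(statistic, 2), dof)
```

<p>A large statistic relative to the chi-square distribution with <code class="language-plaintext highlighter-rouge">dof</code> degrees of freedom means a small p-value, i.e. the Poisson assumption is rejected. On real data, bins with very small expected counts are usually merged before computing the statistic.</p>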

<p>These were some of the analyses I did on the data. There is much more that can be done. Some analyses which were not included here can be found in the Jupyter Notebook in the <a href="https://github.com/TrigonaMinima/Chats">GitHub repository</a>. The data preparation and other related code is also there.</p>

<p>Another thing, which I haven’t delved into, is the text analysis. Some interesting analyses can be:</p>

<ul>
  <li>Finding out the proportion of content/non-content data in the text. Non-content data means words like “ok”, “okay”, “hmmm”, etc;</li>
  <li>Use of emojis;</li>
  <li>Conversation starters;</li>
  <li>Conversation endings;</li>
  <li>Web links shared;</li>
  <li>Change of conversation starter words with time (overall and friend wise);</li>
  <li>Topic modeling;</li>
  <li>Sentiments;</li>
  <li>Graph of connections - I have my side of information where I can make a graph of friends and the groups they were in with me.</li>
</ul>

<p>These are just from the top of my head. There would be many more which I haven’t even thought of. But this is all future work. How else will I create a part 3! Huh?</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><category term="Data-Analysis" /><summary type="html"><![CDATA[In this post, I have analyzed my WhatsApp and Facebook chatting data. I have been emailing myself WhatsApp chats regularly, and Facebook chats were a part of the data dump which you can initiate from your account settings. Then I wrote some code to parse the data out and create one consolidated csv of chats.]]></summary></entry><entry><title type="html">Anime I watched in 2017</title><link href="https://trigonaminima.github.io/2018/01/anime-watched-2017/" rel="alternate" type="text/html" title="Anime I watched in 2017" /><published>2018-01-27T00:00:00+00:00</published><updated>2018-01-27T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2018/01/anime-watched-2017</id><content type="html" xml:base="https://trigonaminima.github.io/2018/01/anime-watched-2017/"><![CDATA[<p>Animes are great in story, drama, and action. It is one of those visual arts where you get to see the beautiful artwork, engaging story telling, incredible imagination and intense development of characters. At each step, the audience will be kept up-to-date with the story; you’ll never feel lost, except may be, during the first few episodes.</p>

<p>Animes are sprinkled with psychology, mythology, science and what not, if you observe them carefully. They even contain life lessons. One thing I always do once I am done with an anime is go read its wiki page. There I find many interesting bits about the themes explored in the storyline. So yeah, if anyone thinks animes are for children, then he/she should try watching a few.</p>

<p>Here is the chronological list of animes I watched last year. I also give my rating along with it. While preparing this list, I went through the wikis of these animes again. It was fun to read them.</p>

<h2 id="fullmetal-alchemist-brotherhood-45-64-episodes"><a href="https://en.wikipedia.org/wiki/Fullmetal_Alchemist">Fullmetal Alchemist: Brotherhood</a> [4/5] [64 Episodes]</h2>

<p>This anime is pretty intense in emotions most of the time. The story is about two brothers (teenagers) in a world of alchemy. <a href="https://en.wikipedia.org/wiki/Alchemy">Alchemy</a> is this technique of creating anything by providing something of equal value, governed by the Law of Equivalent Exchange (a reference to the law of conservation of mass). Alchemists are forbidden from transmuting humans and gold. There is a <a name="pantheism2"></a>pantheistic entity<sup><a href="#pantheism1">[1]</a></sup> called Truth, which regulates the use of alchemy. Those who attempt human transmutation are sent to the Gate of Truth for punishment; as a result, they lose some part or the whole of their body. The two brothers try human transmutation to bring their mother back, which obviously is not successful. This marks the start of their journey to find a solution to the serious loss they suffered as punishment. This journey involves finding out about the Philosopher’s Stone (which enables one to perform alchemy by bypassing the Law of Equivalent Exchange), a government conspiracy and their father’s past.</p>

<h2 id="steinsgate-55-24-episodes"><a href="https://en.wikipedia.org/wiki/Steins;Gate_(anime)">Steins;Gate</a> [5/5] [24 Episodes]</h2>

<p>This anime will blow your mind. There are no themes here. It’s an amazing sci-fi story of a self-proclaimed “mad scientist” who, in his “Future Gadget Laboratory” along with two other members, somehow discovers a method of sending text messages back in time using a cellphone-operated oven. Yes, the setting is ridiculous, but the whole time loop shown is amazing. I loved how the series progressed into a very interesting plot with a spectrum of emotions.</p>

<h2 id="future-diary-35-26-episodes"><a href="https://en.wikipedia.org/wiki/Future_Diary">Future Diary</a> [3/5] [26 Episodes]</h2>

<p>This one is okay-ish. This again has no themes. It is just a pure battle royale between 12 players with diaries. All the players’ diaries have different abilities, but they are mostly related to seeing the future in some way, hence “Future Diary”. This series has Deus Ex Machina, the God of Space and Time. This series has an apocalypse. At times this series is really dark. This series even has a supercomputer. Other than that, it’s just okay.</p>

<h2 id="rick-and-morty-55-31-episodes"><a href="https://en.wikipedia.org/wiki/Rick_and_Morty">Rick and Morty</a> [5/5] [31 Episodes]</h2>

<p>This is an American animated series. It tells the story of a crazy scientist, Rick Sanchez, and his grandson, Morty Smith. Rick takes him on intergalactic/inter-dimensional missions and the series revolves around those adventures. Sometimes Morty’s sister also crashes the party. This series is hilarious. We have family drama. We have great sci-fi. We have darkness. The series addresses the point that we are not alone in the universe and that, compared to the universe, humans are so insignificant. Till now there have been 3 seasons. The 4th season is in the making.</p>

<h2 id="sword-art-online-sao-45-49-episodes"><a href="https://en.wikipedia.org/wiki/Sword_Art_Online">Sword Art Online (SAO)</a> [4/5] [49 Episodes]</h2>

<p>SAO focuses on Virtual Reality <a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game">MMORPG (Massively Multiplayer Online Role-Playing Game)</a>. Basically, online games where you “role play”. You have a virtual world where you interact with other players. During the first part of the series, players are trapped inside a game called Sword Art Online. If they die in-game, then they die in real life too. So, the 1st part is the story of how they got out of it. The fighting sequences are enjoyable and the imagination is exuberant.</p>

<h2 id="durara-25-26-episodes"><a href="https://en.wikipedia.org/wiki/Durarara!!">Durarara!!</a> [2/5] [26 Episodes]</h2>

<p>I dropped this one at the last 4-5 episodes. I completely lost interest in this one. It was just going on…</p>

<h2 id="ping-pong-55-11-episodes"><a href="https://en.wikipedia.org/wiki/Ping_Pong_(manga)">Ping Pong</a> [5/5] [11 Episodes]</h2>

<p>This is an 11-episode series where we see the journey of two childhood friends making it in table tennis (ping pong). With an incredible story and a diversity of emotions, combined with artwork very different from normal animes, this series was exceptional. The cherry on top is that the creators were able to do it in 11 episodes.</p>

<h2 id="fatestay-night-35-24-episodes"><a href="https://en.wikipedia.org/wiki/Fate/stay_night">Fate/stay night</a> [3/5] [24 Episodes]</h2>

<p>This is another battle royale themed series. Here we have a group of seven sorcerers, called masters, who are chosen by the Holy Grail. The Holy Grail also grants each one of those masters a servant, a reincarnation of a legendary hero from any era (both fictional and real). Servants are for the protection of their masters, as well as for killing other masters or servants. Whoever wins gets the Holy Grail, which can fulfill any wish. The story is about the 5th Holy Grail War, which started prematurely (40 years earlier than its time). This series had a lot of creative imagination and references from the real world. For instance, the Holy Grail, used here as an all-powerful entity which grants wishes, has a <a href="https://en.wikipedia.org/wiki/Holy_Grail">rich history</a> surrounding it. Similarly, each servant has a past which is again inspired by the real world. The characters shown are just great.</p>

<h2 id="fatezero-45-25-episodes"><a href="https://en.wikipedia.org/wiki/Fate/Zero">Fate/Zero</a> [4/5] [25 Episodes]</h2>

<p>This is the prequel to Fate/stay night. It ends where Fate/stay night picks up. It’s the 4th Holy Grail War, so we have different masters and different servants. Here we are told that the 3 earliest magi families are the ones who developed the Holy Grail Wars, and that all the past grail wars were inconclusive. The masters and servants are both very imaginatively developed. The animation is better, both visually and in quality. The story is great too.</p>

<h2 id="log-horizon-45-50-episodes"><a href="https://en.wikipedia.org/wiki/Log_Horizon">Log Horizon</a> [4/5] [50 Episodes]</h2>

<p>This is another MMORPG series where the players are trapped inside the game. It is based on a similar premise to SAO, but it’s still different. First, they didn’t get out of the game till the end. Second, an in-game death is not a real-life death. I found this series less “teenagery”. What I especially liked about this series is how everyone (almost; anti-social characters are always there) got together and started building a society complete with leaders, a central bank, guilds, a military and the whole deal.</p>

<h2 id="neon-genesis-evangelion-45-26-episodes"><a href="https://en.wikipedia.org/wiki/Neon_Genesis_Evangelion">Neon Genesis Evangelion</a> [4/5] [26 Episodes]</h2>

<p>This one is very different. This series reminded me of the movie <a href="https://en.wikipedia.org/wiki/No_Country_for_Old_Men_(film)">No Country for Old Men</a>. When I finished that movie, I was like, what just happened here. I didn’t understand it properly. Same was the case with this series. I finished this whole series, which was pretty dark at times, and in the end it left me with the question of what just happened there. Reading the wiki page kind of showed how deep in themes this anime is. It is littered with references to religion and to philosophical and psychoanalytical concepts. The story is around the experiences and emotions of the pilots of bio-engineered mecha called Evangelions. These Evangelions fight the Angels (a race of giant monstrous beings) to prevent the destruction of Earth.</p>

<p>Apparently, when the Japanese anime industry was going through a slump period this anime was very inspiring to them. It brought new insights into the animation, specifically, mecha genre. It brought innovation.</p>

<h2 id="no-game-no-life-35-12-episodes"><a href="https://en.wikipedia.org/wiki/No_Game_No_Life">No Game No Life</a> [3/5] [12 Episodes]</h2>

<p>The series features a brother-sister duo, known as Blank, killing it in every game they play. Then one day, they are transported to a world where every decision gets made by playing a game, any game of their choice. Humans in this world are the lowest in the hierarchy. The duo represent humans and take it upon themselves to restore humanity’s previous glory, and then go on to beat the god of this world. This anime only has one season, whereas its manga version is quite long.</p>

<h2 id="the-devil-is-a-part-timer-35-13-episodes"><a href="https://en.wikipedia.org/wiki/The_Devil_Is_a_Part-Timer!">The Devil Is a Part-Timer!</a> [3/5] [13 Episodes]</h2>

<p>A story about Demon Lord Satan Jacob coming to Earth after fleeing from the hero Emilia Justina. Here, due to a lack of powers, Satan is forced to work in a fast food restaurant named MgRonald (McDonald’s watch out). Then Satan and the hero Emilia meet on Earth and the series follows the story after that. This is a comedy series. There’s nothing more to it. Also, not all the volumes have been made into anime.</p>

<h2 id="samurai-champloo-35-26-episodes"><a href="https://en.wikipedia.org/wiki/Samurai_Champloo">Samurai Champloo</a> [3/5] [26 Episodes]</h2>

<p>If you have watched <a href="https://en.wikipedia.org/wiki/Cowboy_Bebop">Cowboy Bebop</a> then you might like this. Samurai Champloo was created by the creator of Cowboy Bebop. The series is quite artistic. Its story uses references to Japanese history in a modified form.</p>

<h2 id="hunter-x-hunter-45-148-episodes"><a href="https://en.wikipedia.org/wiki/Hunter_%C3%97_Hunter">Hunter x Hunter</a> [4/5] [148 Episodes]</h2>

<p>This was the last and the longest anime that I watched in 2017. The plot is quite long and complex to explain. It’s better to read the wiki page. I won’t be able to do justice to it in a few sentences. I was skeptical at first about this series, but as the story developed, the characters got deeper and more involved, the animation got better, and the fighting sequences were beautifully shown. It’s an amazing series. If you are going to watch it, watch the 2011 version.</p>

<p><br /></p>

<p>Fans of <a href="https://scifi.stackexchange.com/q/39283">Japanese Manga</a> might call me a fake based on the idea that mangas are better than anime. There are plenty of memes floating around on that. Maybe they are better, but there are A LOT of mangas out there, not to mention numerous volumes (as versions, sequels and prequels) of each. On the other hand, non-anime watchers might call me a <a href="https://www.urbandictionary.com/define.php?term=weeb">weeb</a> (I very recently came to know about this word). Alas, they don’t know the meaning of the word and just use it without thinking about it. I can’t blame them though; there are far more frequently used words which are inaccurate, misleading, misused, ambiguous, or logically confused, as <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4522609/">this study</a> lists.</p>

<p><br /></p>

<p><a name="pantheism1"></a>1: <a href="https://en.wikipedia.org/wiki/Pantheism">Pantheism</a> is this belief that everything makes the god. Nature is god. We are god. The concept of pantheism belongs to theology as well as Philosophy. Famous individuals like Carl Sagan and Einstein are considered to be the followers of pantheism.<a href="#pantheism2">⤴</a></p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[Animes are great in story, drama, and action. It is one of those visual arts where you get to see the beautiful artwork, engaging story telling, incredible imagination and intense development of characters. At each step, the audience will be kept up-to-date with the story; you’ll never feel lost, except may be, during the first few episodes.]]></summary></entry><entry><title type="html">Benford’s Law as a Fraud Detection technique</title><link href="https://trigonaminima.github.io/2017/07/benfords-law-fraud-detection/" rel="alternate" type="text/html" title="Benford’s Law as a Fraud Detection technique" /><published>2017-07-29T00:00:00+00:00</published><updated>2017-07-29T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2017/07/benfords-law-fraud-detection</id><content type="html" xml:base="https://trigonaminima.github.io/2017/07/benfords-law-fraud-detection/"><![CDATA[<p>Recently, I picked up a handy technique for narrowing down the dataset to look for fraud anomalies - <a href="https://en.wikipedia.org/wiki/Benford's_law">Benford’s Law</a>. It basically states that, <strong>in a collection of naturally occurring numerical values, the frequency of the first digit follows a logarithmically decreasing trend as the digit goes from 1 to 9</strong>.</p>

<p>What? In simple terms, in a list of numbers, if we determine the counts of all the first digits - 1 to 9, then we should see rapidly decreasing counts from 1 to 9.</p>
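<p>To put numbers on that, here is a minimal sketch of the expected shares. It assumes the usual statement of the law, where digit d leads with probability log10(1 + 1/d); that formula is not spelled out in this post, and the function name is just for illustration.</p>

```python
import math

def benford_first_digit_probs():
    """Expected share of each leading digit d under Benford's law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    # Shares shrink steadily from digit 1 down to digit 9.
    print(digit, round(p, 3))
```

<p>Under this formula, roughly 30% of the values should start with 1 and fewer than 5% with 9.</p>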

<p>How, you ask, do we know it follows such a pattern? Well, because of some careful observation by a curiosity-driven scientist, followed by experimentation and putting it into mathematical form - in short, SCIENCE.</p>

<p>Maybe some examples will make it clearer. Here you go.</p>

<ol>
  <li><strong>Food prices</strong></li>
</ol>

<p>I got to know about this data from a mailing list I follow - <a href="https://tinyletter.com/data-is-plural">Data is Plural by Jeremy Singer-Vine</a>.</p>

<p>The UN World Food Programme’s <a href="http://vam.wfp.org/">vulnerability analysis group</a> collects and publishes <a href="https://data.humdata.org/dataset/wfp-food-prices">food price data for more than 1,000 towns and cities in more than 70 countries</a>. The dataset, which goes back more than a decade, covers basic staples, such as wheat, rice, milk, oil, and more. It’s updated monthly and feeds into (among other things) the <a href="http://foodprices.vam.wfp.org/ALPS-at-a-glance.aspx">UNWFP’s price-spike indicators</a>.</p>

<p>Here’s the Benford plot I got from the data. There were 743,914 records in total. To see a lot of other exploration around the data along with the Benford’s, you can look at this jupyter notebook - <a href="https://trigonaminima.github.io/Notebooks/2017/07/30/Benford-Food-Prices/">Benford’s Analysis: Global Food Prices</a>.</p>

<p><img src="https://trigonaminima.github.io/assets/2017-07/bnfd-1.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>

<ol start="2">
  <li><strong>State Elections India</strong></li>
</ol>

<p>Again through Data is Plural mailing list. Five states in India, representing nearly 250 million residents — Punjab, Uttar Pradesh, Uttarakhand, Goa, and Manipur - have already held legislative assembly elections this year. India’s Election Commission publishes these results, but only as webpages. A couple of Hyderabad-based developers have scraped the website, and published CSVs of the <a href="https://github.com/Vizbi/state-elections">data on GitHub</a>.</p>

<p><img src="https://trigonaminima.github.io/assets/2017-07/bnfd-2.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>

<p>Above is the Benford plot for the number of votes per constituency. There were 7,849 records, covering the various constituencies of the 5 states for each political party. As you can see, it follows the Benford’s trend. For more exploration of the data, you can have a look at this notebook - <a href="https://trigonaminima.github.io/Notebooks/2017/07/31/Benford-State-Elections-India-2016/">Benford’s Analysis: State Elections India</a>.</p>

<p>Okay, so it does work. Is it that simple? Sure it is, but there are some conditions. These conditions are rules of thumb: they don’t always hold, but most of the time they do.</p>

<ol>
  <li>
    <p>The more orders of magnitude the data evenly covers the more accurate the law will be. (uniform distribution; orders of magnitude)</p>

    <p>Some real world distributions that follow this rule are - population of villages, stock prices, food prices (we discussed this example above). Some others that won’t follow the rule would be - heights of human adults, IQ scores, temperature measurements (we will discuss this example later in the post).</p>
  </li>
  <li>
    <p>When the data has right skew. (mean &gt; median; long right tail)</p>

    <p>Basically, when there is a long right tail in the distribution. This happened in the above 2 examples. (Check out the notebook links. Why do you think they were there?)</p>
  </li>
  <li>
    <p>Numbers resulting from mathematical combinations.</p>

    <p>Eg. price x quantity</p>
  </li>
  <li>
    <p>Transactional level data.</p>

    <p>Eg. Sales, reimbursements, spendings</p>
  </li>
  <li>
    <p>Data has no pre-defined minimum and maximum</p>
  </li>
  <li>
    <p>Normal distributions don’t follow the Benford’s law.</p>

    <p>We’ll see this demonstration below.</p>
  </li>
  <li>
    <p>Sequential quantities won’t follow the Benford’s law.</p>
  </li>
</ol>
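<p>The first and sixth conditions are easy to check on made-up data. The sketch below is only an illustration: both samples are simulated (they are not the datasets used in this post), and the helper names are hypothetical. One sample is spread evenly on a log scale over six orders of magnitude, the other is a narrow, roughly normal set of “adult heights”.</p>

```python
import math
import random
from collections import Counter

def first_digit(x):
    """Leading non-zero digit of a non-zero number, read off scientific notation."""
    return int(f"{abs(x):e}"[0])

def first_digit_shares(values):
    """Share of values starting with each digit 1..9."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated values spread evenly (on a log scale) from 1 to 1,000,000.
wide = [10 ** rng.uniform(0, 6) for _ in range(20000)]
# Simulated heights in cm: narrow range, roughly normal.
heights = [rng.gauss(170, 10) for _ in range(20000)]

wide_shares = first_digit_shares(wide)
height_shares = first_digit_shares(heights)

print("wide   :", {d: round(s, 3) for d, s in wide_shares.items()})
print("heights:", {d: round(s, 3) for d, s in height_shares.items()})
```

<p>The wide sample should land close to the Benford shares, while nearly every simulated height starts with 1, so the second sample cannot follow the law no matter how honest the data is.</p>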

<p>Any example of where it doesn’t work? Here you go.</p>

<p><strong>Temperatures</strong></p>

<p>I can’t find the link for the dataset used. It was an open dataset I had downloaded some 7-8 months back. I had downloaded just the India category. It consists of the temperatures from 1700’s to present (with a lot of NaNs for earlier periods). Here’s the notebook - <a href="https://trigonaminima.github.io/Notebooks/2017/07/30/Benford-Climate/">Benford’s Analysis: Climate</a> - which has a lot of details around the Benford’s analysis. Here are the final plots.</p>

<p><img src="https://trigonaminima.github.io/assets/2017-07/bnfd-3a.png" style="display: block;margin-left: auto;margin-right: auto;" />
<img src="https://trigonaminima.github.io/assets/2017-07/bnfd-3b.png" style="display: block;margin-left: auto;margin-right: auto;" />
<img src="https://trigonaminima.github.io/assets/2017-07/bnfd-3c.png" style="display: block;margin-left: auto;margin-right: auto;" />
<img src="https://trigonaminima.github.io/assets/2017-07/bnfd-3d.png" style="display: block;margin-left: auto;margin-right: auto;" />
<img src="https://trigonaminima.github.io/assets/2017-07/bnfd-3e.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>

<p>As you can see, none of the plots properly follow the Benford’s law. The frequencies should have a decreasing trend as we move from 1 to 9. The data has a predefined low and high. The last two plots were almost normally distributed. A lot of “nays” were there. The <a href="https://trigonaminima.github.io/Notebooks/2017/07/30/Benford-Climate/">notebook</a> dives deeper into the analysis.</p>

<p>Now, we have come to a million dollar question. Like literally, it can help us save millions of dollars (or to discover that they are lost?).</p>

<p><strong>How does the Benford’s Analysis help with fraud?</strong></p>

<p>Assuming the data should follow the Benford’s law, there can be 2 kinds of reasons why it won’t follow it - legitimate and fraudulent. Obviously.</p>

<p>Legitimate reasons can be - merging of low-figure amounts or something (service, product) that has to be paid frequently and has a fixed rate/price. These cases will throw off the Benford’s trend.</p>

<p>Fraudulent reasons are what we are interested in. The idea is that a fraudster, to maximize his/her gains, would put in larger values (starting with 8 or 9). Or, if a person is using values starting with 1, they usually keep them rounded. That is why Benford’s law can also be applied to the 2nd digit. It’s been generalized to any number of digits, but the second digit is mostly enough. And, if this has been done a lot of times, then the frequencies won’t be in line with Benford’s. So, using Benford’s law doesn’t exactly give us anomalies; it just narrows down the dataset to something more manageable.</p>
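<p>For the second-digit check, the expected shares can be derived from the same logarithmic rule by summing over every possible first digit. This is a sketch of that standard generalization, not something taken from the notebooks linked in this post; the function name is made up.</p>

```python
import math

def benford_second_digit_probs():
    """P(second digit = d2), summed over every possible first digit 1..9."""
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

second = benford_second_digit_probs()
for digit, p in second.items():
    print(digit, round(p, 4))
```

<p>The second-digit curve is much flatter than the first-digit one (about 12% for 0 down to about 8.5% for 9), so a pile-up of rounded values ending in 0 shows up against an almost even baseline.</p>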

<h2 id="conclusion">Conclusion</h2>

<p>Benford’s law is a great technique to get a subset where you are likely to find some fraudulent transactions. It’s not guaranteed to work every time, but it’s really inexpensive to perform. If the trend seems to be way off from the ideal one, then you have something to look into; else, just move on.</p>
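<p>One hedged way to turn “way off from the ideal one” into a number is the mean absolute deviation between the observed and expected first-digit shares. The score, the function name and the toy amounts below are my own choices for illustration; what counts as “too far” is something you would have to calibrate on your own data.</p>

```python
import math
from collections import Counter

def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit shares."""
    digits = [int(f"{abs(a):e}"[0]) for a in amounts if a != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9

# Made-up amounts: powers of 2 track Benford closely, while a batch
# padded with values just under 1000 should stand out.
natural = [2 ** k for k in range(1, 200)]
padded = natural[:100] + [900 + k for k in range(99)]

print("natural:", round(benford_mad(natural), 4))
print("padded :", round(benford_mad(padded), 4))
```

<p>A low score only says the first digits look plausible; a high score is a reason to open that subset and look closer, not proof of fraud.</p>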

<p>PS. Benford’s Law is often described as a special case of Zipf’s Law. There is a very cool video from vsauce - <a href="https://www.youtube.com/watch?v=fCn8zs912OE">The Zipf Mystery</a>.</p>]]></content><author><name>Shivam Rana</name></author><category term="Data-Analysis" /><category term="Fraud" /><summary type="html"><![CDATA[Recently, I picked up a handy technique for narrowing down the dataset to look for fraud anomalies - Benford’s Law. It basically states that, in a collection of naturally occurring numerical values, the frequency of the first digit follows a logarithmically decreasing trend as the digit goes from 1 to 9.]]></summary></entry><entry><title type="html">From Vector Space Models to Recommender Systems</title><link href="https://trigonaminima.github.io/2016/11/vsm-to-rec-sys/" rel="alternate" type="text/html" title="From Vector Space Models to Recommender Systems" /><published>2016-11-02T00:00:00+00:00</published><updated>2016-11-02T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2016/11/vsm-to-rec-sys</id><content type="html" xml:base="https://trigonaminima.github.io/2016/11/vsm-to-rec-sys/"><![CDATA[<h2 id="vector-space-models-vsm">Vector Space Models (VSM)</h2>

<h3 id="what-is-it">What is it?</h3>

<p>A VSM is a way to represent a document as a <em>vector in an n-dimensional space</em>, where “n” is the size of the vocabulary of <em>terms</em> present in the set of documents that we are trying to represent. These <em>terms</em> can be individual words in the documents, some keywords that we want to focus on, or longer phrases; the choice largely depends on the problem at hand. If you are familiar with basic 3D vector mathematics (the normal 3D space, in non-fancy words, that we interact with daily), then a VSM can be thought of as a point or a directed line, emanating from the origin, for each document in an n-dimensional space. Thus, each document will be embedded in an n-D space as a directed line or a point.</p>

<p>Let’s say we want to make the vectors from individual words. To keep the visualization easy, I am fixing my vocabulary to only 3 words with my documents as follows (btw, <em>The Boxer Rebellion</em> is a band’s name).</p>

<ul>
  <li><strong>Document 1</strong>: The boxer rebellion</li>
  <li><strong>Document 2</strong>: The boxer</li>
  <li><strong>Document 3</strong>: The rebellion</li>
</ul>

<p>Let’s assume our 3D axes (x, y and z) are called “rebellion”, “the”, “boxer” respectively. So, our documents can be represented in vector form as follows:</p>

<ul>
  <li>Document 1: [1, 1, 1]</li>
  <li>Document 2: [0, 1, 1]</li>
  <li>Document 3: [1, 1, 0]</li>
</ul>

<p>Here, the value 1 means that the word is present in the document and 0 means it is absent. There are many techniques to determine these “values” (or weights) that we’ll talk about a little later. In a 3D space, our documents can now be visualized as below:</p>

<p><img src="https://trigonaminima.github.io/assets/2016-11/VSM1.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>
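<p>The same three vectors can be built in a few lines of plain Python. This is a minimal sketch: the axis order follows the text above, and the helper name <code class="language-plaintext highlighter-rouge">binary_vector</code> is made up for illustration.</p>

```python
# Axis order assumed from the text: x = "rebellion", y = "the", z = "boxer".
axes = ["rebellion", "the", "boxer"]

docs = {
    "Document 1": "The boxer rebellion",
    "Document 2": "The boxer",
    "Document 3": "The rebellion",
}

def binary_vector(text, axes):
    """1 if the term occurs in the text, 0 otherwise (case-insensitive)."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in axes]

vectors = {name: binary_vector(text, axes) for name, text in docs.items()}
for name, vec in vectors.items():
    print(name, vec)
```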

<h3 id="term-weights">Term weights</h3>

<p>There’s a concept of a term-document matrix. Usually, wherever VSMs are used, we have a lot of documents (or at least, we assume there will be), so we build a term-document matrix. Here, each row is a term and each column is a document. Hence, term-document matrix, duh! In short, our example becomes as follows,</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: center">doc1</th>
      <th style="text-align: center">doc2</th>
      <th style="text-align: center">doc3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>the</strong></td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">1</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>boxer</strong></td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">0</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>rebellion</strong></td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">1</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<p>Let’s talk about the values with which this matrix can be filled. All of these methods (models or techniques) are built upon the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words model</a>. Here, we take all the unique words (or phrases or tokens) from all the documents and create a common dictionary, ignoring grammar and the order in which the words occur in the original text. Basically, all the words from the documents are collected into a bag, hence, “bag-of-words”.</p>

<ol>
  <li>
    <p><strong>Binary Model</strong></p>

    <p>The binary model is the one we used in our example above to create the document vectors, where each value just indicates whether that word occurs in that document or not. Thus, only ones and zeros. This model takes into account the presence or absence of a word in the document, but it ignores the number of times the word occurs. For example, in the eyes of a binary model, 2 documents - one where we talk about Facebook and another one where we talk about social networking sites in general - will be very similar (actually, exactly the same) if we were judging based on the word facebook, but in reality, the 2nd document is not that similar.</p>
  </li>
  <li>
    <p><strong>Count Model</strong></p>

    <p>Yeah, you guessed it right! Instead of having 1 or 0 for the presence or absence of a word, it gives the count of a particular word in that document. In our example, though, the term-document matrix will remain the same as in the binary model. This model is better than the binary model as it also captures how often each word occurs in every document. Hence, it’ll be able to differentiate between the 2 documents taken in our previous example (last paragraph).</p>

    <p><!--     ||doc1|doc2|
     | :---- | :----: | :----: |
     |**facebook**|6|2|
     |**social**|3|6|
 --></p>

    <p><!-- <img src="/assets/2016-11/count_vsm.png" style="display: block;margin-left: auto;margin-right: auto;"> -->
 <img src="https://trigonaminima.github.io/assets/2016-11/count_vsm.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>

    <p>You see? The separation between the two documents has changed (from zero degrees in the binary model). This model still has a few limitations though. Any English document is full of filler words like a, an, the, then, that, by, of. These words don’t give any indication about the theme a document might be talking about. They are called <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a>. Even if these words are filtered out before applying the model, there are situations where a thematic word might become a stop word for our use case. For example, if our corpus contains only articles about instrumental music, then having the word “instrumental” in our vocabulary won’t be much help: we already know our data is about instrumental music, and the word is bound to occur in abundance. Thus, it is a stop word for our use case. Another limitation is that the word count is unbounded. A word can have a very high count in one document and a very low count, maybe even zero, in another. This amplifies the separation between the documents. In the above figure, both documents could have been judged more similar, but the raw counts pushed them far apart.</p>
  </li>
  <li>
    <p><strong>Log-freq Weighting Model</strong></p>

    <p>This model is a slight variation of the Count Model to counter the skewness of the separation. Here, instead of the word frequency in each document, \(\displaystyle \mathrm {1+log_{10}(word\_freq)}\) is taken (a word that doesn’t occur at all keeps the value 0). The logarithm grows quickly for small frequencies and slowly for large ones, so high counts are compressed while still scoring higher than low counts. So, our term-document matrix now becomes,</p>

    <p><!--     ||doc1|doc2|
     | :---- | :----: | :----: |
     |**facebook**|1.778|1.301|
     |**social**|1.477|1.778|
 --></p>

    <p><!-- <img src="/assets/2016-11/log_freq_vsm.png" style="display: block;margin-left: auto;margin-right: auto;"> -->
 <img src="https://trigonaminima.github.io/assets/2016-11/log_freq_vsm.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>

    <p>Now, this decreased the separation between the documents significantly, but it still doesn’t handle the stop words issue.</p>
  </li>
  <li>
    <p><strong>Term Frequency-Inverse Document Frequency Model (tf-idf)</strong></p>

    <p>This is the model which covers all the limitations of the models described above. In the model name, <em>term frequency</em> is, as the name suggests, the frequency of occurrence of each term in the document, whereas, <em>inverse document frequency</em> is to counter the effect of that term according to the specificity of that term across all of the documents. Mathematically, <u>term frequency</u> (\({\displaystyle \mathrm {tf} (t,d)}\)) can be calculated using any of the following formulas.</p>

    <ul>
      <li><strong>Raw frequency</strong>
  Just the ratio of term count to the total word count of the document.</li>
    </ul>

\[{\displaystyle \mathrm {tf} (t,d)={ \frac {term\_count} {|d|}} }\]

    <ul>
      <li><strong>Log normalized frequency</strong>
 Exactly the same as described in the log-freq weighting model above.</li>
    </ul>

\[{\displaystyle \mathrm {tf} (t,d)=1+log_{10}(word\_freq) }\]

    <ul>
      <li><strong>Double normalization k (0 &lt; k &lt; 1)</strong>
 This is made to prevent the bias towards the longer documents. The formula effectively is the term’s raw frequency divided by the maximum raw frequency of any term in that document, rescaled to lie between k and 1.</li>
    </ul>

\[{\displaystyle \mathrm {tf} (t,d)=k+(1-k)\cdot {\frac {f_{t,d}}{\max\{f_{t',d}:t'\in d\}}}}\]

    <p>And similarly, <u>inverse document frequency</u> (\({\displaystyle \mathrm {idf} (t,D)}\)) is a <a href="https://en.wikipedia.org/wiki/Logarithmic_scale">log normalized</a> inverse fraction, determined using the following expression, where \({\displaystyle \mathrm N}\) is the total number of documents and the <strong>denominator</strong> denotes the number of documents containing the term t. If this count is 0, then a value of 1 is taken as the adjusted count. The ratio of the total number of documents to the number of documents containing a particular term measures how specific (or how common) that term is across our corpus. Thus, if the term is rare in our corpus, the idf will be larger, and vice versa.</p>

\[{\displaystyle \mathrm {idf} (t,D)=\log {\frac {N}{|\{d\in D:t\in d\}|}}}\]

    <p>Finally, the tf-idf score is calculated by just multiplying \({\displaystyle \mathrm {tf}}\) and \({\displaystyle \mathrm {idf}}\). Now, let’s calculate the first value using tfidf.</p>

\[{\displaystyle \mathrm {tfidf} (facebook,doc1,D)= {\frac {6} {9}}\cdot log \frac {2} {2} = {\frac {6} {9}} \cdot 0 = 0}\]

    <p>Unfortunately, all the values for our new term-document matrix come out to be zero, because both terms occur in both documents and so their idf is \(\log 1 = 0\). The two documents become indistinguishable under any similarity metric. I didn’t see that coming, but we will rarely have a corpus like this in reality.</p>
  </li>
</ol>
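<p>The count, log-freq and tf-idf weights for the facebook/social example can be reproduced with a short sketch. The counts (6 and 3 in doc1, 2 and 6 in doc2) are the ones used for that example; the helper names are hypothetical, and tf is taken as the raw frequency variant with a base-10 idf.</p>

```python
import math

# Counts from the facebook/social example in this section.
counts = {"doc1": {"facebook": 6, "social": 3},
          "doc2": {"facebook": 2, "social": 6}}
terms = ["facebook", "social"]

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(min(1.0, dot / norm)))

# Count model: the raw counts are the weights.
count_vecs = {d: [c[t] for t in terms] for d, c in counts.items()}
# Log-freq model: 1 + log10(count) for every non-zero count.
log_vecs = {d: [1 + math.log10(x) for x in v] for d, v in count_vecs.items()}

def tfidf(term, doc):
    """Raw-frequency tf multiplied by log10(N / document frequency)."""
    tf = counts[doc][term] / sum(counts[doc].values())
    df = sum(1 for c in counts.values() if c.get(term))
    return tf * math.log10(len(counts) / df)

count_angle = angle_deg(count_vecs["doc1"], count_vecs["doc2"])
log_angle = angle_deg(log_vecs["doc1"], log_vecs["doc2"])
print("count model angle   :", round(count_angle, 1))
print("log-freq model angle:", round(log_angle, 1))
print("tfidf(facebook, doc1):", tfidf("facebook", "doc1"))
```

<p>The log-freq weights pull the two documents much closer together than the raw counts do, and the tf-idf value collapses to zero because every term appears in every document.</p>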

<h3 id="the-need-for-vsms">The need for VSMs</h3>

<p>The domain of Natural Language Processing deals with text, a collection of words. There are no numbers to deal with. In images, for example, there are RGB values for each pixel, which are already encoded as numbers. In the case of text, there is nothing but words and characters. We have to find ways to convert those <em>features</em> into numbers so that our algorithms can do some computations on them. Converting a document into a VSM opens up some new avenues for us. Firstly, linear algebra concepts can be applied to our computations, making them efficient and scalable. Secondly, by converting our documents into vectors we get some neat techniques (from n-dimensional mathematics) under our belt to arrive at new insights. We can determine the Euclidean distance between two points, telling us about the distance between 2 documents (this is not a good idea - <a href="https://youtu.be/ZEkO8QSlynY">watch this</a>). We can also find the angle between two vectors representing two different documents, which can be taken as a proxy for the similarity between the documents. For example, let’s say we have one more document in our previous set of documents,</p>

<ul>
  <li><strong>Document 1</strong>: The boxer rebellion [1, 1, 1]</li>
  <li><strong>Document 2</strong>: The boxer [0, 1, 1]</li>
  <li><strong>Document 3</strong>: The rebellion [1, 1, 0]</li>
  <li><strong>Document 4</strong>: Rebellion [1, 0, 0]</li>
</ul>

<p>Now, if we consider these pairs individually, then for each pair we can intuitively say that,</p>

<ul>
  <li>(D1, D4): There’s just one word in common and they are kinda similar in meaning.</li>
  <li>(D2, D4): There are no common words and their meanings are completely different as well.</li>
  <li>(D3, D4): Here, there’s one common word and they almost mean the same.</li>
</ul>

<p>Now, if I ask, what should be the separation between these pairs? Intuitively, one can say that D2 and D4 will be the farthest from each other, D1 and D4 will be a bit closer, and D3 and D4 will be the closest. Indeed, we see that pattern, with angles of 90°, ~54.7°, and 45° respectively.</p>

<p><img src="https://trigonaminima.github.io/assets/2016-11/VSM_angles.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>
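<p>The angles for the three pairs can be checked with the dot-product formula. A minimal sketch using the binary vectors listed above; the function names are just for illustration.</p>

```python
import math

docs = {
    "D1": [1, 1, 1],  # The boxer rebellion
    "D2": [0, 1, 1],  # The boxer
    "D3": [1, 1, 0],  # The rebellion
    "D4": [1, 0, 0],  # Rebellion
}

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def angle_deg(u, v):
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine(u, v)))))

pairs = [("D1", "D4"), ("D2", "D4"), ("D3", "D4")]
angles = {pair: angle_deg(docs[pair[0]], docs[pair[1]]) for pair in pairs}
for pair, angle in angles.items():
    print(pair, round(angle, 1))
```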

<p>A similar calculation can be applied to the rows of the term-document matrix. Each word then gets a vector in an m-dimensional space, where m is the number of documents in the corpus. Using the same similarity score calculation, we can find the <em>statistical synonyms</em> of each word in our corpus. These word vectors, if I am correct, can also be called <em>word embeddings</em>, but more about that some other time.</p>
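<p>As a toy illustration of that idea, the sketch below flips the four document vectors into one vector per term and looks for the closest other term. With only four documents the “synonyms” are meaningless, so treat it purely as a demonstration of the mechanics; <code class="language-plaintext highlighter-rouge">most_similar</code> is a made-up helper.</p>

```python
import math

terms = ["rebellion", "the", "boxer"]
doc_vectors = [[1, 1, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]  # D1..D4

# Transposing gives one row per term: the term-document matrix.
term_vectors = dict(zip(terms, (list(col) for col in zip(*doc_vectors))))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(term):
    """The other term whose document-occurrence vector points in the closest direction."""
    others = [t for t in terms if t != term]
    return max(others, key=lambda t: cosine(term_vectors[term], term_vectors[t]))

print(term_vectors)
print("closest to 'boxer':", most_similar("boxer"))
```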

<h3 id="so-a-quick-summary">So, a quick summary?</h3>

<ol>
  <li>We have an n-dimensional vector space where n is the size of our vocabulary.</li>
  <li>Each term (word, keyword, phrase, etc) in our vocabulary is an axis.</li>
  <li>Documents are points or vectors in this n-D space.</li>
  <li>With this encoding, we can apply vector mathematics by creating a term-document matrix.</li>
  <li>There are many methods of obtaining each value of these document vectors, namely, binary, count, log normalized and tfidf.</li>
  <li>tfidf is the most practical and useful score.</li>
  <li>With this vector representation we are able to calculate similarity between documents which can be used in Information Retrieval or Recommender Systems.</li>
</ol>

<h3 id="doing-it-in-python">Doing it in Python</h3>

<p>What’s the point of all the above theory if we are not going to actually apply it? This section will explain the generation of a term-document matrix out of our documents using the machine learning python library, <a href="http://scikit-learn.org/stable/">scikit-learn</a>. Later in this article we will make a recommendation system using the term-document matrix made in this step. Let’s start.</p>

<p>So, to create a term-document matrix there’s a direct implementation in scikit-learn library. It is defined in <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform">sklearn.feature_extraction.text.TfidfVectorizer</a>. Let our documents be.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">docs</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"The boxer rebellion"</span><span class="p">,</span>
    <span class="s">"The boxer"</span><span class="p">,</span>
    <span class="s">"The rebellion"</span>
<span class="p">]</span></code></pre></figure>

<p>The vectorizer can be defined as follows:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>

<span class="n">args</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"stop_words"</span><span class="p">:</span> <span class="s">"english"</span><span class="p">,</span>
    <span class="s">"lowercase"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"norm"</span><span class="p">:</span> <span class="s">"l2"</span><span class="p">,</span>
    <span class="s">"use_idf"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"smooth_idf"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"sublinear_tf"</span><span class="p">:</span> <span class="bp">True</span>
<span class="p">}</span>

<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="o">**</span><span class="n">args</span><span class="p">)</span></code></pre></figure>

<p>In the <code class="language-plaintext highlighter-rouge">args</code> dictionary defined above, <code class="language-plaintext highlighter-rouge">"stop_words": "english"</code> makes the vectorizer remove the stop words from the documents. So, “the” will be removed from our documents before calculating the tfidf scores. The <code class="language-plaintext highlighter-rouge">lowercase</code> option converts all the characters to lower case. With the <code class="language-plaintext highlighter-rouge">"use_idf": True</code> option we will get the <u>tfidf scores</u>; otherwise we would have gotten the <u>tf scores</u>. <code class="language-plaintext highlighter-rouge">"smooth_idf": True</code> means one will be added to the document frequencies so that division by zero can be prevented. <code class="language-plaintext highlighter-rouge">"sublinear_tf": True</code> replaces <code class="language-plaintext highlighter-rouge">tf</code> with \(1 + \log(tf)\) (log normalization). The <code class="language-plaintext highlighter-rouge">norm</code> option defines the normalization of the vectors made; <code class="language-plaintext highlighter-rouge">l2</code> means the vectors are normalized by the Euclidean norm, formally given as:</p>

\[{\displaystyle \mathrm v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}} }\]
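<p>To see how these options combine, here is a hand computation for Document 1 (“The boxer rebellion”) in plain Python. It is a sketch under stated assumptions: the smoothed idf is written as ln((1 + n) / (1 + df)) + 1, which is how I read the scikit-learn documentation, and the helper names are made up rather than part of the library.</p>

```python
import math

n_docs = 3
# Document frequencies once "the" is dropped as a stop word.
doc_freq = {"boxer": 2, "rebellion": 2}

def smooth_idf(df, n):
    """Smoothed idf, assumed form: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

def sublinear_tf(count):
    """Log-normalized term frequency: 1 + ln(count)."""
    return 1 + math.log(count)

def l2_normalize(v):
    """Divide every component by the Euclidean length of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# In Document 1 each remaining term occurs exactly once.
raw = [sublinear_tf(1) * smooth_idf(doc_freq[t], n_docs) for t in ("boxer", "rebellion")]
row = l2_normalize(raw)
print(raw, row)
```

<p>Both terms get the same raw weight, so after the l2 step each component is about 0.7071 and the vector has length 1.</p>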

<p>There are more options, which are explained in this <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">user guide</a> provided by the scikit-learn documentation. Now, it’s time to get the tfidf term-document matrix.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">train_tdm</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>

<span class="c1"># older scikit-learn releases spelled this get_feature_names()
</span><span class="k">print</span><span class="p">(</span><span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
<span class="c1"># ['boxer' 'rebellion']
</span>
<span class="k">print</span><span class="p">(</span><span class="n">train_tdm</span><span class="p">.</span><span class="n">toarray</span><span class="p">())</span>
<span class="c1">#[[ 0.70710678  0.70710678]
# [ 1.          0.        ]
# [ 0.          1.        ]]</span></code></pre></figure>

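<p>Thus, we got the term-document matrix for our documents. Note that scikit-learn returns it with one row per document and one column per term, i.e. the transpose of the table shown earlier. The use of this matrix will be shown later in the article, when we will be getting recommendations based on the documents from which we just made the matrix.</p>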

<h2 id="recommender-systems">Recommender Systems</h2>

<p>You can get a basic understanding of recommendation systems from this article - <a href="https://www.analyticsvidhya.com/blog/2015/10/recommendation-engines/">Understanding basics of Recommendation Engines (with case study)</a>. It’s lacking a bit here and there, but still a <em>simple introduction</em> to the topic. I’ll still give a quick summary about the topic and then move on to the kind of recommendation engine I want to construct for my personal usage.</p>

<h3 id="what-is-it-1">What is it?</h3>

<p>Recommender Systems are something which recommend you some new content - book recommendations (<a href="https://www.goodreads.com">Goodreads</a>), movie recommendations (<a href="https://www.netflix.com/in/">Netflix</a>), product recommendations (Amazon) and many more such things - based on your content history and/or users similar to you.</p>

<p>There are two major kinds of recommendation engines:</p>

<ol>
  <li>
    <p>Content-Based Recommendation Engines</p>

    <p>Here the recommendations are based on your past history. So, to give you any recommendations, the system has to know what are the items you have liked or disliked previously, to get patterns out. To understand them a bit better you can have a look at <a href="https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/">Beginners Guide to learn about Content Based Recommender Engines</a>.</p>
  </li>
  <li>
    <p>Collaborative Recommendation Engines</p>

    <p>Here, the system finds users similar to you, based on your history or on some common data (location in the world, age, gender, etc), and then recommends the items that they have liked but you haven’t seen yet.</p>
  </li>
</ol>

<p>These are the two kinds; you can use either of them or a combination of both to suit your application.</p>

<h3 id="what-i-am-up-to">What am I up to?</h3>

<p>I try to follow a lot of technical blogs to keep myself updated with the technological changes and interesting stories (usually, technical) around the internet. But I was facing the issue that I wanted all the articles in one place, without going to the individual sites to see them. I could have easily used some blog aggregator service or, maybe, made a very basic <em>new blog articles fetcher</em>, but then there was the issue that not every article of every blog is awesome. And I definitely have limited time to go through each of them. So, I thought of making a system to fetch all the latest articles published and recommend me the ones which align with my interests. And that’s what I am building. I have thought of a few standard techniques to implement, and then I’ll test each one out live. Then let’s see where this goes. Anyway, since I am the only user in the system, this recommendation engine is going to be a <strong>content-based</strong> one.</p>

<p>The first thing I am gonna implement is a very basic version of a content-based recommendation engine. I have some 100+ articles that I like. So, this is going to be my corpus. Since we are talking about a blog recommendation system, each document needs a vector space model. Trained on those VSMs, the engine is going to recommend me the next great articles out of the new ones. The code above gives me the tfidf-based VSMs for my whole corpus. Now, the next step is to fetch the new article’s content, create a VSM for that article based on the vocabulary of my training corpus, and then find the similarity (cosine similarity) between each liked article and the new article. If the maximum similarity score among all the scores passes a certain threshold, the article will be recommended to me, else discarded.</p>
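<p>The decision rule at the end of that paragraph can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the toy vectors stand in for real tfidf vectors, the 0.5 threshold is arbitrary, and <code class="language-plaintext highlighter-rouge">should_recommend</code> is a hypothetical helper.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def should_recommend(new_vec, liked_vecs, threshold=0.5):
    """Recommend when the closest liked article is at least `threshold` similar."""
    best = max(cosine(new_vec, liked) for liked in liked_vecs)
    return best >= threshold, best

# Toy vectors over a 3-term vocabulary, standing in for liked articles.
liked = [[0.7071, 0.7071, 0.0], [1.0, 0.0, 0.0]]

print(should_recommend([0.6, 0.8, 0.0], liked))  # shares terms with the liked articles
print(should_recommend([0.0, 0.0, 1.0], liked))  # shares nothing with them
```

<p>Taking the maximum means one very close liked article is enough to trigger a recommendation; averaging the scores instead would favour articles that resemble the corpus as a whole.</p>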

<p>There are a few things that should be discussed here.</p>

<ol>
  <li>
    <p><strong>What is cosine similarity and why is it being used?</strong></p>

    <p>In the <strong>The need for VSMs</strong> section, I discussed the angle between two article vectors as a representative of how similar the articles are. Cosine similarity is the cosine of that angle. In general, the angle between two vectors ranges from 0° to 180°: the cosine is 1 at 0° (same direction, most similar), 0 at 90° (nothing in common) and -1 at 180° (opposite directions). Since tfidf weights are never negative, the angle between two article vectors stays between 0° and 90°, so the cosine value goes from 0 to 1 as the similarity between documents increases. Calculating the cosine of the angle is also easier than calculating the exact angle, as it only needs a dot product and the lengths of the vectors.</p>
  </li>
  <li>
    <p><strong>How does using words as features (bag-of-words) help?</strong></p>

    <p>The bag-of-words model relies on the so-called <strong><a href="https://en.wikipedia.org/wiki/Distributional_semantics">Distributional Hypothesis</a></strong> in linguistics, which states that</p>

    <blockquote>
      <p>linguistic items with similar distributions have similar meanings.</p>
    </blockquote>

    <p>Here we don’t take the order of the words in the document or the semantics of the document into account. But based on the distributional hypothesis, we can assume that articles with a similar distribution of words are similar in meaning. Furthermore, how often a word occurs might change the meaning/intent of an article, and that is handled by the tfidf measure.</p>

    <p>And, if this still doesn’t work, we can use word level bi-grams (or tri-grams) to take into account some of the document’s structural context. In fact, this option is on my list of things to try, to see if it gives me better results.</p>
  </li>
</ol>
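
<p>To make the cosine point above concrete, here is a minimal sketch in plain Python. The helper name <code class="language-plaintext highlighter-rouge">cosine_similarity</code>, the toy weights and the two-word vocabulary are assumptions made up for illustration; this is not the code the engine actually uses.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Toy term weights over a hypothetical vocabulary ["boxer", "rebellion"].
same_direction = cosine_similarity([1.0, 1.0], [2.0, 2.0])
partial_overlap = cosine_similarity([1.0, 1.0], [1.0, 0.0])
no_overlap = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, partial_overlap, no_overlap)
```

<p>The three values should come out at roughly 1, 0.707 and 0. The length of a vector does not matter, only its direction, and with non-negative weights the score never goes below 0.</p>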

<h3 id="again-coming-to-python-implementation">Again, coming to Python implementation</h3>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>

<span class="n">args</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"stop_words"</span><span class="p">:</span> <span class="s">"english"</span><span class="p">,</span>
    <span class="s">"lowercase"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"norm"</span><span class="p">:</span> <span class="s">"l2"</span><span class="p">,</span>
    <span class="s">"use_idf"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"smooth_idf"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">"sublinear_tf"</span><span class="p">:</span> <span class="bp">True</span>
<span class="p">}</span>
<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="o">**</span><span class="n">args</span><span class="p">)</span>

<span class="n">docs</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"The boxer rebellion"</span><span class="p">,</span>
    <span class="s">"The boxer"</span><span class="p">,</span>
    <span class="s">"The rebellion"</span>
<span class="p">]</span>
<span class="n">train_tdm</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span></code></pre></figure>

<p>The above code snippet is our previous code put together. It prepares a tfidf based term-document matrix from our set of documents. This term-document matrix has a vocabulary defined from the words found in all the docs. Now, let’s say we got a new document - boxer in rebellion. Here “in” was not part of the vocabulary of our training set (it is also one of the English stop words we asked the vectorizer to drop), so this word will be ignored while creating the term-document matrix of this new document against the training data.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">new_doc</span> <span class="o">=</span> <span class="p">[</span><span class="s">"boxer in rebellion"</span><span class="p">]</span>
<span class="n">test_tdm</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">new_doc</span><span class="p">)</span></code></pre></figure>

<p>Note, here we used the method <code class="language-plaintext highlighter-rouge">vectorizer.transform()</code> instead of <code class="language-plaintext highlighter-rouge">vectorizer.fit_transform()</code>. The latter learns a vocabulary from the documents passed and returns a term-document matrix according to that vocabulary, whereas the former only returns the term-document matrix of the new document based on the vocabulary already learned from the training data. Thus, any word in the new document that is not in that vocabulary will be ignored. We’ll calculate the similarity of this new document to every document in the training data.</p>
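
<p>As a quick aside before the similarity calculation, the vocabulary point can be checked directly. This is a small sketch, assuming scikit-learn is installed; the word “uprising” is a made-up unseen word (and not a stop word), picked only to show that <code class="language-plaintext highlighter-rouge">transform()</code> drops anything outside the learned vocabulary.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The boxer rebellion", "The boxer", "The rebellion"]
vectorizer = TfidfVectorizer(stop_words="english")
train_tdm = vectorizer.fit_transform(docs)  # learns the vocabulary from docs

# "uprising" is a hypothetical unseen word; transform() has no column for it.
test_tdm = vectorizer.transform(["boxer uprising rebellion"])

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)      # terms learned from the training docs only
print(test_tdm.shape)  # one row, one column per learned term
```

<p>The new document still gets exactly as many columns as the training matrix, which is what lets us compare the two.</p>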

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.metrics.pairwise</span> <span class="kn">import</span> <span class="n">linear_kernel</span>

<span class="c1"># train_tdm is a 3X2 matrix and test_tdm is a 1X2 matrix.
# The following code returns a 3X1 matrix
</span><span class="n">similarities</span> <span class="o">=</span> <span class="n">linear_kernel</span><span class="p">(</span><span class="n">train_tdm</span><span class="p">,</span> <span class="n">test_tdm</span><span class="p">)</span>
<span class="c1">#array([[ 1.        ],
#       [ 0.70710678],
#       [ 0.70710678]])
</span>
<span class="n">index</span> <span class="o">=</span> <span class="n">similarities</span><span class="p">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="bp">None</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># 0
</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">similarities</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
<span class="c1"># array([ 1.])
</span>
<span class="n">article</span> <span class="o">=</span> <span class="n">docs</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
<span class="c1"># 'The boxer rebellion'</span></code></pre></figure>

<p>The method <code class="language-plaintext highlighter-rouge">linear_kernel()</code> computes the dot product of the vectors. You might have noticed that cosine similarity needs the <em>dot product</em> divided by the product of the lengths of the two vectors, whereas we have only computed the dot product here. The reason is that, while preparing the term-document matrix, we provided the argument <code class="language-plaintext highlighter-rouge">"norm": "l2"</code> to <code class="language-plaintext highlighter-rouge">TfidfVectorizer()</code>, which scales every document vector to unit length. For unit-length vectors that division is by 1, so the dot product is already the cosine similarity.</p>
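
<p>A small sanity check of that claim, plus the threshold idea from earlier, could look like the sketch below. It assumes scikit-learn and NumPy are installed; the <code class="language-plaintext highlighter-rouge">THRESHOLD</code> value of 0.8 is a hypothetical cut-off chosen only for illustration, not a tuned number.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["The boxer rebellion", "The boxer", "The rebellion"]
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
train_tdm = vectorizer.fit_transform(docs)
test_tdm = vectorizer.transform(["boxer in rebellion"])

# On l2-normalized rows the plain dot product equals the cosine similarity.
dot_products = linear_kernel(train_tdm, test_tdm)
cosines = cosine_similarity(train_tdm, test_tdm)
same = np.allclose(dot_products, cosines)

THRESHOLD = 0.8  # hypothetical cut-off, to be tuned on real data
best = int(dot_products.argmax())           # index of the closest liked article
best_score = float(dot_products[best, 0])
recommend = best_score >= THRESHOLD
print(same, best, recommend)
```

<p>If <code class="language-plaintext highlighter-rouge">norm</code> were set to <code class="language-plaintext highlighter-rouge">None</code>, the two results would no longer agree in general, and <code class="language-plaintext highlighter-rouge">cosine_similarity()</code> would be the one to use.</p>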

<h2 id="conclusion">Conclusion</h2>

<p>The aim of this post was to document the internals of the Vector Space Models and the basic understanding of the recommendation engine in a barely working form. This is the most basic recommendation engine that can be built. There’s nothing new being done here. In future posts, I am going to write about some more sophisticated techniques that should work better than this basic one. But before exploring (and testing) those techniques, I have to make a complete working system, where I am gonna use this recommendation-engine. I am creating a sort of utility (bot/intelligence/etc) where I have many bots running on various slack channels which will do specific things, giving me new articles to read, for instance. I chose <a href="https://slack.com/">Slack</a> because then I can access the platform from my android, web or laptop. I have to figure out some design details about this whole system. I’ll talk about that whole thing in detail in some other post.</p>]]></content><author><name>Shivam Rana</name></author><category term="RecSys" /><category term="NLP" /><summary type="html"><![CDATA[Vector Space Models (VSM)]]></summary></entry><entry><title type="html">Into the Wild: A Review</title><link href="https://trigonaminima.github.io/2016/10/into-the-wild-review/" rel="alternate" type="text/html" title="Into the Wild: A Review" /><published>2016-10-09T00:00:00+00:00</published><updated>2016-10-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2016/10/into-the-wild-review</id><content type="html" xml:base="https://trigonaminima.github.io/2016/10/into-the-wild-review/"><![CDATA[<p>I once heard on a podcast that, when you like something, make it a practice to think of the reason(s) why you like it. If I remember it correctly, this was a strategy to keep your sceptical nature in check. Anyway, sceptical or not, I think this is a good practice.</p>

<p>Recently, I finished reading <a href="https://en.wikipedia.org/wiki/Into_the_Wild_(book)">Into the Wild</a> by Jon Krakauer. It’s a biography, a true travel essay about an adventurer who went off the grid to live like a tramp, ending his long trip by living in Alaska. Although it was a bit bogged down in a few places with so many details about the places mentioned that it was not easy for me to imagine them, it was an interesting read.</p>

<h2 id="the-author">The Author</h2>

<p><a href="https://en.wikipedia.org/wiki/Jon_Krakauer">Jon Krakauer</a> is a writer and mountaineer, and there’s even a chapter in the book about his life as a mountaineer, where he describes his extremely dangerous and lonely climb of the Devils Thumb through a route nobody had traveled before. Probably, the author related to the main character of the book, Chris McCandless, at a personal level, which inspired him to write an entire book about him. The level of inspiration (or possession?) is kind of apparent from the fact that Krakauer first wrote an article reporting the unfortunate death of McCandless but later expanded it into this book. He researched the entire life of McCandless: he talked to his family, friends and all the people McCandless made friends with on his way to Alaska, trying his best to understand the reasons, the ideals, the motives which guided this guy to live the life of a vagabond. Indeed, everyone who hears McCandless’ story will think about that. Krakauer, though, took it a step further, maybe because he is a bit like McCandless.</p>

<h2 id="the-title">The Title</h2>

<p>What does the word “wild” mean? By the dictionary, it means an inhospitable, uncultivated or uninhabited place. One can also call a pure or untouched place wild: a place which is yet to be destroyed by humanity. It can be someplace unknown to us.</p>

<blockquote>
  <p>“I want movement, not a calm course of existence. I want excitement and danger and the chance to sacrifice myself for my love. I feel in myself a superabundance of energy which finds no outlet in our quiet life.” - Leo Tolstoy</p>
</blockquote>

<p>Into the Wild. The title, to me, perfectly captures the whole essence of the book. The literal meaning can be to go into a wild, inhospitable place. It can be interpreted, though, in many ways. It can mean getting out of the comfort zone to just go into the unknown. It can also mean being free and not shackled by the constructs of society. It can also mean a life in motion. And this is what you’ll feel if you read the book - a compilation of the stories of the wild.</p>

<h2 id="the-book">The book</h2>

<p>The book describes the journey of a young man who started out on a life changing journey without informing anyone in his family, finally ending the trip in Alaska. It is written in a biography form sprinkled with quotes and stories related to the theme. The author gives an insight into his thoughts on the various situations and draws many parallels from the lives of many, including his own. The book mostly follows the sequence of actual events, and at points where it doesn’t, I didn’t feel the existence of any gaps. It was even sort of engaging, thinking about when that part would be picked up again.</p>

<p>The author, in the book, points out that many of the mails he got in response to the original magazine piece didn’t like the glorification of the death of someone who didn’t care about his life. They claimed he was suicidal. But, throughout the book, Krakauer tries to answer the questions which might throw some light on the psyche of McCandless. He also doesn’t hold back from writing about the mistakes McCandless made.</p>

<p>I think the author’s aim in writing this book was different. Yes, he wanted everyone to know about the life of Chris McCandless: how he lived, how he connected with everyone, how charming he was, how what he did was brave and something exotic. But, more importantly, he wanted everyone to learn from his life. I think he wanted every reader to understand the importance of engaging with nature (obviously, not to the extent of McCandless’ or his journey to Alaska). I think he wanted us to see how our connections with others are very important to enjoy life, and how everyone close to us suffers if our decisions turn out to be deadly (figuratively or literally).</p>

<h2 id="the-main-character">The Main Character</h2>

<p>Chris McCandless, the character whose life we read about, seems like an intelligent young man who never liked to work according to rules. He despised his parents but was very close to his sister. Throughout the book, Chris, is shown to have good relationships with anyone who helped him during his journey, but every time it started getting more serious he’d resume his journey leaving those friends behind. This passage indeed puts it clearly.</p>

<blockquote>
  <p>McCandless was thrilled to be on his way north, and he was relieved as well - relieved that he had again evaded the impending threat of human intimacy, of friendship, and all the messy emotional baggage that comes with it. He had fled the claustrophobic confines of his family. He’d successfully kept Jan Burres and Wayne Westerberg at arm’s length, flitting out of their lives before anything was expected of him. And now he’d slipped painlessly out of Ron Franz’s life as well.</p>
</blockquote>

<p>McCandless understood, but despised, money and the role it played in society. It comes as no surprise when we get to know that before starting his odyssey he donated all the money he had in his bank account and burned all the money he had in his pockets. He took only the essentials and just started his journey. For most of the journey, Chris was of the mindset that he was really happy and enjoying his life by living on his terms. By the end, though, in Alaska, he seemed to have realized the importance of human connections in one’s life. The set of events that happened to McCandless was unfortunate, but if he had been able to get out of the wilderness safely, this whole trip would really have been an Odyssey. An epic journey.</p>

<p>There’s a part of the book where Chris’ love for running is depicted. He’d get sad if he thought he hadn’t performed well in a race and would not talk to anyone for hours. From the accounts of his friends, it seemed like running was almost a religion to Chris McCandless. As if he was mental, in a positive way, about that sport.</p>

<blockquote>
  <p>McCandless viewed running as an intensely spiritual exercise, verging on religion. “Chris would use the spiritual aspect to try to motivate us,” recalls Eric, another friend on the team. “He’d tell us to think about all the evil in the world, all the hatred, and imagine ourselves running against the forces of darkness, the evil wall that was trying to keep us from running our best. He believed doing well was all mental, a simple matter of harnessing whatever energy was available. As impressionable high school kids, we were blown away by that kind of talk.”</p>
</blockquote>

<h2 id="the-ending">The Ending</h2>

<p>The book ends with a description of Chris’ parents coming to peace with their son’s death in Alaska. They visited the site where their son died, the Magic Bus, as Chris called it the day he found the bus in the wilderness. His parents saw the same beauty and tranquility that Chris might have enjoyed in the place where he spent his last days.</p>

<h2 id="the-conclusion">The Conclusion</h2>

<p>I recommend anyone who has ever thought about going out and travel out of their backpacks to read this book. The book depicts the life of a guy who started living life on his own terms in the wilderness. This book is the tale of all the things that guy did, learned and even taught. This book highlights the most important lesson for humans - HAPPINESS ONLY REAL WHEN SHARED.</p>]]></content><author><name>Shivam Rana</name></author><category term="Book" /><summary type="html"><![CDATA[I once heard on a podcast that, when you like something, make it a practice to think of the reason(s) why you like it. If I remember it correctly, this was a strategy to keep your sceptical nature in check. Anyway, sceptical or not, I think this is a good practice.]]></summary></entry><entry><title type="html">Chatting Up</title><link href="https://trigonaminima.github.io/2016/06/chatting-up/" rel="alternate" type="text/html" title="Chatting Up" /><published>2016-06-09T00:00:00+00:00</published><updated>2016-06-09T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2016/06/chatting-up</id><content type="html" xml:base="https://trigonaminima.github.io/2016/06/chatting-up/"><![CDATA[<p>So, my <a href="/2014/11/gamification-of-life/">Gamification of Life</a>, has grown into a much bigger project than I thought it was. Now, the gamification part is dropping off and the concept of personal analytics of my life is gaining. Basically, I want to collect data about myself and analyse it to be kinda self-aware. At the end, make a dashboard that shows me the basic stats and maybe help find patterns where I can improve myself. This post will talk about the chats I have been having for 4-5 years on various IMs mainly, WhatsApp and Facebook.</p>

<p>Let’s first talk about the data. I had this chat thing in mind for a year or two. Hence, during all this time, whenever I had a conversation with someone on WhatsApp, I used to email it to myself. To reduce my work, I did it only when I thought a particular chat had been going on for a long time. There was no other method to get them out of the app. Then, I manually downloaded every text file and ran a script which gave me the clean CSVs. For the purposes of this post, I have only extracted the timestamps and the person sending the replies.</p>

<p>Getting Facebook chats data was a time consuming task. I tried a few non-coding ways to get all the data, something that might give me the chats in the form of text files, but none helped. I tried a Chrome plugin, <a href="https://chrome.google.com/webstore/detail/facebook-chat-downloader/kflkdhmijdgjnlbdkfgdmolcjnflmlhf?hl=en">Facebook Chat Downloader</a>. This was useful to get around 30-40k messages. But I had a friend with whom I have exchanged more messages than that, and even if I had chosen this method, I wasn’t sure how to get all the data for that person. Besides, there was a lot of manual labor there. Then I thought, let’s download my Facebook data archive; they say that my chat data will be included. But, as expected, the chats weren’t complete. Facebook is quite restrictive about getting your own data out of Facebook. Their graph API is restrictive as well. I have never seen any Facebook dataset in the public domain. It kind of makes sense for Facebook, as user data is the fuel they are running on. But, as a user (or, more like, as a data nerd), this brings some issues. Since I couldn’t get my data out through the proper channels, I had to look at the requests the browser was making and mimic the same requests with my code. It takes some time to get the parameters right. Data scraping mostly deals with this concept - figure out the requests your browser is making to show you the data you want scraped, and then mimic those requests with code, giving the server the impression that you are the one making the requests. Thus, I had huge (not in size of file, but in the length of data) JSONs which were converted into the respective CSVs. I even scraped the data of group chats.</p>

<p><img src="https://trigonaminima.github.io/assets/2016-06/mychats.jpeg" alt="My Chats" /></p>

<p>The above plot shows the messages I sent to any of my friends on WhatsApp or FB. The year 2012 (mid 2012, to be more precise) was the period I started being a bit more active on Facebook than just having an account. Interestingly, that’s also the year I started college. The blank stretch each day comes from sleeping time and college time, except weekends of course. That gap in January 2016, covering almost all of January, was when I went for a holiday, and after returning I got busy with my internship.</p>

<p>Man, I thought this plot would look great, but it looks okay-ish. I am all over the place. My inconsistent sleeping pattern, with the majority of activity during the late nights, even stretching to 5 or 6 o’clock in the morning, pretty much shows that I am a night owl. I am trying to change that though. I try to sleep early every day, but then I remember something that needs to be done (like writing this post at 3:40 am while listening to <a href="https://www.youtube.com/watch?v=SqChTn4PNuA">Human Qualities</a> by Explosions in the Sky).</p>

<p><img src="https://trigonaminima.github.io/assets/2016-06/daily.jpeg" alt="Daily Messages" /></p>

<p>After joining college, I talked pretty much only to someone who is a very good friend of mine now. A very knowledgeable dude with whom I had interesting conversations. The messages in the year 2012 are pretty much all with him. At the end of 2012, I picked up the pace and made some new friends. By 2013, I was very busy talking. After that, I went through something rough and I dropped quite low, and then picked up the pace again. Since then, apart from some sudden high spikes, I have been pretty much consistent in talking.</p>

<p>All in all, out of a whopping 90000 messages sent by me, I was most chatty in the years 2013 (24500 messages) and 2015 (26000 messages). The number of messages I sent dropped by almost 30% from 2013 to 2014. Damn! I didn’t realize that rough episode was this rough. This was a new insight. And the year 2016 is going pretty well relative to the previous years. I am talking more this year. Interesting. Talks about the college major project, internship and seniors add up quickly.</p>

<p><img src="https://trigonaminima.github.io/assets/2016-06/friends.jpeg" alt="Monthly unique friends" /></p>

<p>My talking to just one friend during the latter half of 2012 kind of shows in this plot, with the monthly unique friend average reaching almost 2 except the last month or so. In 2013, I made many friends, with the average reaching more than 6. Naturally, I was going to converse with this many friends, which shows in the previous plot. 2013 was a good year. Then that rough thing happened and, man, I reduced the talking. In the most chatty year, 2015, that peak at around September was interesting to find out about. That was when the college placement season started. That month was quite active. I was a placement coordinator. I was confused about my career choices. There were other things on my mind too. Consequently, during that month, I talked to a total of 47 different friends, college seniors and mentors, reaching an all time high monthly average of more than 9 people.</p>

<p><img src="https://trigonaminima.github.io/assets/2016-06/monthly.jpeg" alt="Incoming and outgoing" /></p>

<p>I always knew that I don’t write much in replies. I usually tend to favor one word replies wherever I can. The plot above shows the truth in that, but it’s also a bit misleading. My data also includes Facebook group chats. Clearly, in that case, the incoming messages are more than what’s going out. But the above plot generally shows my preference to send fewer texts, and it also clearly shows my messages increasing along with the incoming messages, which, I think, is obvious. This also shows that I don’t ignore the texts of others. Just kidding, this plot barely shows that. There are also some interesting points in the plot where the average incoming and outgoing messages were equal. Well, I’ll look at that some other time.</p>

<p>There are a lot of things that can be done with this kind of data. I guess I have just scratched the surface. So, that good friend I talked to during most of 2012: we have the largest number of messages, totaling around 50,000 (incoming and outgoing). When I was calculating the average messages per friend over the days we have talked, the “score” for that friend came out to be pretty low. And the score with someone I have been talking to a lot lately (who do you think that someone is?) came out to be quite high. I found this quite interesting and quite obvious too. If the messages are spread out over a larger range of dates then the score will be lower, and vice versa. Of course, the number of messages has to be a bit significant in both cases. This “score” is kind of a representative of the “density” of messages over the dates. It’d be fascinating to make some visualization out of it to see whose “density” is changing with time, or the density throughout the history of chats, and further trends I can’t think of right now. I guess that’s work for some other time.</p>
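
<p>For the curious, the “density” score can be sketched in a few lines of plain Python. This is only one possible reading of it, assuming the score is the number of messages divided by the number of days between the first and the last message; the function name <code class="language-plaintext highlighter-rouge">message_density</code> and the dates are made up for illustration and are not taken from the actual analysis code.</p>

```python
from datetime import date


def message_density(days):
    """Messages per day over the inclusive span from first to last message."""
    span_days = (max(days) - min(days)).days + 1
    return len(days) / span_days


# Hypothetical chats: one date per message, same message count in both.
spread_out = [date(2012, 7, 1), date(2013, 7, 1), date(2014, 7, 1), date(2015, 7, 1)]
recent_burst = [date(2016, 5, 1), date(2016, 5, 1), date(2016, 5, 2), date(2016, 5, 2)]

spread_out_score = message_density(spread_out)
recent_burst_score = message_density(recent_burst)
print(spread_out_score, recent_burst_score)
```

<p>With the same number of messages, the chat spread over three years gets a far lower score than the two-day burst, which matches the behaviour described above.</p>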

<p>As a part of the Quantified Self, I also log my sleep times. It’d be amusing to look at how this data aligns with my sleep. In general, I think, after ending the last talk at night, I am up for 2-3 hours more. It’ll be really interesting to find the actual number. Another post of mine, about sleep times, is due. I will definitely talk about this then. My present priority is to make the dashboard mentioned at the start of the post. While working on this post, I also learned d3js. I was planning to make an interactive plot and add it here, but there was a lot of personal data, so I decided against it. Instead, I am going to use d3 in my dashboard. I hope I get time to work on the dashboard.</p>

<p>PS: Code for the above analysis and scripts to get the data can be found here - <a href="https://github.com/TrigonaMinima/Chats">Chats</a></p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><category term="Data-Analysis" /><summary type="html"><![CDATA[So, my Gamification of Life, has grown into a much bigger project than I thought it was. Now, the gamification part is dropping off and the concept of personal analytics of my life is gaining. Basically, I want to collect data about myself and analyse it to be kinda self-aware. At the end, make a dashboard that shows me the basic stats and maybe help find patterns where I can improve myself. This post will talk about the chats I have been having for 4-5 years on various IMs mainly, WhatsApp and Facebook.]]></summary></entry><entry><title type="html">[Mini] Idea Debt</title><link href="https://trigonaminima.github.io/2016/03/idea-debt/" rel="alternate" type="text/html" title="[Mini] Idea Debt" /><published>2016-03-05T00:00:00+00:00</published><updated>2016-03-05T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2016/03/idea-debt</id><content type="html" xml:base="https://trigonaminima.github.io/2016/03/idea-debt/"><![CDATA[<p>I came across this <a href="http://jessicaabel.com/2016/01/27/idea-debt/">blog post</a> on idea debts. Many things it talked about, were also happening with me.</p>

<p>I have a habit of jotting every “important” thought that comes to my mind in the Google Keep. Consequently, I also have a note listing ideas or things that I think, I should do, to learn and grow. Things not like travelling or music or something, but things like, coding this <em>amazing</em> module or to work on raspie and build this new thing which looks so fucking cool in my mind.</p>

<p>But, over time, I realized that I have a long list and I rarely tick off any item. From time to time, I kept dreaming of the awesomeness those ideas would bring to my life once they were complete. But, as the article suggests, it was idea debt I was accumulating. So, I went on pruning my list. At some level, I think the remaining ones will have the same fate, but I don’t want to part with them yet. And, because I have now reduced the list down to a few items, I think I’ll be slightly less overwhelmed and will definitely work on something.</p>

<p>Anyway, it did help me to get away with some ideas. If anyone is reading it (yes, future me, you too!!), you should go about and prune that list of yours.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[I came across this blog post on idea debts. Many things it talked about, were also happening with me.]]></summary></entry><entry><title type="html">[Mini] Linux hardware Issues</title><link href="https://trigonaminima.github.io/2016/02/linux-hardware-issues/" rel="alternate" type="text/html" title="[Mini] Linux hardware Issues" /><published>2016-02-06T00:00:00+00:00</published><updated>2016-02-06T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2016/02/linux-hardware-issues</id><content type="html" xml:base="https://trigonaminima.github.io/2016/02/linux-hardware-issues/"><![CDATA[<p>OK, so this is my first mini. Idea of mini was brought on by the concept of shorts in the <a href="https://cs50.harvard.edu/">CS50 course</a> and the periodic MINI episodes on the <a href="http://dataskeptic.com/">Data Skeptic</a> podcast.</p>

<p>So, I got a new laptop (HP ProBook 440 G2) this January with the following specs - Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 8GB RAM, 1TB HDD, Realtek wireless drivers. It came with Windows 8.1. My survival depends on Linux. So, I partitioned my HDD and installed <a href="https://elementary.io/">Elementary OS Freya</a>.
Everything worked fine while checking the live USB, but when I started the installed OS (after turning off the secure boot option; it took me 2-3 hours to find out that I had to disable secure boot to get GRUB working), the first thing I encountered was that the logo displayed before the login screen was highly pixelated. I thought maybe it’d be alright at the next boot. So, I logged in and did some more exploring. Changed all the settings to my liking. Connected to the internet and started installing my tools from the terminal.</p>

<p>Now began my troubles. After some time, the network disconnected. I couldn’t even reconnect. I restarted the laptop, saw the pixelated <code class="language-plaintext highlighter-rouge">e</code> again and hit the same network problem. I was furious. I searched for the problem and found some solutions, but none solved mine. I thought maybe the installation was buggy. So, I reinstalled the OS, but the problem persisted.</p>

<p>I went to bug trackers of the eOS and found 2-3 bugs relating to the same problem. But no solution was provided. In this search, luckily, I found the same problem being discussed on <a href="https://voat.co/v/Linux/comments/500838">voat.co</a>, a service similar to Reddit. There I found the solution.</p>

<p>So, I learned something from this. As my laptop was new and everything was working fine on Windows, I automatically thought that not being able to connect to the internet was an error caused by eOS. And, I guess, many thought the same, looking at the bugs on the bug tracker. But this problem goes beyond the OS. The network problem was because my WiFi card was comparatively new, and there’s a common problem with the Linux kernel and Realtek wireless chipsets where the power management is completely broken for them. So, the solution was to patch the kernel for RTL devices. I did that and tested for 2-3 days, and my network is pretty smooth now. To get the instructions <a href="http://elementaryos.stackexchange.com/q/4133/3898">see this</a>.</p>

<p>The first problem, the pixelated logo, kinda got resolved along the way. In an Ubuntu Stack Exchange answer to the second problem, someone had suggested updating the kernel. I was on version 3.19. I updated the kernel, and after the next boot the logo problem was gone. So, my takeaways from this ordeal are:</p>

<ol>
  <li>Don’t always blame the OS for the errors with the kernel/device drivers. (But I’d say, developers could have commented on the bugs about the patch.)</li>
  <li>Always update your kernel on every fresh Linux distro installation.</li>
</ol>]]></content><author><name>Shivam Rana</name></author><category term="Linux" /><summary type="html"><![CDATA[OK, so this is my first mini. Idea of mini was brought on by the concept of shorts in the CS50 course and the periodic MINI episodes on the Data Skeptic podcast.]]></summary></entry><entry><title type="html">Gene Regulatory Networks</title><link href="https://trigonaminima.github.io/2015/12/grn/" rel="alternate" type="text/html" title="Gene Regulatory Networks" /><published>2015-12-31T00:00:00+00:00</published><updated>2015-12-31T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2015/12/grn</id><content type="html" xml:base="https://trigonaminima.github.io/2015/12/grn/"><![CDATA[<p>Here goes the documentation of the work I did during my summer internship. Yeah, I know, this is coming quite, quite late, but hey! Better late than never! All the coding was done in R (<a href="https://github.com/TrigonaMinima/Genetic-Regulatory-Networks/tree/research-work">check out the repo</a>). There will also be another post which will talk about the technical (basically, R code) details.</p>

<h2 id="objective">Objective</h2>
<p>A <a href="https://en.wikipedia.org/wiki/Gene_regulatory_network">Gene Regulatory Network</a> (GRN) is a network of genes and proteins that interact with each other and govern each other’s expression levels. Genes may have a positive or negative effect on other genes, and it is these effects that a GRN shows. <a href="https://en.wikipedia.org/wiki/Microarray">Microarray Data or Gene Expression Data</a> represents the <a href="https://en.wikipedia.org/wiki/Gene_expression">expression</a> level of various genes under certain experimental conditions (under some perturbation, for various patients, or at equal time intervals). The objective was to model GRNs using microarray data.</p>

<h2 id="motivation">Motivation</h2>

<h3 id="so-why-gene-regulatory-networks">So, why Gene Regulatory Networks?</h3>
<p>GRNs have an important role in every process of life (e.g., cell differentiation, metabolism and even sleep). A few months back, I read an AMA by researchers who study how day-to-day sleep behavior is regulated. They use genetic data in their study, and in 2009 they discovered a mutation in a gene that allows a person to feel fully refreshed after 4-5 hours of sleep. Read the whole AMA <a href="https://www.reddit.com/r/science/comments/3kj669/science_ama_series_im_yinghui_fu_i_study_the/?ref=share&amp;ref_source=link">here</a>; you might also find answers to some of your sleep-related questions. Anyway, this is how they describe their approach,<sup>[<a href="https://www.reddit.com/r/science/comments/3kj669/science_ama_series_im_yinghui_fu_i_study_the/cuxtrx8">read the answer to the question</a>]</sup></p>

<blockquote>
  <p>The way we are going after this is by first getting a handle on what the regulatory pathways are for normal sleep regulation. Then getting an understanding of what makes these processes more efficient.</p>
</blockquote>

<p>Thus, you see the application of GRNs. It is important to emphasize that, <strong>the inference of gene regulatory networks is not the final result</strong>, but these networks are supposed to help in solving a number of different biological and biomedical problems, as the AMA also shows. Some ways in which GRNs can help are discussed in the following sections.</p>

<h3 id="causal-map-of-molecular-interactions">Causal map of molecular interactions</h3>
<p>The most common use of GRNs might be to serve as a <strong>map or blueprint</strong> of molecular interactions. Since GRNs represent causal biochemical interactions, a biological hypothesis about molecular interactions can be derived from these networks and tested in wet lab experiments. An important aspect of this causality is that GRNs represent statistically significant predictions of molecular interactions obtained from large-scale data. Given the very large number of potential interactions (between ~20,000 genes in humans), GRNs are of tremendous help in narrowing these numbers down to the potential interactions for which statistical support is available.</p>

<h3 id="comparative-network-analysis">Comparative network analysis</h3>
<p>As more and more gene regulatory networks from different physiological and disease conditions become available, it will become possible to compare these networks statistically. This will allow us to learn how interactions change across different physiological or disease conditions and enrich our biological and biomedical understanding of such phenotypes. It might be challenging to determine which similarity or distance measures are suitable for such a comparative network analysis, and different types of networks, as well as different biological questions, may require different approaches.</p>

<p>However, in order for this approach to succeed it will be necessary to establish databases, similar to sequence or protein structure databases, that provide free access to the inferred gene regulatory networks from different physiological and disease conditions.</p>

<h3 id="network-medicine-and-drug-design">Network medicine and drug design</h3>
<p>To establish a network medicine that is useful for clinicians, it will be necessary to integrate different types of gene networks with each other, because each network type carries information about particular molecular aspects. For example, whereas the transcriptional regulatory network contains only information about the regulation of gene expression, protein interaction networks represent information about protein-protein complexes. Taken together, an integration of the various important molecular interaction types gives a comprehensive overview of regulatory programs and organizational architectures. Information about temporal (time-varying) changes in the network structure is also important for understanding immune response, infection and differentiation processes.</p>

<p>Gene networks are also indispensable for a more efficient, rational design of drugs. They would allow us to create, e.g., a connectivity map that is based on the similarity of molecular interaction networks rather than on the mere similarity of expression profiles.</p>

<h2 id="introduction">Introduction</h2>

<h3 id="what-is-a-grn">What is a GRN?</h3>
<p>The genes, regulators, and the regulatory connections between them, together with an interpretation scheme, form a gene network. <strong>Regulators</strong> are proteins, RNAs and other metabolites which can regulate the genes, that is, raise or lower their <a href="https://en.wikipedia.org/wiki/Gene_expression">expression levels</a>. In general, each <a href="https://en.wikipedia.org/wiki/Messenger_RNA">messenger Ribonucleic Acid</a> (mRNA) molecule makes a protein (or set of proteins). Some proteins serve only to activate other genes, and these are called <a href="https://en.wikipedia.org/wiki/Transcription_factor"><strong>transcription factors</strong></a> (regulators), the main players in regulatory networks. Each gene has a region called the <a href="https://en.wikipedia.org/wiki/Cis-regulatory_module">cis-region</a>, where a regulator binds and turns the gene on or off, initiating the production of another protein, and so on<sup>[<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207011/">1</a>]</sup>.</p>

<p>A GRN represents the functionally related genes, that is, genes which are causally linked and not just correlated. GRN models can span from genetic interaction maps to physical interaction graphs to models of network dynamics and gene expression kinetics.</p>

<p><img src="https://trigonaminima.github.io/assets/regulation.JPG" alt="A Genetic Regulatory Network" />
<!-- ![A Genetic Regulatory Network](http://127.0.0.1:4000/assets/regulation.JPG) --></p>
<figcaption><strong>Fig 1: A Genetic Regulatory Network</strong>, Gene 1 produces mRNA 1, which produces Protein 1. Protein 1 binds to the <em>cis-region</em> of Gene 2 and inhibits it. Gene 2 in turn produces Protein 2 through mRNA 2, which binds to another protein to make a protein complex. This protein complex binds to Gene 3 and promotes it to produce mRNA 3, which makes Protein 3. Finally, Protein 1 and Protein 3 bind to Gene 1 and promote and inhibit it, respectively. This completes the cycle. Thus, here we have 2 feedback loops. Note that a GRN need not consist only of feedback loops. This example shows just a very small part of a GRN. A GRN usually consists of thousands of genes.</figcaption>
<p><br /></p>

<h3 id="what-is-microarray-data">What is microarray data?</h3>
<p><a href="https://en.wikipedia.org/wiki/Microarray">Microarray Data</a> or Gene Expression Data represents expression of genes in a particular tissue of the body under certain experimental conditions. A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously. Each DNA spot contains picomoles of a specific DNA sequence, known as probes. These can be a short section of a gene or other DNA element that are used to hybridize a gene sample (called target). Probe-target hybridization is usually detected and quantified by detection of fluorophore, silver, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. Microarray data can be used to model the GRNs.</p>

<p><img src="https://trigonaminima.github.io/assets/microarray.png" alt="Microarray Data" />
<!-- ![Microarray Data](http://127.0.0.1:4000/assets/microarray.png) --></p>
<figcaption><strong>Fig 2: A Sample Microarray Data</strong>, The selected column is the gene column. Each string ending with <em>_at</em> is a gene. Other columns represent experiments with each decimal value pertaining to the log2 transformed expression value of that gene in that particular experiment. There are 150 other columns (experiments) that couldn't be shown in the snapshot.</figcaption>
<p><br /></p>

<h3 id="different-methods-for-modeling-grns">Different methods for modeling GRNs</h3>
<p>There are a variety of modeling techniques that can be used for representing GRNs<sup>[<a href="http://www.ncbi.nlm.nih.gov/pubmed/24630831">2</a>]</sup>. Some are summarized below.</p>

<h4 id="graph-theoretical-models">Graph Theoretical Models</h4>
<p>A Graph Theoretical Model (GTM) describes the topology/architecture of a gene network. It describes which genes are related and, possibly, the nature of each relationship. GTMs are particularly useful for knowledge representation.</p>

<p>Gene networks are represented by a graph structure, \(G(V, E)\), where \(V = \{1, 2, 3, \ldots, n\}\) represents the gene regulatory elements (genes, proteins) and \(E \subseteq \{(i, j) \mid i, j \in V\}\) represents interactions between them (activation, inhibition, causality, binding specificity). The edges can be directed, indicating that one node is the precursor to the other, or weighted, indicating the strength of the interaction. Both the nodes and the edges can be labeled with the function or nature of the relationship (activator, activation, inhibitor, inhibition, etc). <a href="https://en.wikipedia.org/wiki/Graph_theory">Graph theory</a> is pretty much the mathematical concept used here.</p>
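
<p>As a rough illustration of this representation, here is a minimal base R sketch that stores a small signed, directed network as an adjacency matrix. The three gene names and the edge signs are made up for the example and are not taken from the study.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical 3-gene network; names and edge signs are illustrative only.
genes &lt;- c("g1", "g2", "g3")

# adj[i, j] = +1 if gene i activates gene j, -1 if it inhibits it, 0 if no edge.
adj &lt;- matrix(0, nrow = 3, ncol = 3, dimnames = list(genes, genes))
adj["g1", "g2"] &lt;- -1   # g1 inhibits g2
adj["g2", "g3"] &lt;-  1   # g2 activates g3
adj["g3", "g1"] &lt;- -1   # g3 inhibits g1

# The edge set E as (from, to, sign) rows.
idx &lt;- which(adj != 0, arr.ind = TRUE)
edges &lt;- data.frame(from = genes[idx[, "row"]],
                    to   = genes[idx[, "col"]],
                    sign = adj[idx])
print(edges)

# Out-degree = number of targets, in-degree = number of regulators.
out_degree &lt;- rowSums(adj != 0)
in_degree  &lt;- colSums(adj != 0)
</code></pre></figure>

<p>A matrix is convenient for small networks. For thousands of genes, an edge list like <code class="language-plaintext highlighter-rouge">edges</code> is usually the more economical form, since most gene pairs have no edge.</p>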

<h4 id="bayesian-networks">Bayesian Networks</h4>
<p>A <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian network</a> is an annotated <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>, where the nodes represent random variables (genes in our case) and the edges indicate the <a href="https://en.wikipedia.org/wiki/Conditional_probability">conditional dependencies</a> between the nodes (genes). Each node is associated with a probability function that takes, as input, a particular set of values for the node’s parent variables (parent genes’ expression values), and gives the probability (or probability distribution, if applicable) of the gene represented by the node. The technique is based on the assumption that, given its parents, each gene is independent of its non-descendants. Thus, a Bayesian network uniquely specifies a decomposition of the joint distribution over all variables down to the conditional distributions of the nodes <sup>[<a href="http://arxiv.org/pdf/1302.6815.pdf">pdf</a>][<a href="http://www.mrc-lmb.cam.ac.uk/genomes/madanm/blang/methods/LinkedDocuments/Zou_2005_Bioinfo.pdf">pdf</a>]</sup>. The figure below (taken from the Wikipedia article) explains the basic concept behind a Bayesian network.</p>

<p><img src="https://trigonaminima.github.io/assets/bn.png" alt="A simple Bayesian network" />
<!-- ![A simple Bayesian network](http://127.0.0.1:4000/assets/bn.png) --></p>
<figcaption><strong>Fig 3: A simple Bayesian network</strong>, Suppose there are two events which could cause the grass to be wet: either the sprinkler is on or it's raining. Also, suppose that the rain has a direct effect on the use of the sprinkler, namely that when it rains, the sprinkler is usually not turned on. Then the situation can be modeled with a Bayesian network. All three variables have two possible values, T (for true) and F (for false). Using the model one can answer questions like <em>What is the probability that it is raining, given the grass is wet?</em>. For more details you can check out the Wikipedia article. [Figure taken from Wikipedia.]</figcaption>
<p><br /></p>
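
<p>The question in the caption can be worked out by hand from the factorization \(P(G, S, R) = P(G \mid S, R)\,P(S \mid R)\,P(R)\). The base R sketch below does exactly that. The conditional probability values are the ones I recall from the Wikipedia example, so treat them as assumptions rather than as something measured here.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">states &lt;- c("yes", "no")

# Assumed conditional probability tables (recalled from the Wikipedia example).
p_rain      &lt;- c(yes = 0.2, no = 0.8)     # P(rain)
p_sprinkler &lt;- c(yes = 0.01, no = 0.4)    # P(sprinkler on | rain = yes / no)
p_wet &lt;- matrix(c(0.99, 0.90,             # P(grass wet | sprinkler, rain)
                  0.80, 0.00),
                nrow = 2, byrow = TRUE,
                dimnames = list(sprinkler = states, rain = states))

# P(grass wet, rain = r), summing out the sprinkler state.
wet_and_rain &lt;- sapply(states, function(r) {
  p_s &lt;- c(yes = p_sprinkler[[r]], no = 1 - p_sprinkler[[r]])
  p_rain[[r]] * sum(p_s * p_wet[, r])
})

# Bayes' rule: P(rain | grass wet).
p_rain_given_wet &lt;- wet_and_rain[["yes"]] / sum(wet_and_rain)
print(round(p_rain_given_wet, 4))   # about 0.3577 with the values above
</code></pre></figure>

<p>With genes in place of weather, the same computation answers questions like “how likely is it that a parent gene is highly expressed, given that its target is?”.</p>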

<h4 id="boolean-networks">Boolean Networks</h4>
<p>A <a href="https://en.wikipedia.org/wiki/Boolean_network">Boolean network</a> is a <a href="https://en.wikipedia.org/wiki/Directed_graph">directed graph</a>, where the nodes are boolean variables (genes) with an associated boolean function. Each gene is represented by a node, and the state of each node is determined by the boolean function associated with that gene. Assuming an ideal situation, each node has two states: on or off (1 or 0). At any given time, the states (values) of all nodes represent the state of the network. The transitions of all the nodes together correspond to a state transition of the network from \(S(t)\) to the new network state, \(S(t + 1)\). Synchrony is another assumption of boolean networks: all nodes update at the same time. Thus, the whole network moves from state \(S(t)\) to \(S(t+1)\) as time goes from \(t\) to \(t+1\). A series of state transitions is called a trajectory<sup>[<a href="http://web.cs.ucdavis.edu/~filkov/papers/chapter.pdf">pdf</a>]</sup>.</p>

<p><img src="https://trigonaminima.github.io/assets/BN2.jpg" style="display: block;margin-left: auto;margin-right: auto;" />
<!-- <img src="http://127.0.0.1:4000/assets/BN2.jpg" style="display: block;margin-left: auto;margin-right: auto;"> --></p>
<figcaption><strong>Fig 4: A made-up Boolean Network with 3 nodes.</strong> Arrow means the gene is promoting the one to which the arrow points. The connection from V3 to V1 (figure (a)) represents a repressive connection. Figure (b) shows the function associated with each gene or node. Figure (c) shows the truth tables of the associated functions. Figure (d) shows the trajectories followed depending on the initial network state.</figcaption>
<p><br /></p>
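
<p>To make the synchronous update concrete, here is a small base R sketch that follows a trajectory of a 3-node network. The update rules are hypothetical and chosen for simplicity. They are not the functions shown in Fig 4.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical update rules, for illustration only (not those of Fig 4).
# Every node reads the state at time t, so the update is synchronous.
step &lt;- function(s) {
  c(V1 = !s[["V3"]],   # V3 represses V1
    V2 =  s[["V1"]],   # V1 promotes V2
    V3 =  s[["V2"]])   # V2 promotes V3
}

# Follow the network for n_steps transitions starting from state s0.
trajectory &lt;- function(s0, n_steps) {
  states &lt;- matrix(NA, nrow = n_steps + 1, ncol = length(s0),
                   dimnames = list(paste0("t", 0:n_steps), names(s0)))
  states[1, ] &lt;- s0
  for (t in seq_len(n_steps)) {
    states[t + 1, ] &lt;- step(states[t, ])
  }
  states
}

traj &lt;- trajectory(c(V1 = FALSE, V2 = FALSE, V3 = FALSE), n_steps = 6)
print(traj * 1L)   # show states as 0/1
</code></pre></figure>

<p>With these particular rules the network returns to its initial state after 6 steps, so the trajectory is a cycle. Other rules or initial states can instead end in a fixed point.</p>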

<h3 id="pre-processing">Pre-processing</h3>
<p>Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data must be checked before running any analysis. <a href="https://en.wikipedia.org/wiki/Data_pre-processing">Data pre-processing</a> includes <a href="https://en.wikipedia.org/wiki/Data_quality">quality</a> control, <a href="https://en.wikipedia.org/wiki/Data_cleaning">cleaning</a>, <a href="https://en.wikipedia.org/wiki/Data_normalization">normalization</a>, <a href="https://en.wikipedia.org/wiki/Feature_extraction">feature extraction</a>, etc.</p>

<h4 id="pre-processing-in-this-study">Pre-processing in this study</h4>
<p>There are many <a href="https://en.wikipedia.org/wiki/Microarray_analysis_techniques">microarray analysis techniques</a>, as the wiki article shows. This study used the following:</p>

<ul>
  <li>
    <p><strong>Quality Control</strong> (QC) assessment is a crucial first step in successful data analysis. Before any comparisons can be performed, it is necessary to check that there were no problems with sample processing, and that the arrays are of sufficient quality to be included in a study. Some methods for QC checks are visual inspection of chips, pairwise comparisons and analysis of <a href="https://en.wikipedia.org/wiki/Messenger_RNA#Degradation">RNA degradation</a>. Also, if a data set has been used in a well-received publication, then chances are high that the data set was of acceptable quality.</p>
  </li>
  <li>
    <p><strong>Normalization</strong> is a broad term for methods that are used for removing systematic variations from DNA microarray data. In other words, normalization makes the measurements from different arrays inter-comparable. The methods differ considerably between DNA microarray technologies. This <a href="http://www.bea.ki.se/staff/reimers/Web.Pages/Normalization.Intro.htm">Introduction to normalization approaches</a> covers the topic well. Background correction and Robust Multi-array Average (RMA) are the methods used here for normalizing the microarray data.</p>

    <p><strong>Background correction</strong> is the process of removing the signal due to non-specific binding (mismatched spots) or spatial heterogeneity across the array. One widely used way of achieving this is subtracting the average signal intensity of the area between spots, but other methods exist, as compared in this <a href="http://www.ncbi.nlm.nih.gov/pubmed/17720982#">publication</a>.</p>

    <p>In <strong>RMA</strong>, the raw intensity values are background corrected, log2 transformed and then quantile normalized. The log2 transformation makes the variation of expression values similar across orders of magnitude. <a href="https://en.wikipedia.org/wiki/Quantile_normalization">Quantile normalization</a> forces the intensity distribution of every array to be the same, so that values can be compared meaningfully across arrays.</p>
  </li>
  <li>
    <p><strong>Filtering</strong> excludes some part of the data based on the expression of genes. There are two kinds:</p>

    <p><strong>Unspecific filtering</strong> covers methods for excluding a certain part of the data without any knowledge of the grouping of the samples. It is typically used for excluding uninteresting genes from the dataset. Genes that do not change at all during the experiment, or that are expressed at such a low level that their measurements are unreliable, are usually excluded from further analyses. If the filtering is truly unspecific, then no bias is introduced into the statistical testing, and its results should be valid. If in doubt whether to filter or not, one can always run a statistical test first and apply unspecific filtering afterwards. I did both in this study, that is, filtering before and filtering after running a statistical test. Thus, 2 sets of results were generated.</p>

    <p><strong>Specific filtering</strong> is used in situations where the filtering depends on the known grouping of the samples. For example, in a case-control study genes could be removed from the data using some statistical test or some other method that requires group knowledge.</p>
  </li>
</ul>
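
<p>The log2 transformation, quantile normalization and unspecific filtering steps above can be sketched in a few lines of base R. This is a simplified illustration on a made-up 4 x 3 intensity matrix, not the code used in the study. The function name <code class="language-plaintext highlighter-rouge">quantile_normalise</code> is my own, ties are simply broken by order of appearance, and background correction is skipped.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy intensity matrix: rows = probes, columns = arrays (made-up values).
raw &lt;- matrix(c(5, 2, 3, 4,
                4, 1, 6, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4), paste0("array", 1:3)))

# Give every array (column) the same distribution: the k-th smallest value
# of each column is replaced by the mean of all k-th smallest values.
quantile_normalise &lt;- function(x) {
  ranks  &lt;- apply(x, 2, rank, ties.method = "first")
  sorted &lt;- apply(x, 2, sort)
  target &lt;- unname(rowMeans(sorted))
  out &lt;- apply(ranks, 2, function(r) target[r])
  dimnames(out) &lt;- dimnames(x)
  out
}

# log2 transform first, then quantile normalise, as in the RMA description.
norm &lt;- quantile_normalise(log2(raw))
print(round(norm, 3))

# Unspecific filter: keep the probes that vary most across arrays.
# No sample grouping is used, only the spread of each row.
row_sd   &lt;- apply(norm, 1, sd)
filtered &lt;- norm[row_sd &gt; median(row_sd), , drop = FALSE]
</code></pre></figure>

<p>After normalization every column contains exactly the same set of values, only in a different order. The median cutoff in the filter is an arbitrary choice for the sketch; in practice the threshold depends on the data set.</p>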

<h3 id="statistical-analyses">Statistical Analyses</h3>
<p>Statistical analysis of DNA microarray experiments is still under heavy development. There is no consensus, and there are no strict guidelines or real rules of thumb on when to apply some tests and when never to apply certain others. One of the widely used tools for statistical analysis is <strong>limma</strong>, which implements linear models. One of the assumptions of limma’s method is that the data is normally distributed (otherwise the significance tests give wrong results), but real-world data is not always normally distributed. Nevertheless, the same method is usually used for all genes, and the results are therefore only approximate. One can probably rank the genes according to their p-values, but assuming that the p-values are unbiased in the traditional statistical sense is an illusion.</p>

<h3 id="gene-set-enrichment-analysis-gsea">Gene Set Enrichment Analysis (GSEA)</h3>
<p>GSEA here refers to all methods that statistically test whether the genes in our list of interesting genes are enriched in some pathways or functional categories. Typically these methods employ <strong>hypergeometric test based statistics</strong>. In a hypergeometric experiment we randomly select a sample of size \(n\) (without replacement) from a population of size \(N\). In the population, \(k\) items can be classified as successes and \(N-k\) as failures. The probability of getting exactly \(x\) successes in \(n\) draws from a population of \(N\) items is termed the <strong>hypergeometric probability</strong>, and the probability distribution of the number of successes, treated as a random variable, is called a <strong>hypergeometric distribution</strong><sup>[<a href="http://stattrek.com/probability-distributions/hypergeometric.aspx">1</a>, <a href="http://mathworld.wolfram.com/HypergeometricDistribution.html#">2</a>]</sup>, given as follows,</p>

\[P(X=x) = f(x; N, n, k) = \frac{\binom{k}{x} \binom{N-k}{n-x}}{\binom{N}{n}}\]

<p>For example: We have 36 balls (\(N\)), of which 6 are <em>good</em> (\(k\)) and 30 are <em>bad</em> (\(N-k\)). The probabilities of getting \(i\) <em>good</em> balls out of 6 balls drawn are generated by the following code.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">

</span><span class="n">m</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">6</span><span class="p">;</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">30</span><span class="p">;</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">6</span><span class="p">;</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="p">(</span><span class="n">k</span><span class="m">+1</span><span class="p">)</span><span class="w">

</span><span class="c1"># dhyper calculates the probability values.</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="m">0</span><span class="o">:</span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">dhyper</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">))</span><span class="w">
</span><span class="n">mytable</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="s2">"Good balls"</span><span class="o">=</span><span class="m">0</span><span class="o">:</span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">Probability</span><span class="o">=</span><span class="nf">round</span><span class="p">(</span><span class="n">dhyper</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="m">8</span><span class="p">))</span><span class="w">

</span><span class="n">png</span><span class="p">(</span><span class="s2">"hgeo.png"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">640</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">560</span><span class="p">,</span><span class="w"> </span><span class="n">pointsize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="m">0</span><span class="o">:</span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">annotation_custom</span><span class="p">(</span><span class="n">tableGrob</span><span class="p">(</span><span class="n">mytable</span><span class="p">),</span><span class="w"> </span><span class="n">xmin</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">xmax</span><span class="o">=</span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">ymin</span><span class="o">=</span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="o">=</span><span class="m">0.4</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Good balls obtained"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">ylab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Hypergeometric Distribution"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span>
</code></pre></figure>

<p><img src="https://trigonaminima.github.io/assets/hgeo.png" style="display: block;margin-left: auto;margin-right: auto;" />
<!-- <img src="http://127.0.0.1:4000/assets/hgeo.png" style="display: block;margin-left: auto;margin-right: auto;"> --></p>
<figcaption><strong>Fig 5: Hypergeometric Distribution.</strong> The table shows the probabilities of the hypergeometric random variable.</figcaption>
<p><br /></p>

<p>In statistics, the <strong>hypergeometric test</strong> uses the hypergeometric distribution to calculate the statistical significance of having drawn \(x\) or more successes (out of \(n\) total draws) from the population. The test is often used to identify which sub-populations are over- or under-represented in a sample.</p>
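
<p>For enrichment, the quantity of interest is the upper tail \(P(X \ge x)\), which base R provides through <code class="language-plaintext highlighter-rouge">phyper</code>. The sketch below uses made-up counts (a 20,000-gene population, a 200-gene pathway, 500 interesting genes, 15 of them in the pathway) purely for illustration. Note that R’s argument names differ from the symbols in the formula: in <code class="language-plaintext highlighter-rouge">dhyper(x, m, n, k)</code>, <code class="language-plaintext highlighter-rouge">m</code> is the number of successes in the population (our \(k\)), <code class="language-plaintext highlighter-rouge">n</code> is the number of failures (our \(N-k\)) and <code class="language-plaintext highlighter-rouge">k</code> is the number of draws (our \(n\)).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical counts, for illustration only.
N_genes   &lt;- 20000   # population size (N in the formula)
k_pathway &lt;- 200     # genes annotated to the pathway (k in the formula)
n_drawn   &lt;- 500     # genes in our list of interesting genes (n)
x_overlap &lt;- 15      # interesting genes that are in the pathway (x)

# P(X &gt;= x): probability of an overlap at least this large by chance.
# phyper(q, ..., lower.tail = FALSE) gives P(X &gt; q), hence x - 1.
p_enrich &lt;- phyper(x_overlap - 1,
                   m = k_pathway,
                   n = N_genes - k_pathway,
                   k = n_drawn,
                   lower.tail = FALSE)
print(p_enrich)

# Same idea for the ball example: probability of drawing at least
# 2 good balls when 6 balls are drawn from 6 good and 30 bad ones.
p_two_or_more &lt;- phyper(1, m = 6, n = 30, k = 6, lower.tail = FALSE)
</code></pre></figure>

<p>A small <code class="language-plaintext highlighter-rouge">p_enrich</code> suggests the pathway is over-represented in the list. When many pathways are tested at once, the p-values also need a multiple-testing correction, e.g. with <code class="language-plaintext highlighter-rouge">p.adjust</code>.</p>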

<h4 id="go-categories">GO categories</h4>
<p>Gene ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to:</p>

<ol>
  <li>Maintain and develop its controlled vocabulary of gene and gene product attributes.</li>
  <li>Annotate genes and gene products, and assimilate and disseminate annotation data.</li>
  <li>Provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis.</li>
</ol>

<p>An ontology is a representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things. The ontology covers three domains:</p>

<ul>
  <li><strong>Cellular component</strong>, the parts of a cell or its extracellular environment.</li>
  <li><strong>Molecular function</strong>, the elemental activities of a gene product at the molecular level, such as binding or catalysis.</li>
  <li><strong>Biological process</strong>, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.</li>
</ul>

<p>Each GO term within the ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and a namespace indicating the domain to which it belongs. The GO ontology is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-neutral. The GO ontology file is freely available from the <a href="http://amigo.geneontology.org/">GO website</a>.</p>

<h4 id="kegg-pathways">KEGG pathways</h4>
<p>KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. The KEGG database project was initiated in 1995 by Minoru Kanehisa, Professor at the Institute for Chemical Research, Kyoto University, under the then ongoing Japanese Human Genome Program.</p>

<p>It is a collection of manually drawn KEGG pathway maps representing experimental knowledge on metabolism and various other functions of the cell and the organism. Each pathway map contains a network of molecular interactions and reactions and is designed to link genes in the genome to gene products (mostly proteins) in the pathway. This has enabled the analysis called KEGG pathway mapping, whereby the gene content in the genome is compared with the KEGG PATHWAY database to examine which pathways and associated functions are likely to be encoded in the genome.</p>

<h3 id="clustering">Clustering</h3>

<h4 id="heatmap-or-hierarchical-clustering">Heatmap or Hierarchical Clustering</h4>
<p>A heatmap presents hierarchical clustering of both genes and arrays, and additionally displays the expression patterns, all in the same visualization. In order to decide which clusters should be combined (for agglomerative clustering) or where a cluster should be split (for divisive clustering), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations, using distance functions like <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>, <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a> or <a href="https://en.wikipedia.org/wiki/Minkowski_distance">Minkowski distance</a>), and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. Some commonly used linkage criteria are complete-linkage, single-linkage and average-linkage.</p>
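<p>As a rough illustration of the two ingredients named above (a distance metric and a linkage criterion), here is a small Python sketch. It is not the code used for this work, which was done in R; the function names and the two toy clusters are my own assumptions.</p>

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Manhattan distance is the Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def linkage(set_a, set_b, dist, how):
    """Dissimilarity between two sets, built from all pairwise distances."""
    pairwise = [dist(a, b) for a in set_a for b in set_b]
    if how == "single":      # closest pair
        return min(pairwise)
    if how == "complete":    # farthest pair
        return max(pairwise)
    if how == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + how)


# Toy observations, made up for the example.
cluster_1 = [(0.0, 0.0), (0.0, 1.0)]
cluster_2 = [(3.0, 4.0), (6.0, 8.0)]

for how in ("single", "complete", "average"):
    print(how, linkage(cluster_1, cluster_2, euclidean, how))
```

<p>Swapping <code>euclidean</code> for <code>manhattan</code> changes the pairwise distances but not the linkage logic, which is why the two choices can be made independently.</p>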

<p>The hierarchical clusters in a heatmap are visualized with the help of a <a href="https://en.wikipedia.org/wiki/Dendrogram">dendrogram</a>. There are many different <strong>color schemes</strong> that can be used to illustrate the heatmap, with perceptual advantages and disadvantages for each. Rainbow colormaps are often used, as humans can perceive more shades of color than they can of gray, and this would purportedly increase the amount of detail perceivable in the image. However, this is discouraged by many in the scientific community. The usual coloring scheme for microarray data in heatmaps is to present down-regulated genes with green, and up-regulated genes with red.</p>

<p><img src="https://trigonaminima.github.io/assets/a_heatmap_s.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>
<figcaption><strong>Fig 6: A Heatmap.</strong> On the x-axis, there are 121 samples clustered hierarchically for 3 genes (y-axis). At the top and left side of the heatmap are the dendrograms, showing the hierarchy under which the data were clustered, sample-wise and gene-wise respectively.</figcaption>
<p><br /></p>

<h4 id="k-means-clustering">k-Means Clustering</h4>
<p>k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.</p>
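<p>The usual heuristic alternates two steps until nothing changes: assign every observation to the nearest mean, then recompute each mean from its members. Below is a minimal Python sketch of that loop, written for illustration only (the actual work was done in R). Seeding the centers with the first k points and running a fixed number of iterations are simplifying assumptions of mine; real implementations use better seeding and a convergence check.</p>

```python
def kmeans(points, k, iterations=20):
    """Lloyd's heuristic: alternate assignment and mean-update steps."""
    # Naive seeding with the first k points (an assumption for this sketch).
    centers = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


# Toy 2-D observations, made up for the example.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centers, assignment = kmeans(points, k=2)
print(assignment)  # [0, 0, 1, 1, 0, 1]
```

<p>With a different seeding the same data could settle into a different partition, which is the local optimum caveat mentioned above.</p>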

<p>k-means clustering does not produce a tree, but divides the genes or arrays into a number of clusters. In contrast to hierarchical clustering, k-means clustering is feasible even for very large datasets. However, before the analysis, the user has to specify how many clusters should be returned. Unfortunately, there are no good rules of thumb for estimating the starting number of clusters before the analysis. That said, the next post will discuss a technique that gives a good estimate for the number of clusters in our case.</p>

<p><img src="https://trigonaminima.github.io/assets/a_kmeans.png" style="display: block;margin-left: auto;margin-right: auto;" /></p>
<figcaption><strong>Fig 7: Clusters obtained after k-means.</strong> The visualization of k-means that you might have seen might be different from the one shown above. In the clusters shown, each gene is represented by a line on the graph with its expression changing across samples.</figcaption>
<p><br />
<br /></p>

<p>The introductory description above of Genetic Regulatory Networks and the concepts surrounding them was necessary for the next post about the work. I am yet to build a proper GRN, but I am hoping to generate one by next summer. Let’s see how it goes.</p>]]></content><author><name>Shivam Rana</name></author><category term="Bio-Informatics" /><summary type="html"><![CDATA[Here goes the documentation of the work I did during my summer internship. Yeah, I know, this is coming quite, quite late, but hey! Better late than never! All the coding was done in R (check out the repo). There will also be another post which will talk about the technical (basically, R code) details.]]></summary></entry><entry><title type="html">Shiny Apps</title><link href="https://trigonaminima.github.io/2015/07/shiny-apps/" rel="alternate" type="text/html" title="Shiny Apps" /><published>2015-07-02T00:00:00+00:00</published><updated>2015-07-02T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2015/07/shiny-apps</id><content type="html" xml:base="https://trigonaminima.github.io/2015/07/shiny-apps/"><![CDATA[<p>This is a collection of small shiny apps I have made (or am going to make) to learn <a href="http://shiny.rstudio.com/">shiny</a> (by RStudio). The apps in this repo are listed (and documented) on this page. To learn more about the apps (what they do and how they were developed to do it), just read on.</p>

<p>One piece of <em>advice!</em> The following write-ups on the apps are specifically for those who are developing shiny apps. You can find the deployed versions of these apps on <a href="https://www.shinyapps.io/">shinyapps.io</a>, but on the free tier of the platform I can only run a maximum of 5 apps at a time. So, many apps will most probably be sleeping, although you can try your luck. Maybe you’ll find the app running.</p>

<p>Looking at the working examples will be more helpful, but if you want some background knowledge on shiny - how it works, how to deploy your app, how to design your UI in HTML and much more - then have a look at the <a href="http://shiny.rstudio.com/articles/">shiny articles</a>. The Shiny team has prepared pretty good material for learning the platform by yourself.</p>

<p>If you would like to see the code and tinker with it, there are 2 ways.</p>

<ol>
  <li>Clone the repo and run an app locally by following these commands,</li>
</ol>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>git clone https://github.com/TrigonaMinima/shiny_apps
<span class="nv">$ </span><span class="nb">cd </span>shiny_apps/
<span class="nv">$ </span>R</code></pre></figure>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Inside R console</span><span class="w">
</span><span class="c1"># install.packages("shiny")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">
</span><span class="c1"># Appname is the name of directory present inside the repository thus,</span><span class="w">
</span><span class="c1"># to run "wordcloud" run this</span><span class="w">
</span><span class="n">runApp</span><span class="p">(</span><span class="s2">"wordcloud"</span><span class="p">)</span></code></pre></figure>

<ol start="2">
  <li>To make your life (and mine) simple, I have hosted each app as a GitHub gist. And shiny has been kind enough to provide a way to run an app directly from a gist.</li>
</ol>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Inside R console</span><span class="w">
</span><span class="c1"># install.packages("shiny")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">
</span><span class="n">runGist</span><span class="p">(</span><span class="s2">"&lt;gist_number&gt;"</span><span class="p">)</span><span class="w">
</span><span class="c1"># gist_number will be given by me of course on the individual app page.</span><span class="w">
</span><span class="c1"># The above instruction will launch the app in your browser.</span></code></pre></figure>

<p>Okay, enough explaining. Let’s look at the app(s).</p>

<ul>
  <li><a href="https://trigonaminima.github.io/shiny_apps/2014/06/28/wordcloud/">Wordcloud</a></li>
</ul>]]></content><author><name>Shivam Rana</name></author><category term="R" /><summary type="html"><![CDATA[This is a collection of small shiny apps I have made (or am going to make) to learn shiny (by RStudio). The apps in this repo are listed (and documented) on this page. To learn more about the apps (what they do and how they were developed to do it), just read on.]]></summary></entry><entry><title type="html">Reconnaissance on Shadab</title><link href="https://trigonaminima.github.io/2014/11/reconnaisance-on-shadab/" rel="alternate" type="text/html" title="Reconnaissance on Shadab" /><published>2014-11-23T00:00:00+00:00</published><updated>2014-11-23T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/11/reconnaisance-on-shadab</id><content type="html" xml:base="https://trigonaminima.github.io/2014/11/reconnaisance-on-shadab/"><![CDATA[<p>Shadab and I were once discussing how traceable we are on the Internet, so we decided to do this reconnaissance on each other. This post documents how I went about it and how much of him I could trace.</p>

<p>Let me first tell you how much I know about him. His name is <em>Shadab Zafar</em>, and he is pursuing a B.Tech at <a href="http://jmi.ac.in/">Jamia Millia Islamia</a>, same as me. He and I are both part of <a href="http://jmilug.org/cms/">JMILUG</a> (a group in our college promoting open source. Hail Linux!!), under which we host events from time to time. He did a GSoC project with the MetaBrainz Foundation. He loves his laptop very much and has been a coder since around 10th grade, maybe earlier. Here’s his <a href="https://github.com/dufferzafar">Github account</a>; he has coded a lot of things. He is proficient in web development and interested in Cryptography and Security and Hacking-n-stuff. He was Muslim but is now an atheist, thanks to some books I guess. He has read quite a lot of books. He has one sibling five years younger than him. Well, all of this is sounding more like an endorsement of Mr Shadab Zafar. I am feeling weird now. Let’s stop. I will attempt to trace all the things I have said above, and maybe more.
The things I am assuming I know are his name, <strong>Shadab Zafar</strong>, his face and the college where he studies. I will be using the Incognito Mode of the Chrome browser to search for him. This way, I am not signed in to any social media account, which might have helped me find him easily. Of course, Google will be the search engine.</p>

<p>Let’s start with Google Images first. The search string is “Shadab Zafar”. On the second row I found 2 images of him. Awesome.
This is how he looks - <a href="http://gsoc.jmilug.org/gsoc/assets/img/speaker/speaker-4.jpg">first</a>, <a href="https://avatars1.githubusercontent.com/u/1449512?s=460">second</a> (I know, he looks pathetic, but let’s not judge a book by its cover.)
Clicking on ‘Visit page’ on each image sends us to 2 pages,</p>

<ul>
  <li>JMILUG recently hosted an event where the GSoCers came and spoke about their projects. Our Shadab Zafar was one of the speakers. One link took us to this event page.</li>
  <li>Other link directed to his Github profile.</li>
</ul>

<p>Let’s see what we have got here from the above two links:</p>

<ol>
  <li>Name - Shadab Zafar</li>
  <li>GSoC Project - A New Website for MusicBrainz Picard</li>
  <li>Github Handle - <a href="https://github.com/dufferzafar"><strong>dufferzafar</strong></a></li>
  <li>Lives in - India</li>
  <li>Email - dufferzafar0@gmail.com</li>
  <li>Blog - <a href="http://dufferzafar.github.io">dufferzafar.github.io</a></li>
  <li>Github Repositories - <a href="https://github.com/dufferzafar?tab=repositories">here</a>, have a look. I am not gonna type all of ‘em.</li>
</ol>

<p>Now let’s look at his repos and find out his interests in programming languages.
The number of projects in each language is given below,</p>

<table>
  <tbody>
    <tr>
      <td>Python</td>
      <td>-</td>
      <td>8</td>
    </tr>
    <tr>
      <td>AutoHotkey</td>
      <td>-</td>
      <td>5</td>
    </tr>
    <tr>
      <td>Lua</td>
      <td>-</td>
      <td>4</td>
    </tr>
    <tr>
      <td>JS</td>
      <td>-</td>
      <td>4</td>
    </tr>
    <tr>
      <td>CSS</td>
      <td>-</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Java</td>
      <td>-</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>-</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Shell</td>
      <td>-</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>Clearly Python is the winner, but looking at the nature of the Python projects and at the projects in JS and CSS, it can be said that he does a lot of web dev. Also, one of the repos he has contributed to, named ‘ctf-docs’, is an intro to security competitions. Thus, it shows his interest in cryptography and security.</p>

<p>Now, let’s go to his <a href="http://dufferzafar.github.io">blog</a>.</p>

<ol>
  <li>Blog Name - Duffer’s Log</li>
  <li>Blog Bio - Life. Philosophy. Code.</li>
  <li>Last written on - 25th June, 2014</li>
  <li>Total posts written - 24</li>
</ol>

<p>Looking at the first post he wrote this year (2014) entitled, “It’s Been Five Years” we can know some interesting facts about him.</p>

<ul>
  <li>He tried to make a personalized Windows XP (here, is his <a href="http://www.msfn.org/board/topic/146791-sp3-cd-cannot-format-driveshelp/">unattended question</a>)</li>
  <li>Another <a href="http://shadabsofts.wordpress.com/">blog</a> he wrote as a 12th grader - “SHADAB SOFTS. INC.”</li>
</ul>

<p>Now, from this blog we get to know some more about him (written by him, of course, on his <a href="http://shadabsofts.wordpress.com/about/">about page</a>),</p>

<ol>
  <li>Hails from - Haryana, India</li>
  <li>Birthday - 24th August</li>
  <li>Zodiac - Virgo</li>
  <li>Philosophy of life - “Think More, Do More, Expect Less”</li>
  <li>Ambition in life - To become a “Computer Forensics Expert” (told ya, he was interested in security and cryptography and Hacking-n-stuff.)</li>
</ol>

<p>All of the above information came from the image search for the string “Shadab Zafar”.
Now, we have one more search string - dufferzafar - which he apparently uses in a lot of things, as visible from his blog URL, Github handle and email. We will find the same pattern in other things like usernames, Twitter handle, etc. too.</p>

<p>Now, let’s see what we get by searching images for dufferzafar.
Here again we get the Github profile pic. Then there is also the home page of his blog, probably from one of his posts. There might be more from his blog but I am not getting into that. There is also an art having</p>

<p>From the above image results a lot of information on Shadab was easily available. I didn’t expect that when I started doing this. That’s enough with image search. Let’s get to web search, starting with the search string “Shadab Zafar”.</p>

<p>So, from the first page itself I get,</p>

<ul>
  <li>Github Profile (again)</li>
  <li>
    <p><a href="https://twitter.com/dufferzafar">Twitter Profile</a>
Here, from twitter we get his current location, which is Delhi and his blog url which we have already seen. We don’t get much useful content, except for his 237 tweets, 23 followers and the list of 142 people he follows.</p>
  </li>
  <li><a href="http://in.linkedin.com/pub/shadab-zafar/95/47/70">LinkedIn Profile</a>
From LinkedIn we get his college, Jamia Millia Islamia.
We also get that he has worked in MetaBrainz Foundation as a student developer under GSoC.
And, we also get JS, CSS and web development from his skills section in the profile.
Since I didn’t sign in, I couldn’t see his whole profile.</li>
</ul>

<p>On the second search page we have his <a href="http://www.quora.com/Shadab-Zafar">Quora profile</a>. There I can see the college he goes to and the <a href="http://qph.is.quoracdn.net/main-thumb-15042265-200-GMZ0kCRWLcGDPOZVwlYwiO0Wuw7LXTyR.jpeg">profile pic</a> on which the text says “Life Runs On Code”. This obviously seems to be the guy we are inspecting.
I also saw this image when I searched images for “Shadab Zafar” and “dufferzafar”. Going back to those images, we get to this same Quora profile and also to his <a href="http://stackoverflow.com/users/2043048/dufferzafar">StackOverflow profile</a>. Here, he has a reputation of 11. He also has accounts on <a href="http://superuser.com/">Super User</a> and <a href="http://english.stackexchange.com/">English Language &amp; Usage</a>, which are also part of <strong>Stack Exchange Inc.</strong> along with StackOverflow.</p>

<p>Searching by image, using this above image in Google search, unfortunately we don’t get anything. Now, it’s time for our final search, that is, searching on Google using the search term “dufferzafar”.
Behold, here we get,</p>

<ul>
  <li>Github Account (and again!!)</li>
  <li>Twitter Profile</li>
  <li>StackOverflow Account</li>
  <li>
    <p><a href="https://bitbucket.org/dufferzafar">BitBucket Profile</a> (another Github like website)
This confirms that his location is India. He has been a member of this website since Jan 2014. We get his blog URL from here too.</p>

    <ul>
      <li>Location - India</li>
      <li>Member Since - Jan 2014</li>
      <li>Blog - <a href="http://dufferzafar.github.io">dufferzafar.github.io</a></li>
    </ul>
  </li>
  <li>
    <p><a href="https://trello.com/dufferzafar">Trello Account</a>
In their own terms, “Trello is the free, flexible, and visual way to organize anything with anyone.”.
We get the same image showing “Life Runs On Code” on this service as his profile pic. This is definitely our guy. He now seems inactive on this site but looking at his past activities I observed that there are 2 boards he was a part of, namely, <strong><a href="https://trello.com/b/dXLn85dG/cryptex">Cryptex</a></strong> and <strong><a href="https://trello.com/b/9Gg9LZFi/projects">Projects</a></strong>.
Cryptex seems to be a hacking competition similar to one he contributed to on Github. Whereas, on projects board we see a few projects listed. Nothing else. Summarizing,</p>
  </li>
  <li>
    <p><a href="http://www.last.fm/user/dufferZafar">Last.fm Profile</a>
Last.fm is a place where you share with the world what type of music you listen to and how frequently. It then recommends more music based on your taste. Well, we are not going to determine his taste here. But a few highlights from the account are,</p>

    <ul>
      <li>Blog - <a href="http://dufferzafar.github.io">dufferzafar.github.io</a></li>
      <li>Age - 21</li>
      <li>Member Since - 2 Apr, 2013</li>
      <li>Total songs played - 16542</li>
      <li>Loved Tracks - 134</li>
      <li>Total Artists in the Library - 844</li>
      <li>Most listened band - Kings of Leon</li>
    </ul>
  </li>
  <li>
    <p><a href="http://forums.musicbrainz.org/profile.php?id=8071">MusicBrainz Forums Account</a>
A forum where he registered on 10 Jul 2014. He has discussed his GSoC project there with the community for whom he did the project.</p>
  </li>
  <li>
    <p><a href="https://keybase.io/dufferzafar">Keybase.io profile</a>
Keybase is a place where you can safely get someone’s public key, starting with just their social media username(s). Mr Zafar has registered here too.</p>
  </li>
  <li>
    <p><a href="http://www.reddit.com/user/dufferZafar">Reddit Account</a>
Everyone knows <a href="http://www.reddit.com/">Reddit</a>, no wonder he is a redditor for 1 year. He has written 6 comments in total having a comment karma of -2. Pathetic. Most of his comments are on the posts of sub-reddit <a href="http://www.reddit.com/r/windowsphone/">r/windowsphone</a>.</p>
  </li>
  <li>
    <p><a href="https://sourcegraph.com/dufferzafar">Sourcegraph profile</a> (yet another Github like website)
There’s nothing done here by him except providing the links to his blog and Github Profile.</p>
  </li>
  <li>
    <p><a href="https://soundcloud.com/dufferzafar">SoundCloud Account</a>
Again, nothing here. Just made an account and abandoned it.</p>
  </li>
  <li>
    <p><a href="https://www.facebook.com/DufferZafar">Facebook Profile</a>
Here, he has shared very little with the public. You can see his name, profile pic, cover photo and favorites. None of this tells us much about him. His profile pic is similar to his profile pic on Quora. And his <a href="https://plus.google.com/104192614328343170021/posts?pid=5873012826062854162&amp;oid=104192614328343170021">cover photo</a> is similar to one I saw when I searched for images of “Shadab Zafar”. Guess what? It takes me to his G+ profile, where he is apparently not that active.</p>
  </li>
  <li>
    <p><a href="https://www.hackthissite.org/user/view/dufferzafar">HackThisSite Account</a>
Hack This Site is a free, safe and legal training ground for hackers to test and expand their hacking skills. Mr Zafar was once an active member of this site. Some stats are,</p>

    <ul>
      <li>Joined on - 14 Oct 2012</li>
      <li>Basic Level - 10</li>
      <li>Realistic Level - 2</li>
      <li>Application Level - 1</li>
      <li>JavaScript Level - 6</li>
    </ul>
  </li>
  <li>
    <p><a href="http://www.oninstagram.com/profile/dufferzafar">Oninstagram profile</a>
This guy has posted a single image on Instagram in his life and you can see that on this above link.</p>
  </li>
  <li>
    <p><a href="http://www.goodreads.com/user/show/18654747-shadab-zafar">Goodreads Profile</a>
I told you that this guy reads a lot of books. Well, the above link will show you how many. Goodreads is a Last.fm-like place but for books: share what you have read and what you are reading. He has rated 58 books with an average rating of 3.81 and has written 5 reviews so far. There are 194 books in his ‘to-read’ list. That’s a lot of books.</p>
  </li>
  <li>Zafar’s <a href="http://www.snip2code.com/Snippet/56871/My-GSoC-2014-Proposal">GSoC 2014 Proposal</a></li>
  <li>
    <p><a href="http://dufferzafar.deviantart.com/">dufferzafar.deviantart.com</a>
The image <a href="http://dufferzafar.deviantart.com/art/FB-Cover-428047329">here</a> also showed up in the image search for ‘dufferzafar’ which linked to the above account. From his account on this site we get some more info,</p>

    <ul>
      <li>location - India</li>
      <li>Blog - <a href="http://dufferzafar.github.io">dufferzafar.github.io</a></li>
      <li>Favorite bands / musical artists - Kings Of Leon</li>
      <li>Favorite books - Harry Potter</li>
      <li>Favorite writers - John Green, J K Rowling</li>
      <li>Favorite games - Age of Empires</li>
    </ul>

    <p>Now, Kings of Leon is his most listened band as shown on Last.fm, and here he has mentioned it as his favorite band. Thus, I think we can confidently say that he likes this band a lot.
  Secondly, Harry Potter was rated highly by him on Goodreads, so that is clearly his most liked book.</p>
  </li>
  <li><a href="https://coderwall.com/dufferzafar">Coderwall Account</a>
A place to show a gist of everything you have ever coded on. Here again, as observed from the Github repos he has coded a lot in Python and JavaScript.</li>
</ul>

<p>Fuck!!
I have reached page 3 on Google search and yet new web services or comment on some forum or a question asked on some forum keeps popping up. This is a cumbersome task. This tediousness was the reason this post was on hold for 4 months.</p>

<p>The fucking result of this whole post is that Mr Zafar, you are pretty traceable. And, this amount of information was gathered when,</p>
<ul>
  <li>I didn’t sign in on any service.</li>
  <li>I didn’t go for his friends on the various services he was present on else, I might have covered more of his web presence.</li>
  <li>Did just a few name searches. A determined person can get hell of a lot more information on him.</li>
</ul>

<p>As for me, I don’t know how traceable I am. I hope Mr Zafar will check the same for me. Although, unlike him, I am not that active on the web, so let’s see how much data he gathers on me.</p>

<p>While writing this post (a bit fun and quite a lot tedious) I got an <strong>idea for a project</strong>. Although, I think it must already have been made by someone. Still, I’ll explain it here. You might have already guessed it.</p>

<p>A tool to find everything about a person on the web. Everything meaning everything,</p>

<ul>
  <li>How many services he is present on?</li>
  <li>When did he join that service?</li>
  <li>On which forums he commented on?</li>
  <li>What type of data he shared?</li>
  <li>With whom he shared?</li>
  <li>Preferences in various things - songs, books, web-services, etc.</li>
  <li>Trying other search engines (Bing, DuckDuckGo)</li>
  <li>I can’t guess more, but, clearly there are many more things still remaining.</li>
</ul>

<p><strong>How can we go about doing this?</strong></p>

<p>Honestly, I don’t know.
If I were to do the same at my present knowledge then, I guess I would go exactly as I did in this post.</p>

<p>I will assume I have one image and the real name to search for. First, images will be searched for the given name, and each resulting image will be compared against the provided image. Thus, tentative results will be gathered. These results might be in the form of a dictionary holding the required data for each service.
There will also be 2 separate lists: the images to search for and the search terms to search for. These lists will keep growing as new services/accounts are discovered. The images list might grow with images uploaded as profile pics, shared on social networking sites, included on the blog, etc. And the ‘search term’ list will grow as new usernames, first names, middle names and last names are gathered over the searches.
Thus, by searching recursively we might be able to get a considerable amount of data on a person, enough to create a sort of timeline of him on the Internet and generate a profile of him.</p>

<p>Now, as you can see, this tool is not much of a use for the general public. It is almost a spying/stalking/recon tool. I can’t think where else it might be used legally. Anyway, this was just an idea. There are a lot of constraints here in making this piece of software.</p>

<p><strong>EDIT</strong>
I have realized that the approach I used to find out about Shadab Zafar is not a general approach. After writing this post, one of my friends, Aditya, made me realize that some factors come into play when finding out about a person.</p>

<ul>
  <li>The person needs to have a good web presence.</li>
  <li>A unique identifier (like dufferzafar in Shadab’s case).</li>
  <li>Real name is not that common.</li>
</ul>

<p>I saw these facts unfold in front of me when searching for Aditya. His name is pretty common, so I got a lot of results with the same name. In the image search I never got a result matching his image (but there were a lot of faces for which image recognition would give false positives).
Moreover, in the web results too, I didn’t get anything on him on the first few pages of Google search. This again defeats my approach of finding out about a person. Hence, the tool fails completely if the person lacks one or two of the things mentioned above.</p>

<p>SR.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[Shadab and I were once discussing how traceable we are on the Internet, so we decided to do this reconnaissance on each other. This post documents how I went about it and how much of him I could trace.]]></summary></entry><entry><title type="html">Gamification of Life</title><link href="https://trigonaminima.github.io/2014/11/gamification-of-life/" rel="alternate" type="text/html" title="Gamification of Life" /><published>2014-11-08T00:00:00+00:00</published><updated>2014-11-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/11/gamification-of-life</id><content type="html" xml:base="https://trigonaminima.github.io/2014/11/gamification-of-life/"><![CDATA[<p>Almost two months ago I stumbled upon <a href="https://productivity.stackexchange.com/questions/2972/gamification-to-improve-myself">this</a> question on <a href="https://productivity.stackexchange.com/">Personal Productivity Stack Exchange</a>. Someone had asked the following,</p>

<blockquote>
  <p>I’m currently working out a way to gamify my daily work / life. I want to reward myself via a points system, assigning points to tasks I should do but tend to neglect.</p>

  <p>These are things like paying invoice in a timely manner and answering important emails but also things I want to do, but never get around to, like working out from time to time.</p>

  <p>As I’m the only “judge” I also work on way to prevent that I game the system, but this isn’t a real problem for me because I tend to be objective and don’t think I’ll cheat myself.</p>

  <p>Have you tried something like this?
   Did it work out?
   Any input?</p>
</blockquote>

<p>I myself was searching for a way to quantify and observe my activities, and this question gave me an almost perfect way to achieve that, much like a Socratic mode of enquiry. It also gave me some fun coding ideas. Although the first answer helped me think, I wanted something that really quantified things in my own way, with my own priorities. Besides, it would get me started on the <a href="https://en.wikipedia.org/wiki/Quantified_Self">Quantified Self</a> concept I was thinking of trying.</p>

<p>So, I tried to think of some ways to achieve this ‘Gamification of my Life’.</p>

<p>At first, I thought of making a small Python script that would enter all the data given to it into an Excel sheet. But then there were some problems,</p>

<ul>
  <li>
    <p>How to give input to the script?
A GUI might have been an answer, but that was too much work. Wow, I just got a solution while writing this point: I can take input from a text file, writing the data to be fed in a defined format. Nice. (It still seems to be a hassle to work with.)</p>
  </li>
  <li>
    <p>After input, I had to think of the points over which I was going to quantify myself.
That is, to decide on the scoring of the activities I do. Since I didn’t know whether my current criteria would be final, I couldn’t start writing this script; it might need major amendments in the future. This point mattered more than the first one.</p>
  </li>
</ul>
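<p>For what it’s worth, the text-file idea could look something like the sketch below. This is only a guess at a workable shape, not a script I actually wrote: the <code>date | category | activity</code> line format, the function names and the point values in <code>POINTS</code> are all hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical per-activity points; the real values are not reproduced here.
POINTS = {
    ("Read", "Research Papers"): 5,
    ("Code", "Small Codes"): 3,
    ("Music", "Violin"): 4,
}


def parse_log(text):
    """Parse 'date | category | activity' lines, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        date, category, activity = [part.strip() for part in line.split("|")]
        entries.append((date, category, activity))
    return entries


def score_by_category(entries):
    """Add up the points per category; unknown activities score 0."""
    totals = defaultdict(int)
    for _date, category, activity in entries:
        totals[category] += POINTS.get((category, activity), 0)
    return dict(totals)


sample = """
# date | category | activity
2014-09-07 | Read | Research Papers
2014-09-07 | Code | Small Codes
2014-09-08 | Code | Small Codes
2014-09-09 | Music | Violin
"""

print(score_by_category(parse_log(sample)))  # {'Read': 5, 'Code': 6, 'Music': 4}
```

<p>Keeping the points in one table means the scoring can change later without touching the parsing, which matters when the criteria are not final yet.</p>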

<p>Therefore, I first decided to explore the criteria/categories for the quantification. A set of categories I already had in mind seemed to encompass every aspect I wanted covered. But after 2-3 weeks I had to extend this set with some more elements. This is the final set, which hasn’t been modified since.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Categories = {Read, Write, Code, Music, Watch, Travel, Other Productive/Worthwhile activities}
</code></pre></div></div>

<p>There were many activities over which I wanted to observe myself. The broad categories under which they will come are written above. Most of the activities are described below.</p>

<p><strong>Watch</strong></p>

<ul>
  <li>Movies</li>
  <li>Shows</li>
  <li>Courses</li>
  <li>TED</li>
  <li>Other Informative Videos (documentaries, debates etc)</li>
  <li>Entertainment videos</li>
</ul>

<p><strong>Read</strong></p>

<ul>
  <li>Novels</li>
  <li>Books other than Novels</li>
  <li>Blog Articles (Pocket, Fb saved links, Quora saved answers, stack exchange etc)</li>
  <li>Research Papers</li>
</ul>

<p><strong>Write</strong></p>

<ul>
  <li>Blog Post</li>
</ul>

<p><strong>Code</strong></p>

<ul>
  <li>Small Codes (Algos, Competitive Programming solutions, short projects)</li>
  <li>Completing A Project</li>
  <li>Maintaining Past Project</li>
  <li>Starting New Project</li>
  <li>Improving other Skills</li>
</ul>

<p><strong>Music</strong></p>

<ul>
  <li>Violin</li>
  <li>Listening</li>
</ul>

<p><strong>Travel (weekly)</strong></p>

<ul>
  <li>Alone to New Place</li>
  <li>Alone to Known Place</li>
  <li>With Friends to New Place</li>
  <li>With Friends to Known Place</li>
</ul>

<p>Now the scores were decided. You can see the score distribution in the image below.</p>

<p><img src="https://trigonaminima.github.io/assets/gamification_score.png" alt="Scoring" /></p>

<p>I know. I know. This was a hilarious score distribution.
Don’t laugh over it. Okay?</p>

<p>Now, from the above scores, some numbers and figures were crunched.</p>

<table>
  <tbody>
    <tr>
      <td>Week Length</td>
      <td>:</td>
      <td>Sunday to Saturday (7 days)</td>
    </tr>
    <tr>
      <td>New Week starts from</td>
      <td>:</td>
      <td>Sunday</td>
    </tr>
    <tr>
      <td>Maximum Score per week</td>
      <td>:</td>
      <td>345*</td>
    </tr>
  </tbody>
</table>

<p>*This high score was calculated assuming that I do every activity shown in the image above.</p>
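<p>Here is a minimal Python sketch of how the text-file input and the weekly total could fit together. Everything specific in it is an assumption made for illustration: the <code class="language-plaintext highlighter-rouge">Category | Activity | count</code> line format, the function names, and above all the point values, which are made up and are not the ones from the image.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from collections import defaultdict

# Hypothetical points per activity; the real values are in the score image.
POINTS = {
    ("Read", "Novels"): 5,
    ("Watch", "Movies"): 2,
    ("Music", "Violin"): 4,
}


def parse_log(text):
    """Parse lines of the assumed form 'Category | Activity | count'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        category, activity, count = [part.strip() for part in line.split("|")]
        entries.append((category, activity, float(count)))
    return entries


def weekly_score(entries, points=POINTS):
    """Return the weekly total and a per-category breakdown."""
    per_category = defaultdict(float)
    for category, activity, count in entries:
        # Activities without a known score contribute nothing.
        per_category[category] += points.get((category, activity), 0) * count
    return sum(per_category.values()), dict(per_category)


sample = """
# week 1
Read  | Novels | 2
Watch | Movies | 3
Music | Violin | 1.5
"""
total, breakdown = weekly_score(parse_log(sample))
print(total, breakdown)
</code></pre></figure>

<p>Keeping the points in one table means a change in scoring only touches that table, which matters since the scores are still changing.</p>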

<p>Thus, I created this score system to help myself track, well... myself. It’s been 10 weeks since I started this <em>self tracking</em>. Some of the scores changed over the period and might change further. It was a somewhat enlightening period. I tracked myself almost completely, keeping a proper record of what I did during the week, whether it was a waste of time, productive or worthwhile.</p>

<p>My weekly score started from a total of <strong>75</strong>, which is the <strong>lowest score</strong> yet. I reached a <strong>maximum</strong> of <strong>186.8</strong>. From the start the score progressed from 75 to 186.8, after which I was unwell for a week.</p>

<p>Now, here’s an interesting thing. In this “unwell week” I had a decent score of 143.4, but the week after it was the second lowest, with a score of 85.6. I don’t know if this was a fluke, but I will watch for this pattern the next time I get unwell.</p>

<p>Now, I have to decide whether I should make the script/GUI of this whole process. Moreover, I would like to get some stats, some conclusion or glossary kind of thing, some prediction system, some recommendation system (to recommend to me something like, what activity should I do which I haven’t done for a long time). So, I will most probably be making a script to do some or all of the things stated. And, it wouldn’t hurt to build a GUI upon it, I suppose.</p>
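<p>The recommendation part could start out very simple. The sketch below assumes I keep a “last done” date per activity (the dates and the function name are made up for illustration) and suggests whatever has been idle the longest.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from datetime import date


def stalest_activities(last_done, today, top=3):
    """Return up to `top` (activity, idle_days) pairs, longest idle first."""
    idle = [(name, (today - day).days) for name, day in last_done.items()]
    idle.sort(key=lambda pair: pair[1], reverse=True)
    return idle[:top]


# Hypothetical log of when each activity was last done.
last_done = {
    "Violin": date(2014, 11, 2),
    "Blog Post": date(2014, 10, 12),
    "Novels": date(2014, 11, 8),
}
print(stalest_activities(last_done, today=date(2014, 11, 10), top=2))
</code></pre></figure>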

<p>Also, along with this fun experiment, I started making a list of movies I watched. I added some movies I had watched earlier along with the ones I watched during these 10 weeks. There are 124 movies as of now, along with some other metadata like the language it was in, subs or not, type (animation, documentary, normal), release year, ratings, country it belongs to (according to production house), etc. This gave rise to some other fun stats and data to play with. I will probably write another post with those analysis results.</p>

<p>SR.</p>]]></content><author><name>Shivam Rana</name></author><category term="Quantified-Self" /><summary type="html"><![CDATA[Almost two months back I stumbled upon this question on Personal Productivity Stack Exchange. Here, someone asked the following question,]]></summary></entry><entry><title type="html">Tourism Improved</title><link href="https://trigonaminima.github.io/2014/09/tourism-improved/" rel="alternate" type="text/html" title="Tourism Improved" /><published>2014-09-11T00:00:00+00:00</published><updated>2014-09-11T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/09/tourism-improved</id><content type="html" xml:base="https://trigonaminima.github.io/2014/09/tourism-improved/"><![CDATA[<h4 id="problem-statement">Problem statement</h4>
<p>India is one of the countries with a diverse cultural heritage and thus a wide variety of tourist places. Some of these are still not contributing to India’s economy to their full potential, unlike the recognized tourist spots. Our tool focuses on improving such tourist places in India, which have the potential to generate very good revenue but are not generating it. Improvements are suggested on the basis of past data from famous tourist places and present data from the tourist place to be improved.</p>

<h4 id="what-data-we-are-using">What data we are using?</h4>

<p>All the data we use is publicly available. The sources are listed here,</p>

<p><strong>1. Governmental Data</strong>
    The government publishes a lot of data in various forms: facts and figures on tourism in India, state-wise data, and the categorization of tourist places (e.g. Historical, Religious, Hill Stations, Adventure). Some of the data sources available to us are,</p>

<ol>
  <li>data.gov.in
     Here we have data available in various formats like PDF, XLSX, CSV etc.</li>
  <li>State Tourism websites (delhitourism.gov.in etc)
     Here we have data in the form of HTML pages which can be fetched using simple Python scripts.</li>
  <li>tourism.gov.in
     Here too, we have data in the form of HTML pages which can be fetched using simple Python scripts.</li>
</ol>
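<p>For the CSV downloads, a few lines of standard-library Python are enough to get state-wise totals. This is only a sketch: the column names <code class="language-plaintext highlighter-rouge">state</code>, <code class="language-plaintext highlighter-rouge">place</code> and <code class="language-plaintext highlighter-rouge">visitors</code> and all the numbers are made up for illustration, and a real file would need its own column mapping.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import csv
import io


def visits_by_state(csv_text):
    """Sum the 'visitors' column per 'state' (both column names are assumed)."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        state = row["state"].strip()
        totals[state] = totals.get(state, 0) + int(row["visitors"])
    return totals


# Made-up sample rows, not real tourism figures.
sample = """state,place,visitors
Delhi,Red Fort,1200
Delhi,Jama Masjid,800
Uttar Pradesh,Taj Mahal,3000
"""
print(visits_by_state(sample))
</code></pre></figure>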

<p><strong>2. <a href="https://developers.google.com/maps/documentation/api-picker">Google Maps API</a></strong>
    To determine nearby places, transit routes, traffic details, location names, and the distances between a tourist place and the places around it, we use the Google Maps API. The data comes back as JSON, which is easy to work with. Here are a few sub-APIs under the Google Maps API,</p>

<ol>
  <li><a href="https://developers.google.com/maps/documentation/javascript/trafficlayer#transit_layer">Transit Layer</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/trafficlayer#bicycling_layer">Bicycle layer</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/trafficlayer">Traffic layer</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/geocoding">Geocoding service</a></li>
  <li><a href="https://developers.google.com/places/documentation/search">Google Places API</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/directions">Directions service</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/places">Google Places library</a></li>
  <li><a href="https://developers.google.com/maps/documentation/javascript/distancematrix">Distance Matrix service</a></li>
  <li><a href="https://developers.google.com/maps/documentation/embed/">Google Maps Embed API</a></li>
  <li><a href="https://developers.google.com/maps/documentation/tracks">Google Maps Tracks API</a></li>
</ol>

<p><strong>3. Google images</strong>
    To estimate the relative aesthetic beauty of a place we will use Google Images, which can tell us how beautiful a tourist place might appear to a tourist.</p>

<p><strong>4. Travel Agent websites</strong>
    These travel websites have a lot of data about the hotels near the tourist place and about the conveyance around it. For example,</p>

<ul>
  <li><a href="http://www.goibibo.com/">Goibibo</a></li>
  <li><a href="http://www.makemytrip.com/">Makemytrip</a></li>
</ul>

<p><strong>5. Restaurants/Food reviewing sites</strong>
    The data for the availability of good reliable restaurants around the tourist place is determined using these reviewing sites along with Google maps API. A famous site in India is <a href="https://www.zomato.com/">Zomato</a> which can be used for the application.</p>

<p><strong>6. Wikipedia</strong>
    Here we have reliable data available for us in the form of HTML pages. These can be easily parsed and looked for the facts and figures. In spite of being written in natural language all the wiki pages are written in the encyclopedic tone for which the tool can be trained to extract some relevant bits of information like nearby places, images, tourism figures etc.</p>

<p>If our application can be fed a proper dataset for the famous tourist places then it’ll do a much better analysis.</p>

<h4 id="how-are-we-solving-the-above-stated-problem">How are we solving the above stated problem?</h4>

<p>To determine what changes a particular tourist place needs, we have divided the improvements into five classes,</p>

<p><strong>1. Category of the tourist place</strong>
    We have taken 5 categories-</p>

<ul>
  <li>Historical (eg. Red Fort, Taj Mahal, etc)</li>
  <li>Religious (eg. Akshardham Temple, Kedar Nath, Jama Masjid, etc)</li>
  <li>Hill Stations (eg. Nainital, Ranikhet, etc)</li>
  <li>Adventure (eg. Gangotri, Gulmarg, etc)</li>
  <li>Cultural (eg. Kanyakumari, Mohali, etc)</li>
</ul>

<p>On the basis of the above five categories we ask the user which category their tourist place lies in. This step decides the dataset which we will use as the base for the comparison.</p>

<p><strong>2. Accommodation</strong>
This class contains the analysis on the basis of the Hotels and Motels around the tourist place. We determine the characteristics of a tourist place by determining the number of Hotels (5 star, 3 star, other than 3 star) and Motels around the place and their reviews/ratings.</p>

<p><strong>3. Conveyance</strong>
This class contains the analytics of the,</p>

<ul>
  <li>Bus routes</li>
  <li>Availability of local buses</li>
  <li>Location of bus stops around the tourist place</li>
  <li>Taxi availability</li>
</ul>

<p>Thus, here we create a measurement index for the tourist place on the basis of the conveyance.</p>

<p><strong>4. Near-by Places</strong>
Here we look for the following near-by places,</p>

<ul>
  <li>Museums</li>
  <li>Art Galleries</li>
  <li>Local handicrafts markets</li>
  <li>Malls</li>
  <li>Temples</li>
  <li>Other markets</li>
  <li>
    <p>Other famous places</p>

    <p>One can easily see the connection between a tourist place and these near-by places: someone who visits the tourist place is bound to visit them too. The more such places there are, the more visitors the tourist place gets. This is thus one of the judging criteria.</p>
  </li>
</ul>

<p><strong>5.  Overall beauty</strong>
To determine the relative aesthetic beauty of a place we have two things which can tell us how much a Tourist place might appear to be beautiful to a tourist.</p>

<ul>
  <li>User Reviews</li>
  <li>Google Images</li>
</ul>

<p>For user reviews, we have the reviews and feedback tourists have left for various tourist places. The challenges are that user reviews cannot be fully relied upon, that there is no authentic review system/platform, and that even with both of these solved the reviews are written in natural language by humans. Measuring aesthetic beauty this way is therefore very difficult. Hence, the second case.</p>

<p>In the second case we have images downloaded from Google. For each tourist place we will have various images available, snapped at different angles and distances. By applying image processing methods we will determine the index as explained below.</p>

<p>The overall beauty is usually due to the combined effects of the architecture of the place, scenic beauty, cleanliness, etc. These points are considered to put a number on how beautiful a place is. Here too we have categories for measuring beauty,</p>

<ul>
  <li>Surrounded by trees and greenery</li>
  <li>Well maintained place with paths and cleanliness</li>
</ul>

<p>This is a very basic approach, in that the score rests on just 2 factors. This class needs more research. With the knowledge we have at present we could only come up with these 2 factors.</p>

<h4 id="base-set-for-comparison">BASE Set for comparison</h4>

<p>For base set we take some famous places we know of under each category and gather their data from all the sources specified above.
For the creation of the above set we consider each category, and under each category we gather data for the following,</p>

<p><strong>Accommodation</strong>
    To get data for accommodation we use the Google Maps API, travel agent websites and Governmental data. From this data we compute statistics and come up with an index backed by facts and figures. We also compute the average of the index values of all the famous places in that category.</p>

<p><strong>Conveyance</strong>
    For the conveyance information, the data will be fetched via the Google Maps API, travel agent websites and Governmental tourism sites. From this information on the factors stated above, a set of values is determined for each place. The average of these values over all the places in the category is also calculated.</p>

<p><strong>Near-by Places</strong>
    Again the data for this is fetched using the Google Maps API, Governmental data, Restaurants/Food reviewing sites and Wikipedia. Picture a graph with the tourist place at the center and the near-by places connected to it: a denser graph means more near-by places that might interest tourists, and thus more revenue for the region. The density of the graph is therefore going to be a deciding factor here.</p>
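<p>One way to put a number on this, sketched below, is to read “density” loosely as the count of near-by places within some radius of the tourist place. The great-circle (haversine) distance is standard; the radius, the function names and the coordinates (approximate values for the Red Fort, Jama Masjid, India Gate and the Taj Mahal) are our assumptions for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from bisect import bisect_right
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby_count(center, places, radius_km=5.0):
    """Count the places that lie within radius_km of the center."""
    distances = sorted(
        haversine_km(center[0], center[1], lat, lon) for lat, lon in places
    )
    # bisect_right gives the number of distances not exceeding the radius.
    return bisect_right(distances, radius_km)


# Approximate coordinates, for illustration only.
red_fort = (28.6562, 77.2410)
places = [
    (28.6507, 77.2334),  # Jama Masjid
    (28.6129, 77.2295),  # India Gate
    (27.1751, 78.0421),  # Taj Mahal
]
print(nearby_count(red_fort, places, radius_km=3.0))
print(nearby_count(red_fort, places, radius_km=10.0))
</code></pre></figure>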

<p><strong>Overall beauty</strong>
    The data for the factors in this class will be extracted from the downloaded Google images of a place and from Wikipedia. A score will be generated for each place on the basis of factors like the presence of green color or the uniformity of the color of the structure (signifying maintenance of the structure by repainting, etc).</p>
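<p>As a very rough sketch of the “presence of green color” factor: assuming an image has already been decoded into a list of (red, green, blue) tuples (the decoding itself needs an imaging library and is not shown), we could measure the share of pixels where green dominates. The rule “green is strictly the largest channel” is our simplification, not a tested measure of greenery.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def green_fraction(pixels):
    """Share of pixels whose green channel is strictly the largest."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    # Green dominates exactly when adding it changes the channel maximum.
    green = sum(1 for r, g, b in pixels if max(r, g, b) != max(r, b))
    return green / len(pixels)


# Four made-up pixels: two greenish, one beige, one grey.
sample_pixels = [(30, 120, 40), (200, 190, 180), (10, 90, 20), (90, 90, 90)]
print(green_fraction(sample_pixels))
</code></pre></figure>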

<h4 id="users-data-set">USER’s Data Set</h4>

<p>To produce results, that is, to work out the changes to suggest for the user’s tourist place to increase the footfall there, our tool needs some data from the user. The required data is,</p>

<p><strong>Category of the tourist attraction</strong>
    We will ask for the category of the place the user wants us to compare; this gives our application a smaller master data set and better results.
    If a category other than the five above is specified, the master data set will be the whole data set, that is, all the places irrespective of category.</p>

<p><strong>Accommodation Facility</strong>
    Here we will ask for Excel sheets in a specified format with the details of the hotels around the tourist place, so that we know how much is already present there. The data will be taken for each class of hotel (5 star, 4 star, 3 star etc) so that we know how active the place is.</p>

<p><strong>Average Footfall</strong>
    Footfall can be entered in three modes:</p>

<ul>
  <li>Daily</li>
  <li>Monthly</li>
  <li>Yearly</li>
</ul>

<p>After choosing the mode, the user should provide an Excel sheet with the footfall data for that mode, e.g. daily footfall for the last 18 months or monthly data for the last 48 months and so on.</p>
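<p>Since the three modes have different units, the tool would need to bring them onto one scale before comparing places. A minimal sketch, assuming we normalize everything to average footfall per day and use average month and year lengths (both choices, and the function name, are ours):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># Average number of days covered by one entry in each mode.
DAYS_PER_PERIOD = {"daily": 1.0, "monthly": 365.25 / 12, "yearly": 365.25}


def average_daily_footfall(mode, counts):
    """Convert a series of per-period counts into average visitors per day."""
    mode = mode.lower()
    if mode not in DAYS_PER_PERIOD:
        raise ValueError("mode must be daily, monthly or yearly")
    counts = list(counts)
    if not counts:
        raise ValueError("at least one count is required")
    return sum(counts) / (len(counts) * DAYS_PER_PERIOD[mode])


print(average_daily_footfall("daily", [100, 200, 300]))
print(average_daily_footfall("yearly", [365250]))
</code></pre></figure>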

<p><strong>Location</strong>
    Latitudes and longitudes of the tourist place will be asked here. This step is to determine the place of the location on the maps and leverage the Google Maps API and use the data fetched to derive the results.</p>

<p><strong>Images</strong>
    Here at least 10 images from 10 different angles (one from each side, one from each corner and 2 from the top) of the tourist place will be asked for, so that we can determine the aesthetic measurements of the place.</p>

<h4 id="report">REPORT</h4>

<p>After the analysis is done, the report will be generated. The report will comprise a graphical representation of what the tourist attraction lacks when compared with the places in the corresponding category.
Along with the graph, a set of suggestions will be provided for the four points on which the places are compared, that is, Accommodation, Conveyance, Near-by Places and Overall beauty. Supporting data will be shown with each suggestion, so that the user does not think we are trolling him and can see that the results are based on real data of real places.
There is also an option to print/export the generated report to have a hard copy.</p>
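<p>The comparison behind the report could be sketched as follows: average each of the four index values over the base set of the category, then report how far the user’s place trails each average. The 0 to 10 index scale, the dictionary layout and all the numbers are assumptions for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">DIMENSIONS = ("accommodation", "conveyance", "nearby_places", "overall_beauty")


def category_average(base_places):
    """Average every index over the famous places of one category."""
    count = len(base_places)
    return {dim: sum(p[dim] for p in base_places) / count for dim in DIMENSIONS}


def gaps(place, average):
    """A positive gap means the place trails the category average there."""
    return {dim: max(0.0, average[dim] - place[dim]) for dim in DIMENSIONS}


# Made-up index values on an assumed 0 to 10 scale.
base = [
    {"accommodation": 8, "conveyance": 6, "nearby_places": 9, "overall_beauty": 7},
    {"accommodation": 6, "conveyance": 8, "nearby_places": 5, "overall_beauty": 9},
]
candidate = {"accommodation": 4, "conveyance": 7, "nearby_places": 2, "overall_beauty": 9}

report = gaps(candidate, category_average(base))
# Largest gaps first, so the most pressing suggestion comes on top.
for dim in sorted(report, key=report.get, reverse=True):
    print(dim, report[dim])
</code></pre></figure>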

<p>I’ll have to start developing this idea and implement it to see the results. Let’s see how it goes.</p>

<p>SR.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[####Problem statement India is one of the countries having a diverse cultural heritage and thus a wide variety of tourist places. Some of those are still not up to their full potential to be a contributing factor to the India’s economy like other recognized tourist spots. Our tool is focused on the improvement of same tourist places in India which have potential to generate a very good revenue but not generating it. Improvement is suggested on the basis of the previous data of the famous tourist places and present data of the tourist place to be improved.]]></summary></entry><entry><title type="html">Packages I do use</title><link href="https://trigonaminima.github.io/2014/07/packages/" rel="alternate" type="text/html" title="Packages I do use" /><published>2014-07-26T00:00:00+00:00</published><updated>2014-07-26T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/07/packages</id><content type="html" xml:base="https://trigonaminima.github.io/2014/07/packages/"><![CDATA[<p>Thinking of installing a new Distro (Deepin, because of the lovely looks of this relatively new product from China) I refrained myself in apathy of installing all the packages and softwares again in the new Distro. Also, I don’t remember them all. So, now at 3:00 AM in the morning I am documenting all the packages I use. This list will keep on being modified in the future (yeah, yeah, I know you knew that).</p>

<p>Right now I am just listing all of ‘em, later I’ll give details too (and remove this line too (: ).</p>

<p>One more thing, this list will not be much useful to non-geeky people. Although, I doubt that any reader of this post is non-geeky or further I doubt I have any readers at all. Yeah, I know my life is sad, you don’t have to tell me that. Here the list goes in the order of installation.</p>

<h3 id="sublime-text-3"><a href="http://www.sublimetext.com/">Sublime Text 3</a></h3>
<p>It’s worthy of being here as the first. I can’t explain how good it is, it is that good. This text editor has got everything you can imagine, and if anything is missing there are many plug-ins made by the active developer community. You will even find plug-ins for features you can’t imagine (at least not yet).</p>

<p><strong>Plugins I use</strong></p>

<ul>
  <li><a href="https://sublime.wbond.net/installation">Package Control</a><br />
This is the package where you will find other packages to install. You can think of this as a <em>synaptic</em> for sublime packages. You can install it using the <a href="https://sublime.wbond.net/installation">link</a>.<br />
Press Ctrl+Shift+p and type ‘install packages’, you will get a prompt. Enter the required package name there and it will be installed directly. Thus for the following packages you can follow the same process</li>
  <li><a href="https://bitbucket.org/StephaneBunel/pythonpep8autoformat">Python PEP8 autoformat</a><br />
You might have heard about PEP8. It is the style guide for programming in Python. And yes, as you can guess, this plug-in basically does that. It PEP8-ifies the Python code.</li>
  <li><a href="https://github.com/randy3k/AlignTab">AlignTab</a><br />
It helps in aligning the code as you want. Perhaps, the link will tell you better.</li>
  <li><a href="https://github.com/jonathandelgado/SublimeTodoReview">TodoReview</a><br />
There are many places in your code where you write a reminder keyword of sorts (like TODO) so that when you revisit the code later you can remind yourself of what you wanted to edit/add/remove. This plug-in helps with that. By default it looks for ‘TODO’, but you can add keywords of your liking.</li>
  <li><a href="https://github.com/wbond/sublime_terminal">Sublime Terminal</a><br />
Open Terminal Here menu and keyboard shortcuts for Sublime Text. (Yeah, it was blatantly copied right from its Github repo.)</li>
  <li><a href="https://github.com/Monnoroch/ColorHighlighter">Color Highlighter</a><br />
Just click or move the cursor (or multiple cursors) on the color code e.g. “#FFFFFF” and it’ll be highlighted with its real color.</li>
  <li><a href="https://github.com/SublimeText-Markdown/MarkdownEditing">Markdown Editing</a><br />
A plugin to handle markdown files.</li>
  <li><a href="https://github.com/jisaacks/GitGutter">GitGutter</a><br />
It’ll be better if you look at the link.</li>
  <li><a href="https://github.com/duydao/Text-Pastry">Text-Pastry</a><br />
Play with multiple selections and removals.</li>
  <li><a href="https://sublime.wbond.net/packages/GhostText">GhostText</a><br />
Use Sublime Text to write in your browser. Everything you type in the editor will be instantly updated in the browser (and vice versa). For this you also have to install the Chrome/Firefox plugin listed in the Chrome section.</li>
  <li><a href="https://github.com/SublimeText/Origami">Origami</a><br />
Split your pane any way you want.</li>
  <li><a href="https://sublimecodeintel.github.io/SublimeCodeIntel/">SublimeCodeIntel</a><br />
A code intelligence plug-in for Sublime Text. Copied from the above link. Go check it out.</li>
</ul>

<h3 id="synaptic"><a href="https://help.ubuntu.com/community/SynapticHowto">Synaptic</a></h3>
<p>This essentially does what you do via “apt-get”, but it gives you the list of packages you can install in a GUI. You can search and select the packages for installation, removal, re-installation etc.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>synaptic</code></pre></figure>

<h3 id="git-and-github"><a href="http://git-scm.com/">Git</a> (and <a href="https://github.com">Github</a>)</h3>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>git</code></pre></figure>

<p>I don’t think I have to tell you anything about <em>git</em>. It is the creation of the one and only <em>Linus Torvalds</em>. Git is an awesome tool, the lifeline of almost all open source development. Every computer programmer should understand and use version control. There are many version control tools like Bazaar, Mercurial, SVN etc. Git is the best of these (according to me). It’s a command line tool.
And out of the many web clients, <a href="https://github.com">Github</a> is the best. To know how, you can try it out yourself. This blog is hosted on Github, Github is that awesome.</p>

<h3 id="python27-34">Python2.7, 3.4</h3>
<p>There’s nothing to say here. Just check out one of the most widely used programming languages at <a href="https://www.python.org/">python.org</a>.</p>

<h3 id="pip"><a href="https://pip.pypa.io/en/latest/">pip</a></h3>
<p>This is the package manager (yeah, apt-get) for the python modules. You can install almost every python module via pip. And when it is combined with <em>virtualenv</em> (Next heading) then it is killer as fuck.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>python-pip</code></pre></figure>

<h3 id="virtualenv"><a href="http://virtualenv.readthedocs.org/en/latest/">virtualenv</a></h3>
<p>This is what I was talking about above. It is a simple tool to create isolated Python environments. If you don’t want to clutter your standard Python installation with modules and make it huge, then you install this and create an isolated Python interpreter. You can create the environment with any version of Python. <em>pip</em> is generated along with the Python interpreter, and you can use this pip to install your modules into the generated interpreter and make a local environment according to your needs.
For complete understanding and documentation you can check out this <a href="http://virtualenv.readthedocs.org/en/latest/">link</a>.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>pip <span class="nb">install </span>virtualenv</code></pre></figure>
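<p>To get a feel for what an isolated environment looks like on disk, here is a small sketch. Note the swap: it uses <code class="language-plaintext highlighter-rouge">venv</code>, the module that ships with Python 3, and not virtualenv itself. The helper name and the directory name are made up, and pip is left out of the environment so that the sketch needs no network.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import os
import tempfile
import venv


def make_env(parent_dir, name="env"):
    """Create an isolated environment under parent_dir and return its path."""
    env_dir = os.path.join(parent_dir, name)
    # with_pip=False keeps the sketch offline and fast.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = make_env(tmp)
    # pyvenv.cfg marks the directory as a virtual environment.
    print(os.path.isfile(os.path.join(env_dir, "pyvenv.cfg")))
</code></pre></figure>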

<h3 id="chrome-browser"><a href="https://www.google.com/chrome/browser/">Chrome Browser</a></h3>
<p>I think this <a href="https://en.wikipedia.org/wiki/Usage_share_of_web_browsers">wiki</a> establishes the fact that this browser is number one. I have tried both Firefox and Chrome. Here are a few points worth mentioning that make Chrome better than Firefox.</p>

<ol>
  <li>Syncing is better across devices.</li>
  <li>If you don’t worry that much about your history (search patterns) being saved and you are an Android user then using Chrome with <a href="https://www.google.com/landing/now/">Google Now</a> will show you what I am talking about. Whatever you search on your Laptop or PC you can instantly get its history in Google Now, which makes life easier.</li>
  <li><a href="https://chrome.google.com/webstore/category/collection/for_your_desktop">Desktop Apps</a>, install these in your Chrome browser as an app and you can get its launcher directly on  your desktop as if the app is like a software on your system. And these apps are synced across all your machines if you have logged in with the same account.</li>
  <li>I can’t think of any more, just move already. How much do you need?</li>
</ol>

<p><strong>Extensions</strong></p>

<ul>
  <li><a href="https://chrome.google.com/webstore/detail/adblock/gighmmpiobklfepjocnamgkkbiglidom">AdBlock</a><br />
Dude, name suggests everything.</li>
  <li><a href="https://chrome.google.com/webstore/detail/google-keep-notes-and-lis/hmjkmjkepdijhoojdojkdfohbdgmmhki">Google Keep - notes and lists</a><br />
This acts as a Desktop App as I mentioned above. Since my Android and Chrome are linked with same Gmail account, my notes <em>keep</em> synced across devices. This connectivity feels awesome.</li>
  <li><a href="https://chrome.google.com/webstore/detail/google-translate/aapbdbdomjkkjkaonfhkkikfgjllcleb">Google Translate</a><br />
Translates the whole page in a jiff.</li>
  <li><a href="https://chrome.google.com/webstore/detail/hover-zoom/nonjdcjchghhkdoolnlbekcfllmednbl">Hover Zoom</a><br />
Just hover over an image and see its zoomed preview.</li>
  <li><a href="https://chrome.google.com/webstore/detail/pocket/mjcnijlhddpbdemagnpefmlkjdagkogk">Pocket</a><br />
Another desktop app which provides the articles I save on the web offline.</li>
  <li><a href="https://chrome.google.com/webstore/detail/save-to-pocket/niloccemoadcdkdjlinkgdfekeahmflj">Save to Pocket</a><br />
The plugin used to save the articles in the pocket.</li>
  <li><a href="https://chrome.google.com/webstore/detail/session-buddy/edacconmaakjimmfgnblocblbcdcpbko">Session Buddy</a><br />
Save and manage sessions.</li>
  <li>
    <p><a href="https://chrome.google.com/webstore/detail/stylish/fjnbnpbmkenffdnngjfgmeleoegfcffe?hl=en">Stylish</a><br />
Restyle the web with Stylish, a user styles manager. Stylish lets you easily install themes and skins for many popular sites. Some styles which can be done can be searched on the plugin management window itself. Some I use are listed here,</p>

    <ul>
      <li><a href="https://github.com/mdo/github-wide">Wide Github</a></li>
    </ul>
  </li>
  <li>GhostText for Chrome<br />
This is the plugin which works with the above mentioned Sublime Text 3 plugin GhostText.</li>
  <li><a href="https://chrome.google.com/webstore/detail/the-great-suspender/klbibkeccnjlkjkiokjodocebajanakg">The Great Suspender</a><br />
Unload, park, suspend tabs to reduce memory footprint of chrome.</li>
  <li><a href="https://chrome.google.com/webstore/detail/markdown-here/elifhakcjgalahccnjkneoccemfahfoa">Markdown Here</a><br />
Write your email in Markdown, then make it pretty.</li>
  <li>Terms of Service; Didn’t Read<br />
Get information instantly about websites’ terms of service and privacy policies, with ratings and summaries from the www.tosdr.org.</li>
  <li><a href="https://chrome.google.com/webstore/detail/google-dictionary-by-goog/mgijmajocgfcbeboacabfgobmjgjcoja">Google Dictionary</a><br />
Go to the first extension I listed and read its description.</li>
</ul>

<h3 id="dropbox-for-linux"><a href="https://www.dropbox.com/">Dropbox for Linux</a></h3>
<p>Go to software center and install it.</p>

<p>It is another web-service I am very thankful for. My Dropbox account is linked with my Android, my Laptop and any other machine I want. And everything is synced everywhere. Whenever a pic is snapped with my phone it is uploaded to my Dropbox account and then downloaded to every linked machine. And whatever I add through the desktop is available everywhere once it gets uploaded. It has really made my life simpler.</p>

<h3 id="redshift"><a href="http://jonls.dk/redshift/">Redshift</a></h3>
<p>If you know what f.lux (for Windows) is then you know what redshift is. For the ignorants, it adjusts a computer display’s color temperature according to its location and time of day, based on a user specified set of longitude and latitude geographical coordinates, a ZIP Code, or a city name.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>gtk-redshift</code></pre></figure>

<h3 id="bleachbit"><a href="http://bleachbit.sourceforge.net/">Bleachbit</a></h3>
<p>Go to software center and install it.</p>

<p>It removes unnecessary files off your laptop. There are a lot of options provided, from web history to bash history.</p>

<h3 id="musicbrainz-picard"><a href="https://musicbrainz.org/doc/MusicBrainz_Picard">MusicBrainz Picard</a></h3>
<p>Go to software center and install it.</p>

<p>This is a product of <a href="https://musicbrainz.org/">MusicBrainz Foundation</a> which, for your information, is a web-service holding data about music. Here data means metadata, the data about each commercially released song: name of song, name of album, artist, year of release etc. This wiki or music database is organized and maintained by people like you and me.
Now, what <a href="https://musicbrainz.org/doc/MusicBrainz_Picard">Picard</a> does is help you make your music’s metadata better. You must have had cases where the metadata of your music was missing or incorrect, like when you download some songs via torrent or other sites and some site name shows up in the song title or artist name (eg: Coldplay-Shiver-brought to you by-troll.com). I know it’s not that big of a deal, but it still annoys. Picard helps with that. It searches its database and corrects the metadata. It also downloads the album art from its database.</p>

<h3 id="wget"><a href="https://www.gnu.org/software/wget/">wget</a></h3>
<p>Usually, it’s already present in the distro.</p>

<p>Downloads anything - video, audio, image, zip - by just providing a link after the command. It’s a command line tool.</p>

<h3 id="curl"><a href="https://en.wikipedia.org/wiki/CURL">cURL</a></h3>
<p>Helps me in sending all kind of requests, whatever it is - GET, POST, DELETE, PUT etc, to a server. You can also download an html page or some other downloadable file with this tool (yeah, same as wget). It’s a command line <a href="http://curl.haxx.se/">tool</a>.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>curl</code></pre></figure>
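<p>As a rough sketch of everyday use, a download can be wrapped in a small shell function. The helper name <code class="language-plaintext highlighter-rouge">fetch_to</code> and the example.com URLs are placeholders of mine, not something curl ships with.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># fetch_to URL FILE: download URL into FILE (hypothetical helper).
# --fail       exit with an error on HTTP errors such as 404
# --location   follow redirects
# --silent and --show-error together hide the progress meter but keep error messages
fetch_to() {
    curl --fail --silent --show-error --location --output "$2" "$1"
}

# Usage (placeholder URLs):
#   fetch_to https://example.com/ page.html
# Other request types go through -X, with a body passed via -d:
#   curl -X POST -d 'key=value' https://example.com/api
#   curl -X DELETE https://example.com/api/item/1</code></pre></figure>

<p>The wget counterpart of the first usage line would be <code class="language-plaintext highlighter-rouge">wget -O page.html https://example.com/</code>.</p>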

<h3 id="youtube-dl"><a href="https://rg3.github.io/youtube-dl/">youtube-dl</a></h3>
<p>If you want to download a YouTube video from the terminal without using any plugin or extension, this is the command line tool for you. It can download any video off the Internet (I haven’t been disappointed yet). Just give it the URL of the video you want downloaded.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>youtube-dl</code></pre></figure>

<h3 id="music-player">Music Player</h3>

<ul>
  <li>
    <p><strong><a href="https://www.clementine-player.org/">Clementine</a></strong>
Best music player in terms of functionality and ease of use. You will get the hang of it soon enough. Plugins like the scrobbler are already built into the software.</p>
  </li>
  <li>
    <p><strong><a href="https://github.com/linuxdeepin/deepin-music-player">Deepin Music Player</a></strong>
Since switching to Deepin Linux (yeah, I have switched to it) I haven’t needed any other music player. It’s good looking, with less clutter. What I miss in this player is a scrobbler. I am thinking of adding that support and sending a PR to the Deepin guys. Let’s see if it happens or not. Maybe they are already planning or implementing it.</p>
  </li>
</ul>

<h3 id="video-player">Video Player</h3>

<ul>
  <li>
    <p><strong><a href="https://www.videolan.org/vlc/index.html">vlc</a></strong>
Who can deny the dominance of VLC?
It has a large array of settings and functionality. It can play anything. It used to be my default player until…</p>
  </li>
  <li>
    <p><strong><a href="https://github.com/linuxdeepin/deepin-movie">Deepin Movie</a></strong>
Yeah, until I used Deepin Movie. It looks awesome. It has the basic things that I need in a video player. It has been able to play everything I have thrown at it so far. No other bullshit. It’s just awesome.</p>
  </li>
</ul>

<h3 id="jekyll"><a href="http://jekyllrb.com/docs/installation/">Jekyll</a></h3>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nb">install </span>jekyll</code></pre></figure>

<p>Make sure you have libssl-dev installed:</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh">dpkg <span class="nt">-s</span> libssl-dev</code></pre></figure>

<p>If not, install it:</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>apt-get <span class="nt">-y</span> <span class="nb">install </span>libssl-dev</code></pre></figure>

<p>Download Ruby, RubyGems and Node from here and build them. Then install Jekyll with:</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">sudo </span>gem <span class="nb">install </span>jekyll</code></pre></figure>

<p>Jekyll is my blog framework. All this reading you are doing here is thanks to Jekyll. You can change the theme if you want to. It’s much better than WordPress in every respect.</p>

<h3 id="firefox"><a href="https://www.mozilla.org/en-US/firefox/new/">Firefox</a></h3>
<p>Yeah, I gotta include it here. It is a good piece of software by <a href="https://www.mozilla.org/en-US/">Mozilla</a>. Syncing is present here, but not like in Chrome. In Chrome it’s simpler; in Firefox you gotta create an account, whose credentials I used to forget. So I didn’t get my previously synced data.</p>

<p><strong>Addons</strong></p>

<ul>
  <li>Adblock Plus</li>
  <li>Dictionary Extension</li>
  <li>DownThemAll!</li>
  <li>Ginger Grammar and Spell Checker</li>
  <li>InstantFox</li>
  <li>Pocket</li>
  <li>Session Manager</li>
  <li>Thumbnail Zoom Plus</li>
  <li>Lazarus</li>
  <li>Evernote Web Clipper</li>
  <li>Disconnect</li>
  <li>Tab Mix Plus</li>
  <li>Video DownloadHelper</li>
  <li>Xmarks</li>
  <li>InvisibleHand</li>
  <li>Hola Unblocker</li>
  <li>Barlesque</li>
</ul>

<h3 id="mackup"><a href="https://github.com/lra/mackup">Mackup</a></h3>
<p>Keep your application settings in sync. If you have Dropbox installed and want to use it to save your config files, that’s super easy. Supported for Linux/OS X.</p>

<h3 id="qt4-designer">Qt4 Designer</h3>
<p>As the name suggests, you can design GUIs with this software instead of building everything in code. It produces a .ui file which can easily be converted to a Python class for use with PyQt4.</p>

<hr />

<p>The above list of packages I use will change with time. There will be more additions, removals and modifications in the future.</p>

<p>I started writing this blog post in Linux Mint and am now ending it in Deepin after 2-3 weeks. During this time I was settling in my new virtual home, installing it, re-installing it, re-re-installing it. It was fun doing that. My post evolved during this whole time. And, now it’s time to sign off.</p>

<p>SR.</p>]]></content><author><name>Shivam Rana</name></author><category term="Linux" /><category term="General" /><summary type="html"><![CDATA[Thinking of installing a new Distro (Deepin, because of the lovely looks of this relatively new product from China) I refrained myself in apathy of installing all the packages and softwares again in the new Distro. Also, I don’t remember them all. So, now at 3:00 AM in the morning I am documenting all the packages I use. This list will keep on being modified in the future (yeah, yeah, I know you knew that).]]></summary></entry><entry><title type="html">The kernel challange series: Building and booting the Linux kernel</title><link href="https://trigonaminima.github.io/2014/06/build-compile-linux-kernel/" rel="alternate" type="text/html" title="The kernel challange series: Building and booting the Linux kernel" /><published>2014-06-19T00:00:00+00:00</published><updated>2014-06-19T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/06/build-compile-linux-kernel</id><content type="html" xml:base="https://trigonaminima.github.io/2014/06/build-compile-linux-kernel/"><![CDATA[<p>How I am becoming a Linux kernel developer (at least, I think I am). There will be a series of posts as I get ahead on my becoming a Linux kernel developer quest. These are the ones I have written yet.</p>

<p><a href="/2014/05/writing-linux-kernel-module/">Writing a Linux kernel module</a><br />
<a href="/2014/06/build-compile-linux-kernel/">Building and booting the Linux kernel</a> (this post)</p>

<p>Task 2 is as follows -</p>

<blockquote>
  <p>Download Linus’s latest git tree from git.kernel.org. Build it, install it, and boot it. You can use whatever kernel configuration options you wish to use, but you must enable <em>CONFIG<code class="language-plaintext highlighter-rouge">_</code>LOCALVERSION<code class="language-plaintext highlighter-rouge">_</code>AUTO=y</em>. Show proof of booting this kernel. Bonus points if you do it on a “real” machine, and not a virtual machine (virtual machines are acceptable, but come on, real kernel developers don’t mess around with virtual machines, they are too slow. Oh yeah, we aren’t real kernel developers just yet.)</p>
</blockquote>

<p>Okay, so there’s not much to say about what I did, but I’ll still write it down for the record.</p>

<p>I first ran</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>make menuconfig</code></pre></figure>

<p>I went to <em>General setup</em>, selected the option that automatically appends version information to the version string (that is <em>CONFIG_LOCALVERSION_AUTO</em>) and then saved the configuration as <em>.config</em>. You will find a file with that name in your kernel directory. Now run as root</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>make <span class="o">&amp;&amp;</span> make modules_install <span class="o">&amp;&amp;</span> make <span class="nb">install</span></code></pre></figure>
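<p>Since the build takes a while, it can be worth confirming beforehand that the option really ended up in the configuration. A minimal sketch, meant to be run from the kernel directory; the function name is my own and not part of the kernel build system.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Hypothetical helper: succeed only if the given kernel config file
# (default: .config in the current directory) enables the option.
# -q prints nothing, -s stays quiet if the file is missing.
has_localversion_auto() {
    grep -q -s '^CONFIG_LOCALVERSION_AUTO=y$' "${1:-.config}"
}

if has_localversion_auto .config; then
    echo "CONFIG_LOCALVERSION_AUTO is enabled"
else
    echo "CONFIG_LOCALVERSION_AUTO is not enabled (or .config is missing)"
fi</code></pre></figure>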

<p>This won’t end anytime soon, so go have a walk or something. I went for dinner, if you were wondering. I did this whole compiling process three times. The first time I didn’t know how long it would take, so I watched it compile and install for 15-20 minutes (even though various blogs had warned me that it would take time). I should have taken some inspiration from this xkcd comic <a href="http://xkcd.com/303/">here</a>.</p>

<p>When the show’s over, reboot your machine; the kernel version you built will already be visible in the GRUB menu. For further proof, you can run the following command in your terminal.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">uname</span> <span class="nt">-a</span></code></pre></figure>

<p>It prints a lot of information on one line, including the kernel version. This is what I got.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">Linux Shivam 3.16.0-rc2y #1 SMP Sun Jun 22 21:24:16 IST 2014 x86_64 x86_64 x86_64 GNU/Linux</code></pre></figure>]]></content><author><name>Shivam Rana</name></author><category term="Linux" /><summary type="html"><![CDATA[How I am becoming a Linux kernel developer (at least, I think I am). There will be a series of posts as I get ahead on my becoming a Linux kernel developer quest. These are the ones I have written yet.]]></summary></entry><entry><title type="html">AngelHacked Weekend</title><link href="https://trigonaminima.github.io/2014/06/angelhack/" rel="alternate" type="text/html" title="AngelHacked Weekend" /><published>2014-06-10T00:00:00+00:00</published><updated>2014-06-10T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/06/angelhack</id><content type="html" xml:base="https://trigonaminima.github.io/2014/06/angelhack/"><![CDATA[<p>So, this weekend of mine was spent in an office in Gurgaon (1-2 hr away from my home), hacking a web service called <strong>Freya</strong>. FYI, <em>hacking</em> means tinkering/playing/building with the technologies, not getting access into private things and steal anything, inscribe it in your mind.</p>

<p>Let’s start with.. Well… Start.
The hackathon <a href="http://www.angelhack.com/"><strong>AngelHack</strong></a>, organized annually all over the world, was held here in Delhi for the first time.
I had never been to a hackathon, which is (for the readers who don’t know what a hackathon is) a competition where you spend 24-48 hours building a product - an app or web service or API or any other hack. You can take part as an individual or as a team. You can form teams with people you know or with people you meet at the venue. The methods vary depending on the rules, but the aim is to deliver an <a href="http://en.wikipedia.org/wiki/Minimum_viable_product"><em>MVP</em></a> in a short amount of time.
Unfortunately, around here, every hackathon is about making an app or web service. Even though it’s fun doing that, I am not interested in that. I look for events where I can work on hardware, on APIs, on data analytics etc. But alas, I have never found one.</p>

<p>Enough with the explaining. Let’s start.</p>

<p>I was in a team of five, with four seniors. The reasons for all of them being seniors were these: I have pathetic classmates, never ready to do such things. Always bitching away. Chickens!! (Yeah, Shadab Zafar, that was for you too!!). Also, I didn’t know how to make an Android app, nor did I have any experience with a back-end technology like PHP. I only have a little bit of experience in front-end languages like HTML and CSS. I am more of a systems programmer, though I don’t have deep expertise there either. So, it was a logical step to team up with the seniors.</p>

<p><strong>Q. You might ask why I even went there if I didn’t know how to hack?</strong>
<strong>A.</strong> Since I had never been to a hackathon, I wanted to experience the event: to feel the pace of the environment, to present an idea to a live audience and judges, and to work within a team.
If I hadn’t gone to this event then I would have never gained that experience. If I hadn’t gone then I would have never known that making a web-service was so simple. All that mattered was a great idea for the product.</p>

<p>Again, I started babbling. I was going to tell you what I actually did.</p>

<p>Firstly, this hackathon ran for about 24 hours, from 1:00 pm on Saturday 7th June till 1:00 pm on Sunday 8th June. There were 5 people in my team: me, <a href="https://github.com/vipulnayyar"><em>Mr Vipul</em></a>, <a href="https://github.com/Pankaj*ksharma"><em>Mr Pankaj</em></a>, <a href="https://github.com/Hammad*haleem"><em>Mr Hammad</em></a> and <a href="https://github.com/Safiyat*"><em>Mr Safiyat</em></a>. The team was divided into 2 sub-teams, with <em>Mr Vipul</em> and <em>Mr Hammad</em> in one and the other three in the other. We had 2 ideas.
<em>Mr Hammad</em> and <em>Mr Vipul</em> worked on <strong>distributed CDN</strong> (I won’t tell you what it is ‘cos I barely understand it. Okay, I will tell you, but later.)
<em>Mr Pankaj</em>, <em>Mr Safiyat</em> and I were working on <strong>Freya</strong>. Now, two things.</p>

<p><strong>Q. What was the idea? And how did we come up with this name?</strong>
<strong>A.</strong> First, I’ll talk about the idea. From around 12:00 noon to 3:00 pm on Saturday we just discussed ideas. Actually, <em>Mr Safiyat</em> and I were just chilling, waiting for the others to think of something. They came up with the CDN, about which we didn’t know anything, so we decided to think of something separate. We (mostly me) came up with two ideas. One was to create a <a href="http://www.yelp.com/"><strong>yelp</strong></a>-like web service for India (yeah! There isn’t one). The other was to create a <strong>platform for NGOs and Social Startups</strong>, i.e. a common place for all NGOs and social startups to showcase what they have been up to, and for people to get in contact with these orgs.
We took both ideas to <em>Mr Pankaj</em>, as he would know how to develop these things. He liked the NGO one and decided to help us build it. We obviously didn’t have any problem with that.
Now let’s talk about the name, Freya. We had to give the organizers a name for the project. We searched for the goddess of communication and got results like Iris and Saraswati, and then we saw the word ‘Freya’. It seemed different, so we decided to keep that name.
Later I searched for the name Freya and it turned out to be a completely different thing. Here, <a href="http://lmgtfy.com/?q=freya"><em>let me Google it for you</em></a> (look at the first link). We were praying for the judges not to ask about the meaning of or inspiration for the name. Well, they didn’t. Phew…</p>

<p>We started working on Freya around 4 pm or 5 pm. We made the database in MySQL using phpMyAdmin, then I moved on to making the front-end using Bootstrap. <em>Mr Safiyat</em> was writing the PHP functions along with other personal chores. <em>Mr Pankaj</em> was just passing time lying around and helping us from time to time. It took me till 2 am to complete the basic front-end. Then the PHP code and the connecting of front-end and back-end began. I did bits and pieces of coding and a little bit of UI tweaking, and then I slept on a <a href="http://static.giantbomb.com/uploads/original/7/72889/1487261-king_20beanbag_20__20royal_20vinyl.jpg"><em>beanbag</em></a> (I have wanted one of those for like forever and sleeping on one just made me want it more.)</p>

<p>Now, two things happened which I found meaningful. One, <em>Mr Pankaj</em> kept asking me to do one thing or another. If he hadn’t done that, I would have done everything slowly and might not have completed the whole project. Moreover, I liked the pace he set that night. The other was that after dinner, at around 9 pm, <em>Mr Safiyat</em> and I went to a balcony and talked for around 1 hour. We talked about what we want to do, what we have done, what our lives are like. Hell, we even talked about our eyesight. It was fascinating to know all that.</p>

<p>I woke up to <em>Mr Hammad</em> nagging me because my daily alarm had disturbed his sleep. Other people in the room said that the alarm had been ringing for like 1 hour, but no one cared, and neither did I. We had breakfast. <em>Mr Pankaj</em> was awake and had completed almost all the functionality behind the web service. I looked at the code and checked every feature for bugs. Some bugs were corrected. And our <a href="http://en.wikipedia.org/wiki/Minimum_viable_product">MVP</a> was complete.
Now came the presentation round. This was my first time at this kind of event, so I was nervous. Both <em>Mr Pankaj</em> and <em>Mr Safiyat</em> were asking me to do the presentation. We later settled on splitting it. So during the presentation, <em>Mr Safiyat</em> and <em>Mr Pankaj</em> did the talking and I did the demo.
There were 2 rounds: 3-4 minutes for Round one and 15 minutes for Round two. We did well in <em>Round one</em> and were selected from among 25 other teams for <em>Round two</em>. You know, that alone was a great thing for me and my team. Especially for me, ‘cos I <em>popped my hackathon cherry</em> at this event. We went through to Round two along with 9 other ideas. Here our presentation wasn’t that good. We <em>tumbled and fell</em> (the past form of the Feeder song <a href="https://www.youtube.com/watch?v=2sVSml7Bk3g">Tumble and Fall</a>). So <em>in the end</em> (Linkin Park - <a href="https://www.youtube.com/watch?v=1yw1Tgj9-VU">In the End</a>) we failed, and it doesn’t even matter that much. Well, we weren’t feeling that bad; we hadn’t even expected to pass Round one.</p>

<p><strong>Q. Who won?</strong>
<strong>A.</strong> I don’t really know who won, but what he made was something like a distributed content manager. Yeah, I didn’t understand what he did. I guess his grand prize was a trip to California for an incubator kind of program.</p>

<p><strong>Q. What were other ideas?</strong>
<strong>A.</strong> Some ideas were interesting. One guy made an Android app through which he controlled <em>NFS Most Wanted</em> on his laptop. He also made an API for that. Another was an app which broadcasts whatever you write on one phone to other phones over the network, without lag. Another was a quizzing app.
What else… Ah.. My memory. I don’t remember the other ones. Fuck it! Let’s move to the next question.</p>

<p>Oh.. I forgot!! I know you did too.
<strong>Q. What is <em>distributed CDN</em>?</strong>
<strong>A.</strong> It is like a normal <a href="http://en.wikipedia.org/wiki/Content_delivery_network">CDN</a> (Content Delivery Network), but it works on the concept of <a href="https://www.torproject.org/"><em>tor</em></a> or <a href="https://joindiaspora.com/"><em>diaspora</em></a>: it is run by the community. The content is hosted in each individual’s RAM, so it is lost when the machine powers off.
Now, do you understand? That’s why I wasn’t telling you. I can’t explain further.</p>

<p>Finally, the last question.
<strong>Q. What did I take from all this?</strong>
<strong>A.</strong> I loved the quick pace and fun of the whole process. In a short amount of time you have to create a working prototype of your idea. You incorporate many technologies, which you may or may not know. And in that short amount of time you might also have to learn a technology. You have to act quickly.
Another thing you learn is working in a controlled environment with a TEAM. You learn to deliver your part, on which another’s part depends. You fucking learn so many things. That’s the end of the story.
I wonder why every process of learning isn’t like this. It teaches the required qualities, and it teaches them quickly and efficiently. What I don’t see anyone getting here is deep theoretical knowledge.</p>

<p>Adios. SR.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[So, this weekend of mine was spent in an office in Gurgaon (1-2 hr away from my home), hacking a web service called Freya. FYI, hacking means tinkering/playing/building with the technologies, not getting access into private things and steal anything, inscribe it in your mind.]]></summary></entry><entry><title type="html">The kernel challange series: Writing a Linux kernel module</title><link href="https://trigonaminima.github.io/2014/05/writing-linux-kernel-module/" rel="alternate" type="text/html" title="The kernel challange series: Writing a Linux kernel module" /><published>2014-05-31T00:00:00+00:00</published><updated>2014-05-31T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/05/writing-linux-kernel-module</id><content type="html" xml:base="https://trigonaminima.github.io/2014/05/writing-linux-kernel-module/"><![CDATA[<p>How I am becoming a Linux kernel developer (at least, I think I am). There will be a series of these posts as I get ahead on my becoming a Linux kernel developer quest. These are the ones I have written yet.</p>

<p><a href="/2014/05/writing-linux-kernel-module/">Writing a Linux kernel module</a> (this post)<br />
<a href="/2014/06/build-compile-linux-kernel/">Building and booting the Linux kernel</a></p>

<p>Task #1 was to -</p>

<blockquote>
  <p>Write a Linux kernel module, and stand-alone Makefile, that when loaded prints to the kernel debug log level, “Hello World!”  Be sure to make the module unloadable as well. The Makefile should build the kernel module against the source for the currently running kernel, or, use an environment variable to specify what kernel tree to build it against.</p>
</blockquote>

<p>Well, particularly what I found very useful for this task was the book <strong><a href="http://tldp.org/LDP/lkmpg/2.6/html/">The Linux Kernel Module Programming Guide</a></strong> <a href="http://www.tldp.org/LDP/lkmpg/2.6/lkmpg.pdf">(pdf)</a>. Read the first few chapters and you will know the solution. Since I am writing this blog post for a kind of revision for myself I will explain the whole process. You sure can skip it if you want.</p>

<p>First of all, for this task there is no need to download a stable Linux kernel, since we are building the module against the source for the currently running kernel (i.e. the kernel of the distro you are using), as the task suggests. And for module programming we will obviously be using the C language.</p>

<p>Enough talk, let’s start.</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="c1">// Hello world module</span>

<span class="cp">#include &lt;linux/module.h&gt;      // Needed by all modules
#include &lt;linux/kernel.h&gt;      // Needed for KERN_DEBUG
</span>
<span class="c1">// A non 0 return means init_module failed; module can't be loaded.</span>
<span class="kt">int</span>
<span class="nf">init_module</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
        <span class="n">printk</span><span class="p">(</span><span class="n">KERN_DEBUG</span> <span class="s">"Hello world !!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">cleanup_module</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
        <span class="n">printk</span><span class="p">(</span><span class="n">KERN_DEBUG</span> <span class="s">"Goodbye world !!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></figure>

<p>The above lines are saved in a file named <em>‘task.c’</em> (you can choose any name you want). Don’t compile it yet. Let’s delve a little bit into the code; you can also read about it in the book if you want (I am writing it for myself, remember?).<br />
<strong>‘linux/module.h’</strong> is needed by every kernel module and <strong>‘linux/kernel.h’</strong> is needed for the macro expansion of the <em>printk()</em> log level. <em>printk()</em> is the logging mechanism for the kernel and is used to log information and give warnings.<br />
Kernel modules must have at least two functions: a “start” (initialization) function called <em>init_module()</em>, which is called when the module is loaded into the kernel with <em>insmod</em>, and an “end” (cleanup) function called <em>cleanup_module()</em>, which is called just before it is removed with <em>rmmod</em>.</p>

<figure class="highlight"><pre><code class="language-basemake" data-lang="basemake">obj-m := task.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

all:
        $(MAKE) -C $(KDIR) M=$(PWD) modules
 
clean:
        $(MAKE) -C $(KDIR) M=$(PWD) clean</code></pre></figure>

<p>The above lines are saved in a file named <em>‘Makefile’</em>; the indented recipe lines must begin with a tab character, not spaces. Now, open a terminal in the same directory and run</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>make</code></pre></figure>

<p>You should get an output similar to</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh">make <span class="nt">-C</span> /lib/modules/3.14.4/build <span class="nv">M</span><span class="o">=</span>/directory-residing-task-1-directory/task1 modules
make[1]: Entering directory <span class="sb">`</span>/media/minima/163be8fe-eab1-4f49-a06e-d21256f4cf00/linux-3.14.4<span class="s1">'
  CC [M]  /directory-residing-task-1-directory/task1/task.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /directory-residing-task-1-directory/task1/task.mod.o
  LD [M]  /directory-residing-task-1-directory/task.ko
make[1]: Leaving directory `/media/minima/163be8fe-eab1-4f49-a06e-d21256f4cf00/linux-3.14.4'</span></code></pre></figure>
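
<p>The task also allows picking the kernel tree through a variable. With <code class="language-plaintext highlighter-rouge">KDIR :=</code> as written in the Makefile above, a plain environment variable should not change the value (unless make is run with <code class="language-plaintext highlighter-rouge">-e</code>), while a value passed on the make command line does. Here is a throwaway sketch to see this for yourself, assuming GNU make; the paths, the <code class="language-plaintext highlighter-rouge">show</code> target and the <code class="language-plaintext highlighter-rouge">demo_makefile</code> helper are made up for the demo.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Hypothetical helper: print a tiny Makefile that only echoes KDIR.
# The \t becomes the tab that make requires before a recipe line.
demo_makefile() {
    printf 'KDIR := /default/build\nshow:\n\t@echo $(KDIR)\n'
}

# "-f -" makes GNU make read the Makefile from standard input.
demo_makefile | make -s -f - show                      # value set in the Makefile
demo_makefile | KDIR=/from/env make -s -f - show       # environment alone does not override ":="
demo_makefile | make -s -f - show KDIR=/from/cmdline   # command-line value wins</code></pre></figure>

<p>So for the module here, something like <code class="language-plaintext highlighter-rouge">make KDIR=/path/to/linux</code> (a placeholder path) would point the build at another kernel tree without editing the Makefile.</p>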

<p>A few files will be generated, like ‘task.ko’ and ‘task.o’.
Now load the module by running</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>insmod task.ko</code></pre></figure>

<p>To check if the module is loaded you can print the contents of file ‘/proc/modules’ by running</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">cat</span> /proc/modules | <span class="nb">grep </span>task</code></pre></figure>

<p>You should get an output like</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> task 12426 0 - Live 0x0000000000000000 <span class="o">(</span>POF<span class="o">)</span></code></pre></figure>

<p>This means that the module is loaded into memory. Now, to unload the module run</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>rmmod task</code></pre></figure>

<p>This will unload the module. Now, if you again run the following command, you won’t get anything.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">cat</span> /proc/modules | <span class="nb">grep </span>task</code></pre></figure>

<p>To check the output of the module you can run either of the following commands and you will see the result</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">cat</span> /var/log/syslog | <span class="nb">tail</span></code></pre></figure>

<p>or</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>dmesg | <span class="nb">tail</span></code></pre></figure>

<p>The first prints the last 10 lines of the file ‘/var/log/syslog’ and the second the last 10 lines of the kernel log; in either you will see the lines your module printed, like the following</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh">  <span class="o">[</span> 7513.661656] Hello world <span class="o">!!</span>
  <span class="o">[</span> 7578.795815] Goodbye world <span class="o">!!</span></code></pre></figure>
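
<p>If the log is noisy, the module’s lines can be filtered out instead of scanning the tail by eye. A small sketch; the helper name is hypothetical and the sample lines are modelled on the output above.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Hypothetical helper: keep only the lines printed by our module.
module_lines() {
    grep -E 'Hello world|Goodbye world'
}

# On a real system (reading the kernel log may need root):
#   dmesg | module_lines
# Here the filter is fed sample lines so it can be tried anywhere.
printf '%s\n' \
    '[ 7513.661656] Hello world !!' \
    '[ 7520.000000] some unrelated message' \
    '[ 7578.795815] Goodbye world !!' | module_lines</code></pre></figure>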

<p>Now, one thing though: if you run a command and see an error containing something like ‘Operation not permitted’, prefix that command with ‘sudo’ and enter your password, or run the command as root. For example -</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">sudo </span>insmod task.ko</code></pre></figure>

<hr />

<p>Now, if you followed the process, you will have successfully created your own sweet module. So, how does it feel to have created a new module? How does it feel to become a kernel developer?
Well, well, don’t start flying. You (and me too) have just made a small hello world module, which does nothing except print something to the system log.</p>

<p>References-<br />
<a href="http://www.tldp.org/LDP/lkmpg/2.6/lkmpg.pdf">The Linux Kernel Module Programming Guide</a> (pdf)</p>]]></content><author><name>Shivam Rana</name></author><category term="Linux" /><summary type="html"><![CDATA[How I am becoming a Linux kernel developer (at least, I think I am). There will be a series of these posts as I get ahead on my becoming a Linux kernel developer quest. These are the ones I have written yet.]]></summary></entry><entry><title type="html">‘Gaming the Github streak’ as Shadab said!!</title><link href="https://trigonaminima.github.io/2014/05/gaming-github/" rel="alternate" type="text/html" title="‘Gaming the Github streak’ as Shadab said!!" /><published>2014-05-30T00:00:00+00:00</published><updated>2014-05-30T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/05/gaming-github</id><content type="html" xml:base="https://trigonaminima.github.io/2014/05/gaming-github/"><![CDATA[<p>You should read the following Blog post first - <a href="http://dufferzafar.github.io/blog/2013/12/21/gaming-the-github-streak/">Gaming the Github Streak</a>, written by my very good but asshole friend Shadab on his blog, Duffer’s Log. Yeah, he is a real duffer!</p>

<p>Anyways, today I used what he did in that post. I am personally not a big fan of these streaks, but my meagre Github streak of 6 days (yeah, it was the highest I had attained) dropped to 0 when I didn’t push anything to Github yesterday. I felt bad, so I tried to cheat (Bwaahaha).</p>

<p>I changed the system date to the 29th, made some edits to a file and pushed it. And hell yeah, it worked. So I now have a current as well as longest Github streak of 8 (again, it’s my highest yet).
Now lets see how long I maintain the current streak. Adios.</p>]]></content><author><name>Shivam Rana</name></author><category term="Git" /><summary type="html"><![CDATA[You should read the following Blog post first - Gaming the Github Streak, written by my very good but asshole friend Shadab on his blog, Duffer’s Log. Yeah, he is a real duffer!]]></summary></entry><entry><title type="html">Children: The first priority #Vote4Children</title><link href="https://trigonaminima.github.io/2014/05/vote4children/" rel="alternate" type="text/html" title="Children: The first priority #Vote4Children" /><published>2014-05-13T00:00:00+00:00</published><updated>2014-05-13T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/05/vote4children</id><content type="html" xml:base="https://trigonaminima.github.io/2014/05/vote4children/"><![CDATA[<p>Before actually starting over lets discuss the facts we have in India</p>

<ul>
  <li>440 million or 40% of India’s Population are children.</li>
  <li>Over 14 million children remain out of school.</li>
  <li>After five years of classes, fewer than 60% can read a short story or do simple arithmetic.</li>
  <li>Official figures indicate that there are over 12 million child workers in India, but many NGOs reckon the real figure is up to 60 million.</li>
  <li>Article 24 of India’s constitution prohibits child labor.</li>
  <li>Outside of agriculture, child labor is observed in almost all informal sectors of the Indian economy.</li>
</ul>

<p>The above facts are taken from <a href="http://en.wikipedia.org/wiki/Child_labour_in_India">Wikipedia</a> and these <a href="http://www.friendsofsbt.org/statistics">stats</a>.</p>

<p>Education is both the means and the end to a better life: the means, because it empowers an individual to earn his/her livelihood, and the end, because it increases one’s awareness on a range of issues – from health-care to appropriate social behavior to understanding one’s rights – and in the process helps one evolve as a better citizen and live a prosperous life. Why children, you ask? Because if children are given a quality education, what we achieve is an informed and responsible citizen, an empowered individual. As they say, the children are the future of our country. Well, they are the future, no doubt about that, but how will that help if they are not going to be educated? How will they make informed decisions for their country? How will they form a government that promotes a healthy democratic environment? How will they establish businesses that become Fortune 500 companies? How will they make our country a developed country? And if children are not aware of the consequences of their actions, how will they make Earth a better place to live?</p>

<p>Sadly, several problems persist in India: issues of ‘social’ distance – arising out of caste, class and gender differences – deny children equal opportunities. Child labor in some parts of the country and resistance to sending girls to school remain real concerns. A lack of awareness in rural India undermines the perceived need for education. As the stats above show, 14 million children remain out of school, which speaks for itself. There are some places where the children and the parents are ready for education, but there are not enough resources: the education system there faces a shortage of schools, classrooms and teachers. There are also concerns relating to teacher training, the quality of the curriculum, assessment of learning achievements and the efficacy of school management. Given the scarcity of quality schools, many children drop out before completing five years of primary education; many of those who stay on learn little.</p>

<p>Many individuals and organizations have taken on the responsibility of tackling this grave situation of child education in India. Organizations like <strong><a href="https://www.savethechildren.in/">Save the Children</a></strong>, <strong><a href="http://goonj.org/">Goonj</a></strong> and <strong><a href="http://www.smilefoundationindia.org/">Smile Foundation</a></strong>, to name a few, are proactively involved in the development of children.</p>

<p>Take, for example, Save the Children’s <strong>“Children’s Manifesto”</strong> <a href="http://www.savethechildren.in/images/manifesto_final.pdf">(PDF)</a> for the Lok Sabha Election 2014, which takes children’s education, health and protection into account. Governments rarely prioritize children, and fail to recognize that they have rights. Recognizing this, the Children’s Manifesto appeals to all the political parties to give significant attention to issues related to child health, education and protection.</p>

<p>Some sources for similar articles and movements:</p>

<ul>
  <li><a href="http://www.savethechildren.in/images/manifesto_final.pdf">Children’s Manifesto</a></li>
  <li><a href="http://www.unicef.org/india/children_2359.htm">UNICEF</a></li>
  <li><a href="https://www.savethechildren.in/">Save the Children</a></li>
  <li><a href="http://goonj.org/">Goonj</a></li>
  <li><a href="http://www.smilefoundationindia.org/">Smile Foundation</a></li>
</ul>

<p>Further, if you really care about the future of our nation, and about children’s health, education and protection, then do sign the pledge <strong><a href="http://www.youthkiawaaz.com/vote4children">here</a></strong>. By signing the pledge you will be taking the first step in telling the decision makers that this pathetic situation needs to change.</p>

<p>This post is a part of the #Vote4Children Blog-a-thon on Youth Ki Awaaz. Find out more at: http://www.youthkiawaaz.com/vote4children</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[Before actually starting over lets discuss the facts we have in India]]></summary></entry><entry><title type="html">‘Open Terminal Here’ hotkeyed</title><link href="https://trigonaminima.github.io/2014/05/terminal-hotkeyed/" rel="alternate" type="text/html" title="‘Open Terminal Here’ hotkeyed" /><published>2014-05-04T00:00:00+00:00</published><updated>2014-05-04T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/05/terminal-hotkeyed</id><content type="html" xml:base="https://trigonaminima.github.io/2014/05/terminal-hotkeyed/"><![CDATA[<p>I expected that my first actual post would be about philosophy, computers or physics, that is, anything pseudo-intellectual, but instead I am going to write about a problem I faced in Linux. I wanted a hotkey for “Open in Terminal” (as Linux Mint calls it; the Ubuntu variant is “Open Terminal here”). Without a hotkey, one has to right-click inside the directory opened in the file manager (Nemo in Linux Mint) and then select “Open in Terminal”.
Having a hotkey for this comes in handy and makes life simpler. I googled it (yeah, some people can’t even do that properly, even <strong>keyboard ninjas</strong>), looked through a few posts on the Linux Mint forum and found the following solution.</p>

<p>Copy the following code snippet into a file (name it anything you want) and place that file in <em>~/.gnome2/nemo-scripts</em>:</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/bash</span>
<span class="nb">cd</span> <span class="s2">"</span><span class="nv">$NEMO_SCRIPT_CURRENT_URI</span><span class="s2">"</span>
<span class="nb">exec </span>gnome-terminal</code></pre></figure>

<p>Next, make the file executable, either by running the command below in that directory or by opening the file’s properties and ticking “Allow executing file as program” in the Permissions tab.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nb">chmod</span> +x your-file.sh</code></pre></figure>
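
<p>The two steps above can also be wrapped into one small helper. Treat this as a sketch only: the function name <code>install_nemo_script</code>, its two optional arguments and the default file name <code>open-terminal</code> are my own placeholders, not something from the forum answer. It writes the same three-line script, with the variable quoted so that a path containing spaces stays one word, and then marks it executable.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh">#!/bin/bash
# Hypothetical helper (not from the forum answer):
#   install_nemo_script [SCRIPTS_DIR] [SCRIPT_NAME]
install_nemo_script() {
    local dir="${1:-$HOME/.gnome2/nemo-scripts}"
    local name="${2:-open-terminal}"
    local target="$dir/$name"

    # Create the scripts directory if it does not exist yet.
    mkdir -p "$dir" || return 1

    # Write the three-line script from this post, one line per argument.
    printf '%s\n' \
        '#!/bin/bash' \
        'cd "$NEMO_SCRIPT_CURRENT_URI"' \
        'exec gnome-terminal' > "$target" || return 1

    # Same effect as ticking "Allow executing file as program".
    chmod +x "$target"
}</code></pre></figure>

<p>Calling <code>install_nemo_script</code> with no arguments uses the directory from this post; passing a different directory as the first argument is handy for trying it out without touching your real scripts folder.</p>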

<p>Now wait a few minutes for the accels file to be regenerated, then open <em>~/.gnome2/accels/nemo</em> and find a line similar to:</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="p">;</span> <span class="o">(</span>gtk_accel_path <span class="s2">"&lt;Actions&gt;/ScriptsGroup/script_file:</span><span class="se">\\</span><span class="s2">s</span><span class="se">\\</span><span class="s2">s</span><span class="se">\\</span><span class="s2">shome</span><span class="se">\\</span><span class="s2">sd</span><span class="se">\\</span><span class="s2">s.gnome2</span><span class="se">\\</span><span class="s2">snemo-scripts</span><span class="se">\\</span><span class="s2">sopen-terminal"</span> <span class="s2">""</span><span class="o">)</span></code></pre></figure>

<p>Edit it by removing the ‘;’ at the beginning (which uncomments the line) and putting your own hotkey between the empty pair of double quotes at the end.
For example, if you want <strong>ctrl+j</strong> as the hotkey, it becomes <strong><code class="language-plaintext highlighter-rouge">&lt;Primary&gt;</code>j</strong>. In place of <strong>Primary</strong> you can use <strong>Alt</strong> or <strong>Shift</strong> or any combination of these.</p>
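
<p>As an illustration, this is roughly what the line from my example looks like once it is uncommented and given <strong>ctrl+j</strong>. The script path is just the one shown above; yours will contain your own home directory and script name.</p>

<figure class="highlight"><pre><code class="language-sh" data-lang="sh">(gtk_accel_path "&lt;Actions&gt;/ScriptsGroup/script_file:\\s\\s\\shome\\sd\\s.gnome2\\snemo-scripts\\sopen-terminal" "&lt;Primary&gt;j")</code></pre></figure>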

<p>You can check out the forum answer <strong><a href="http://forums.linuxmint.com/viewtopic.php?f=90&amp;t=146565#p773382">here</a></strong>.</p>]]></content><author><name>Shivam Rana</name></author><category term="Linux" /><summary type="html"><![CDATA[I expected that my first actual post would be about something related to Philosophy or Computer or Physics that is, anything pseudo intellectual, but I am going to write about the problem I faced in Linux. The problem was to have a hotkey for “Open in Terminal” (in Linux Mint) with variants as “Open Terminal here” (in Ubuntu). Without the hotkey one have to right click in the respective directory opened in file manager (Nemo in Linux Mint) and then select “Open in Terminal”. Having a hotkey for this comes in handy and makes the life simpler. I googled it (Yeah, some people couldn’t even do that properly, even keyboard ninjas), looked through a few posts of Linux mint forum and found the solution as follows.]]></summary></entry><entry><title type="html">Hello World!!</title><link href="https://trigonaminima.github.io/2014/04/hello-world/" rel="alternate" type="text/html" title="Hello World!!" /><published>2014-04-08T00:00:00+00:00</published><updated>2014-04-08T00:00:00+00:00</updated><id>https://trigonaminima.github.io/2014/04/hello-world</id><content type="html" xml:base="https://trigonaminima.github.io/2014/04/hello-world/"><![CDATA[<p>Hello World!!</p>

<p>This line is the classic way to start anything in the computer programming world, whether it is the first program in a programming language or a hello world OS, so it seemed to me the only way to start blogging. I don’t know what happens to people who break with this tradition, but I think it’s safer not to find out.</p>

<p>I hope to write here about the things that concern or interest me: physics, computers, philosophy, psychology, biology, reading, writing, etc. I guess that basically means anything and everything.</p>

<p>I also intend to write about the common problems I face while working on anything, and maybe post their solutions too, along with the things I keep myself busy with. Let’s just hope that I will be writing frequently on this blog.</p>

<p>Let the Force be strong with this Blog.</p>]]></content><author><name>Shivam Rana</name></author><category term="General" /><summary type="html"><![CDATA[Hello World!!]]></summary></entry></feed>