<p style="color:grey;font-size:15px"><i>Basic introduction to Bayesian Methodology · 2021-12-15 · <a href="https://genzstoic.xyz/education/stats&amp;probability/2021/12/15/intro-to-bayesian-theorem">genzstoic.xyz</a></i></p>
<h1 id="bayesian-methodology">Bayesian Methodology</h1>
<h3 id="bayesian-way-of-thinking">Bayesian way of thinking</h3>
<blockquote>
<p>You have a <strong>belief</strong> that the probability of getting a head (let’s say 0.8) is higher than that of a tail because of your <strong>past experience</strong>. You challenge your friend to a long coin-flipping contest over a period of 12 months. After every month, you <strong>update</strong> your previous belief based on the <strong>new evidence</strong> from the coin flips and readjust the probability. At the end of the year, your belief of getting a head is 0.5, with some <strong>uncertainty</strong> involved.</p>
</blockquote>
<ul>
<li>In essence, Bayesian methodology is about having a prior belief <strong>(prior probability)</strong> about an event and updating that belief on seeing new evidence <strong>(likelihood)</strong>, thereby forming a new belief <strong>(posterior probability)</strong> about the event.</li>
</ul>
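<p>A minimal sketch of this update loop, using a Beta distribution over the head probability (a standard conjugate choice, assumed here for illustration; the flip data below is made up):</p>

```python
# Beta(a, b) prior over P(head); its mean is a / (a + b).
a, b = 8, 2                              # encodes the initial belief P(head) ≈ 0.8
flips = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # new evidence: 1 = head, 0 = tail

for x in flips:
    # conjugate update: each head bumps a, each tail bumps b
    a, b = a + x, b + (1 - x)

posterior_mean = a / (a + b)
print(posterior_mean)  # 0.6: belief pulled from 0.8 toward the observed 40% heads
```

With more flips the posterior mean converges to the observed head frequency, and the spread of the Beta (the uncertainty) shrinks.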
<h3 id="advantages-of-using-a-bayesian-approach">Advantages of using a Bayesian approach</h3>
<ul>
<li>
<p>Bayes’s theorem helps in quantifying the <strong>uncertainty</strong> of an event using a probabilistic model, which differentiates it from the frequentist approach, where a point estimate is calculated that doesn’t give much insight about the event.</p>
</li>
<li>
<p>The Bayesian approach also integrates the domain knowledge (prior) of the event that we are modeling, thereby providing an informative starting point for our model. Note, though, that an uninformative prior can be used to reflect a balance among outcomes when no information is available.</p>
</li>
</ul>
<h3 id="random-variable">Random variable</h3>
<ul>
<li>
<p>A random variable is described as a variable whose values depend on a random process.</p>
<ul>
<li>For example, in the case of our coin flip challenge, a variable <code class="language-plaintext highlighter-rouge">X</code> could be considered a random variable whose value is either 1 (head) or 0 (tail) depending on the outcome of the coin flip (the random process)</li>
<li>Here <code class="language-plaintext highlighter-rouge">X</code> is a <strong>discrete random variable</strong> as it can only take discrete values. Discrete data can only take certain values; for example, the number of students in a class (we can’t have half a student).</li>
</ul>
<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Random_Variable_as_a_Function-en.svg/440px-Random_Variable_as_a_Function-en.svg.png" style="background:#FFFFFF;padding: 0px 10px 0px" />
</p>
<ul>
<li>Temperature, age, height, and weight are all examples of <strong>continuous random variables</strong>, as they can take infinitely many values. For example, a random variable measuring the time taken for something to be done is continuous, since there are an infinite number of possible times that can be taken.</li>
</ul>
</li>
</ul>
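<p>The two kinds of random variable can be simulated with NumPy (a small sketch; the sample sizes and the exponential’s mean are arbitrary choices):</p>

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# discrete random variable: X is 1 (head) or 0 (tail) for each coin flip
X = rng.integers(0, 2, size=10)

# continuous random variable: T, e.g. time taken for a task,
# can take any of infinitely many non-negative real values
T = rng.exponential(scale=5.0, size=10)

print("discrete   X:", X)
print("continuous T:", T.round(2))
```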
<h3 id="probability-distribution">Probability Distribution</h3>
<ul>
<li>
<p>It can be considered a function that takes a random variable, let’s say <code class="language-plaintext highlighter-rouge">X</code>, and assigns a probability to each of its outcomes.</p>
<ul>
<li>A <strong>Bernoulli distribution</strong> can be used to represent the random variable <code class="language-plaintext highlighter-rouge">X</code> modelling two possible outcomes. A Bernoulli distribution is discrete, as opposed to continuous, since only 1 or 0 is a valid outcome.</li>
</ul>
<p align="center">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRq5ZCwpkt6VoWgNGKJMI1c7jypa5eUUnz0YgEK1gX0WF1DKuQl" />
</p>
<ul>
<li>The most commonly used distribution is the <strong>Gaussian distribution</strong>, also referred to as the <strong>Normal distribution</strong> or <strong>bell curve</strong>, which is used frequently in finance, investing, science, and engineering.</li>
</ul>
<p align="center">
<img src="https://miro.medium.com/max/2000/1*s_4OIdPSzuZcevhBdOmCQA.png" style="background:#FFFFFF;padding: 0px 10px 0px" />
</p>
<p align="center">
<img src="https://miro.medium.com/max/259/1*6_123CZ7Ni9WfTXdtlxDcw.jpeg" />
</p>
<ul>
<li>
<p>It is so popular mainly because of three reasons:</p>
<ol>
<li><strong>Ubiquitous in natural phenomena</strong></li>
</ol>
<ul>
<li>An incredible number of processes in nature and the social sciences naturally follow the Gaussian distribution. Even when they don’t, the Gaussian often gives a good model approximation for these processes.</li>
</ul>
<ol>
<li><strong>Mathematical Reason: Central Limit Theorem</strong></li>
</ol>
<ul>
<li>The central limit theorem states that when we add a large number of independent random variables, irrespective of their original distributions, their normalized sum tends towards a Gaussian distribution. For example, the distribution of the total distance covered in a random walk tends towards a Gaussian probability distribution.</li>
<li>Once a Gaussian, always a Gaussian!
Unlike many other distributions that change their nature under transformation, a Gaussian tends to remain a Gaussian.
<ul>
<li>The product of two Gaussian densities is (up to normalization) a Gaussian</li>
<li>The sum of two independent Gaussian random variables is a Gaussian</li>
<li>The convolution of a Gaussian with another Gaussian is a Gaussian</li>
<li>The Fourier transform of a Gaussian is a Gaussian</li>
</ul>
</li>
</ul>
<ol>
<li><strong>Simplicity</strong></li>
</ol>
<ul>
<li>For every Gaussian model approximation, there may exist a more complex multi-parameter distribution that gives a better approximation. Still, the Gaussian is preferred because it makes the math a lot simpler!</li>
<li>Its mean, median and mode are all the same</li>
<li>The entire distribution can be specified using just two parameters: mean and variance</li>
</ul>
</li>
</ul>
</li>
</ul>
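<p>The central limit theorem claim above is easy to check empirically: individual Uniform(0, 1) draws are flat, not bell-shaped, yet their sums cluster around a Gaussian with mean n/2 and variance n/12. A quick sketch (n and the number of trials are arbitrary choices):</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n, trials = 50, 100_000
# each row: 50 independent Uniform(0, 1) draws; sum each row
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)

# the CLT predicts mean n/2 = 25 and variance n/12 ≈ 4.17 for the sum
print("sample mean:", round(sums.mean(), 2))   # close to 25.0
print("sample var :", round(sums.var(), 2))    # close to 4.17
```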
<h3 id="mathematical-representation-of-bayess-theorem-is-given-by">The mathematical representation of Bayes’s theorem is given by:</h3>
<p align="center">
<img src="https://miro.medium.com/max/1400/1*CnoTGGO7XeUpUMeXDrIfvA.png" width="70%" alt="Bayes' Theorem" />
</p>
<blockquote>
<p>There are specific techniques that can be used to quantify the probability for multiple random variables, such as the <code class="language-plaintext highlighter-rouge">joint</code>, <code class="language-plaintext highlighter-rouge">marginal</code>, and <code class="language-plaintext highlighter-rouge">conditional probability</code>. These techniques provide the basis for a probabilistic understanding of fitting a predictive model to data.</p>
</blockquote>
<h3 id="joint-probability">Joint Probability</h3>
<p><em>The probability of two (or more) events is called the joint probability.</em></p>
<p>For example, the joint probability of event A and event B is written formally as: <code class="language-plaintext highlighter-rouge">P(A and B)</code></p>
<p>The “and”, or conjunction, is denoted using the wedge operator “∧” (commonly typed as a caret “^”) or sometimes a comma “,”.</p>
<p><code class="language-plaintext highlighter-rouge">P(A ^ B), P(A, B), P(AB)</code></p>
<p>The joint probability for events A and B is calculated as the probability of event A given event B multiplied by the probability of event B.</p>
<p>This can be stated formally as follows: <code class="language-plaintext highlighter-rouge">P(A and B) = P(A given B) * P(B)</code></p>
<ul>
<li>
<p>The calculation of the joint probability is sometimes called the fundamental rule of probability, the <strong>“product rule”</strong>, or the <strong>“chain rule”</strong> of probability.</p>
</li>
<li>
<p>Here, <code class="language-plaintext highlighter-rouge">P(A given B)</code> is the probability of event A given that event B has occurred, called the conditional probability, described below.</p>
</li>
<li>
<p>The joint probability is <em>symmetrical</em>, meaning that <code class="language-plaintext highlighter-rouge">P(A and B)</code> is the same as <code class="language-plaintext highlighter-rouge">P(B and A)</code>. The calculation using the conditional probability is also symmetrical, for example:</p>
<p><code class="language-plaintext highlighter-rouge">P(A and B) = P(A given B) * P(B) = P(B given A) * P(A)</code></p>
</li>
</ul>
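<p>Both forms of the product rule can be verified by brute-force enumeration. A toy setup (not from the original post): two fair dice, with A = “first die shows 6” and B = “the total is at least 10”:</p>

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    # probability of an event = favourable outcomes / total outcomes
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

A = lambda o: o[0] == 6        # first die is a six
B = lambda o: sum(o) >= 10     # total is 10 or more
AB = lambda o: A(o) and B(o)

p_a_given_b = prob(AB) / prob(B)   # conditional probability P(A given B)
p_b_given_a = prob(AB) / prob(A)   # conditional probability P(B given A)

# P(A and B) = P(A given B) * P(B) = P(B given A) * P(A)
assert prob(AB) == p_a_given_b * prob(B) == p_b_given_a * prob(A)
print(prob(AB))  # 1/12
```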
<h3 id="marginal-probability">Marginal Probability</h3>
<p><em>Marginal probability is the probability of an event irrespective of the outcome of another variable.</em></p>
<ul>
<li>
<p>The probability of the evidence P(B) can be calculated using the law of total probability. If the events <code class="language-plaintext highlighter-rouge">A_i</code> form a partition of the sample space, which is the set of all outcomes of an experiment, then,</p>
<p><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/8867372a5e203835363cbad1ae1cbaa1defb15af" style="background:#FFFFFF;" /></p>
</li>
<li>When there are an infinite number of outcomes, it is necessary to integrate over all outcomes to calculate <code class="language-plaintext highlighter-rouge">P(B)</code> using the law of total probability.</li>
<li>
<p>Often, <code class="language-plaintext highlighter-rouge">P(B)</code> is difficult to calculate, as it involves sums or integrals that are time-consuming to evaluate, so frequently only the product of the prior and likelihood is considered, since the evidence does not change within the same analysis. The posterior is proportional to this product:</p>
<p><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e1a83fc9b2788b4a72bbc4c90d06c67bb7e0fdae" style="background:#FFFFFF;" /></p>
</li>
<li>The <strong>maximum a posteriori</strong> (MAP) estimate, which is the mode of the posterior and is often computed in Bayesian statistics using mathematical optimization methods, remains the same. The posterior can even be approximated without computing the exact value of <code class="language-plaintext highlighter-rouge">P(B)</code>, using methods such as <strong>Markov chain Monte Carlo</strong> or <strong>variational inference</strong>.</li>
</ul>
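<p>The law of total probability can be sketched numerically with a hypothetical setup: a coin is picked from a drawer holding fair coins (probability 0.7) and biased coins that land heads 90% of the time (probability 0.3). All numbers here are illustrative, not from the original post:</p>

```python
# partition of the sample space: which kind of coin was picked
p_coin = {"fair": 0.7, "biased": 0.3}
# conditional probability of heads given each kind of coin
p_heads_given = {"fair": 0.5, "biased": 0.9}

# law of total probability: P(B) = sum_i P(B | A_i) * P(A_i)
p_heads = sum(p_heads_given[c] * p_coin[c] for c in p_coin)
print(round(p_heads, 2))  # 0.5*0.7 + 0.9*0.3 = 0.62

# Bayes' theorem: posterior probability the coin was biased, given heads
p_biased_given_heads = p_heads_given["biased"] * p_coin["biased"] / p_heads
print(round(p_biased_given_heads, 3))
```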
<h3 id="conditional-probability">Conditional Probability</h3>
<p><em>The probability of one event given the occurrence of another event is called the conditional probability.</em></p>
<ul>
<li>
<p>Types of event:</p>
<ul>
<li>
<p><strong>Independent event:</strong> Each event is not affected by any other events.</p>
<p>For example, tossing a coin: each toss is a perfectly isolated event. What the coin did in the past will not affect the current toss.</p>
</li>
<li>
<p><strong>Dependent event:</strong> They can be affected by previous events.</p>
<p>For example: Marbles in a bag.</p>
<p>2 blue and 3 red marbles are in a bag. What are the chances of getting a blue marble? The chance is 2 in 5, but after taking one out the chances change!</p>
<p>So the next time:</p>
<ul>
<li>
<p>If we got a red marble before, then the chance of a blue marble next is 2 in 4</p>
</li>
<li>
<p>If we got a blue marble before, then the chance of a blue marble next is 1 in 4</p>
<p><img src="https://www.mathsisfun.com/data/images/probability-marbles1.svg" /></p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">P(B|A)</code> is also called the <strong>“Conditional Probability”</strong> of B given A.</p>
<p>And in our case: <code class="language-plaintext highlighter-rouge">P(B|A) = 1/4</code></p>
<p>So the probability of getting 2 blue marbles is:</p>
<p><img src="https://www.mathsisfun.com/data/images/probability-marbles-tree4.svg" /></p>
<p>And we write it as:</p>
<p><img src="https://www.mathsisfun.com/data/images/probability-independent-formula1.svg" /></p>
<p>Probability of <strong>event A</strong> and <strong>event B</strong> equals the probability of <strong>event A</strong> times the probability of <strong>event B given event A.</strong></p>
<p>Using Algebra we can also “change the subject” of the formula, like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start with:
P(A and B) = P(A) x P(B|A)
Swap sides:
P(A) x P(B|A) = P(A and B)
Divide by P(A):
P(B|A) = P(A and B) / P(A)
</code></pre></div></div>
<p>And we have another useful formula:</p>
<p><img src="https://www.mathsisfun.com/data/images/probability-independent-formula2.gif" /></p>
<p>The probability of <strong>event B given event A</strong> equals the probability of <strong>event A and event B</strong> divided by the probability of <strong>event A.</strong></p>
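<p>The marble numbers above can be checked by enumerating every ordered draw of two marbles from the bag of 2 blue and 3 red:</p>

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue", "blue", "red", "red", "red"]

# all equally likely ordered draws of two marbles (marbles treated as distinct)
draws = list(permutations(bag, 2))                    # 5 * 4 = 20 draws

both_blue = [d for d in draws if d == ("blue", "blue")]
first_blue = [d for d in draws if d[0] == "blue"]

p_both_blue = Fraction(len(both_blue), len(draws))
p_second_blue_given_first = Fraction(len(both_blue), len(first_blue))

print(p_both_blue)                  # 1/10, matching 2/5 * 1/4
print(p_second_blue_given_first)    # 1/4
```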
<h1 id="coin-flip-example">Coin flip example</h1>
<details>
<summary style="cursor: pointer;"><b>View Code</b></summary>
<pre><code lang="python">
%matplotlib inline
import collections

import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# tensorflow eager: tf ops are evaluated immediately
# and produce concrete results
try:
    tf.compat.v1.enable_eager_execution()
except ValueError:
    pass


class _TFColor(object):
    """Enum of colors used in TF docs."""
    red = '#F15854'
    blue = '#5DA5DA'
    orange = '#FAA43A'
    green = '#60BD68'
    pink = '#F17CB0'
    brown = '#B2912F'
    purple = '#B276B2'
    yellow = '#DECF3F'
    gray = '#4D4D4D'

    def __getitem__(self, i):
        return [
            self.red,
            self.orange,
            self.green,
            self.blue,
            self.pink,
            self.brown,
            self.purple,
            self.yellow,
            self.gray,
        ][i % 9]


TFColor = _TFColor()

# Build graph
rv_coin_flip_prior = tfd.Bernoulli(probs=0.5, dtype=tf.int32)
num_trials = tf.constant([0, 1, 2, 3, 4, 5, 8, 15, 50, 500, 1000, 2000])
coin_flip_data = rv_coin_flip_prior.sample(num_trials[-1])

# prepend a 0 onto the tally of heads and tails, for the zeroth flip
coin_flip_data = tf.pad(coin_flip_data, tf.constant([[1, 0]]), "CONSTANT")

# compute cumulative headcounts from 0 to 2000 flips,
# and then grab them at each of the num_trials intervals
cumulative_headcounts = tf.gather(tf.cumsum(coin_flip_data), num_trials)

# Beta is the conjugate prior of the Binomial likelihood, so the posterior
# after h heads in n flips (with a uniform prior) is Beta(1 + h, 1 + n - h)
rv_observed_heads = tfd.Beta(
    concentration1=tf.cast(1 + cumulative_headcounts, tf.float32),
    concentration0=tf.cast(1 + num_trials - cumulative_headcounts, tf.float32))

probs_of_heads = tf.linspace(start=0., stop=1., num=100, name="linspace")
observed_probs_heads = tf.transpose(
    rv_observed_heads.prob(probs_of_heads[:, tf.newaxis]))


def evaluate(tensors):
    """Evaluates `Tensor`s or `EagerTensor`s to NumPy `ndarray`s.

    Args:
      tensors: Object of `Tensor`s or `EagerTensor`s; can be `list`, `tuple`,
        `namedtuple` or combinations thereof.

    Returns:
      ndarrays: Object with same structure as `tensors`, except with `Tensor`
        or `EagerTensor`s replaced by NumPy `ndarray`s.
    """
    if tf.executing_eagerly():
        return tf.nest.pack_sequence_as(
            tensors,
            [t.numpy() if tf.is_tensor(t) else t
             for t in tf.nest.flatten(tensors)])
    return sess.run(tensors)  # assumes a `sess` exists in graph mode


# Execute graph
[num_trials_,
 probs_of_heads_,
 observed_probs_heads_,
 cumulative_headcounts_,
] = evaluate([
    num_trials,
    probs_of_heads,
    observed_probs_heads,
    cumulative_headcounts,
])

plt.figure(figsize(16.0, 9.0))
for i in range(len(num_trials_)):
    sx = plt.subplot(len(num_trials_) // 2, 2, i + 1)
    if i in [0, len(num_trials_) - 1]:
        plt.xlabel("$p$, probability of heads")
    plt.setp(sx.get_yticklabels(), visible=False)
    plt.plot(probs_of_heads_, observed_probs_heads_[i],
             label="observe %d tosses,\n %d heads"
                   % (num_trials_[i], cumulative_headcounts_[i]))
    plt.fill_between(probs_of_heads_, 0, observed_probs_heads_[i],
                     color=TFColor[3], alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)
    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities", y=1.02,
             fontsize=14)
plt.tight_layout()
</code></pre>
</details>
<p align="center">
<img src="/assets/img/coin-flip.png" style="background:#FFFFFF;" />
</p>
<p><br /></p>
<p style="color:grey;font-size:15px"><i>This was originally written in a jupyter notebook which can be accessed <a href="https://github.com/sr1jan/ProbabilisticGraphicalModelling/blob/master/PGM.ipynb" target="_blank">here</a>.</i></p>
<h1 id="does-genz-have-a-shorter-attention-span">Does GenZ have a shorter attention span?</h1>
<p style="color:grey;font-size:15px"><i>2020-12-24 · <a href="https://genzstoic.xyz/thoughts/2020/12/24/genz-attention-span">genzstoic.xyz</a></i></p>
<h3 id="intuitive-answer-yes">intuitive answer: yes!</h3>
<ul>
<li><strong>reasoning</strong>: harder to focus on one thing at a time.
<ul>
<li>TL;DR - frequent context switching to ensure constant or increasing flow of dopamine.</li>
<li>it doesn’t matter if I am consuming low- or high-quality dopamine (low: watching reaction videos; high: working on a side project; <del>subjective to your life philosophy</del>).</li>
<li>if I find a higher dopamine source I’ll do the context switch (consciously ignoring any procrastination cue).</li>
<li><strong>case study</strong>: me working on a side project
<ul>
<li>focus time: setup git repo; open and skim through required documentation, reference project, vi editor; initial boilerplate code; push an initial commit. (~30mins, feel: I started something, deserve a break now)</li>
<li>interval time: twitter; skim through some articles; check IG, whatsapp (constantly trying to maintain or increase the dopamine levels)
<ul>
<li>maybe let’s go back to work? YES! but let’s get some coffee first.</li>
<li>makes coffee; more twitter; some IG; etc (~25mins, feel: productivity degradation, should work now)</li>
</ul>
</li>
<li>hence this cycle keeps on going until the focus time ceases to exist.</li>
<li><strong>causes behind this degradation of focus time</strong>:
<ul>
<li>relatively easy dopamine source available elsewhere (thanks to digital age corporations optimizing for maximum engagements because of their advertisement business model)</li>
<li>the emotional brain triumphs over the logical brain (instant gratification)</li>
</ul>
</li>
</ul>
</li>
<li><strong>other examples</strong>:
<ul>
<li>reading youtube comments while watching the video.</li>
<li>using multiple apps on multiple devices at once.</li>
<li>constantly switching between apps to ensure a constant flow of dopamine rush.</li>
<li>exponential growth of interval time leading to degradation of focus time.</li>
<li>conscious ignorance of procrastination cues.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="a-common-counter-argument-that-ive-noticed-recently">a common counter-argument that I’ve noticed recently:</h3>
<ul>
<li><a href="https://twitter.com/Julian/status/1335270227727151107">“People don’t have shorter attention span”</a>. they finish 3-hour Joe Rogan episodes, they binge 14-hour shows.
<ul>
<li>essentially alluding to the point that a provider must guarantee and convince its consumer in the first few minutes of consumption that they’ll receive ever-increasing levels of dopamine.</li>
</ul>
</li>
<li>IMO, although this point does make sense, it is limited to easy or passive material.</li>
<li>one could consume high-quality, engaging material on <em><code class="language-plaintext highlighter-rouge">microservice architecture</code></em>, but unless they have built such a system themselves it will never transfer as a skill (even psychologically).</li>
<li>and the only way to actively do something is by embracing delayed gratification and triumphing over the procrastination cues.</li>
</ul>
<h3 id="why-just-genz-and-not-the-previous-generations">why just genz and not the previous generations?</h3>
<ul>
<li>wanted to limit the scope of contemplation.</li>
<li>I am part of genz.</li>
<li>there is a trade-off to consider when comparing genz with previous generations: growing up with the internet, genz get access to information quite easily, but on the other hand they are exposed to a lot more noise and cheap dopamine.</li>
<li>IMO, we are better off than previous generations if we can develop and strengthen <a href="https://en.wikipedia.org/wiki/Critical_thinking">critical thinking</a> and <a href="https://www.lesswrong.com/rationality">rationality</a>.</li>
</ul>