
Commit 0a5baee

author
Hugo Bowne-Anderson
authored
Add text to PyMC3 section NB2
1 parent fea5529 commit 0a5baee

1 file changed

Lines changed: 111 additions & 8 deletions

File tree

notebooks/2.Parameter_estimation_hypothesis_testing.ipynb

@@ -264,13 +264,54 @@
264264
"## 3. Bayesian parameter estimation using PyMC3"
265265
]
266266
},
267+
{
268+
"cell_type": "markdown",
269+
"metadata": {},
270+
"source": [
271+
"Well done! You've learnt the basics of Bayesian model building. The steps are\n",
272+
"1. Completely specify the model in terms of _probability distributions_. This includes specifying \n",
273+
" - what the form of the sampling distribution of the data is _and_ \n",
274+
" - what form describes our _uncertainty_ in the unknown parameters (This formulation is adapted from [Fonnesbeck's workshop](https://github.com/fonnesbeck/intro_stat_modeling_2017/blob/master/notebooks/2.%20Basic%20Bayesian%20Inference.ipynb) as Chris said it so well there).\n",
275+
"2. Calculate the _posterior distribution_.\n",
276+
"\n",
277+
"In the above, the form of the sampling distribution of the data was Binomial (described by the likelihood) and the uncertainty around the unknown parameter $p$ was captured by the prior."
278+
]
279+
},
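For this particular model (Binomial likelihood, uniform prior) the posterior is actually available in closed form via Beta-Binomial conjugacy, which makes a handy sanity check for the PyMC3 results later. A minimal sketch using scipy, with illustrative counts (`N = 100` trials, `k = 17` successes are assumptions, not the notebook's data):

```python
from scipy import stats

# Illustrative counts, not the notebook's data
N, k = 100, 17

# A Uniform(0, 1) prior on p is Beta(1, 1); combined with a Binomial
# likelihood, conjugacy gives a Beta(1 + k, 1 + N - k) posterior
posterior = stats.beta(1 + k, 1 + N - k)

print(posterior.mean())  # posterior mean of p
```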
280+
{
281+
"cell_type": "markdown",
282+
"metadata": {},
283+
"source": [
284+
"Now it is time to do the same using the **probabilistic programming language** PyMC3. There are _loads_ of reasons to use PyMC3 and this paradigm, two of which are\n",
285+
"- _probability distributions_ are first-class citizens, in that we can assign them to variables and use them intuitively to mirror how we think about priors, likelihoods & posteriors;\n",
286+
"- PyMC3 calculates the posterior for us!\n",
287+
"\n",
288+
"Under the hood, PyMC3 will compute the posterior either with a sampling-based approach called Markov chain Monte Carlo (MCMC) or with an approximation technique called Variational Inference. Check the [PyMC3 docs](https://docs.pymc.io/) for more on these. \n",
289+
"\n",
290+
"But now, it's time to bust out some MCMC and get sampling!"
291+
]
292+
},
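To see what "sampling-based" means, here is a toy random-walk Metropolis sampler for the coin-flip posterior — purely illustrative and far simpler than what PyMC3 actually runs (the counts `N`, `k` and all tuning values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
N, k = 100, 17                      # illustrative trials / successes

def log_post(p):
    """Log posterior: Binomial likelihood x Uniform(0, 1) prior, up to a constant."""
    if not 0 < p < 1:
        return -np.inf
    return k * np.log(p) + (N - k) * np.log(1 - p)

p, trace = 0.5, []
for _ in range(5000):
    proposal = p + rng.normal(0, 0.05)                        # random-walk step
    if np.log(rng.uniform()) < log_post(proposal) - log_post(p):
        p = proposal                                          # accept the move
    trace.append(p)                                           # else keep current p

print(np.mean(trace[1000:]))  # should hover near the analytic mean 18/102
```

Discarding the first 1000 draws as burn-in, the trace average lands close to the conjugate Beta posterior mean, which is the whole point of MCMC: draw samples whose distribution approximates the posterior.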
267293
{
268294
"cell_type": "markdown",
269295
"metadata": {},
270296
"source": [
271297
"### Parameter estimation I: click-through rate"
272298
]
273299
},
300+
{
301+
"cell_type": "markdown",
302+
"metadata": {},
303+
"source": [
304+
"A common experiment in tech data science is to test a product change and see how it affects a metric that you're interested in. Say that I don't think enough people are clicking a button on my website, and I hypothesize that it's because the button is a similar color to the background of the page. I can then set up two pages and send some visitors to each: the first is the original page, while the second is identical except that its button has higher contrast, and I can see whether more people click through. This is commonly referred to as an A/B test, and the metric of interest is the click-through rate (CTR): the proportion of people who click through. Before comparing two rates, let's use PyMC3 to estimate one.\n",
305+
"\n"
306+
]
307+
},
308+
{
309+
"cell_type": "markdown",
310+
"metadata": {},
311+
"source": [
312+
"First generate click-through data, given a CTR $p_a=0.15$."
313+
]
314+
},
274315
{
275316
"cell_type": "code",
276317
"execution_count": null,
@@ -283,6 +324,17 @@
283324
"n_successes_a = np.sum(np.random.binomial(N,p_a))"
284325
]
285326
},
327+
{
328+
"cell_type": "markdown",
329+
"metadata": {},
330+
"source": [
331+
"Now it's time to build your probability model. Since each visit results in a click or not with a constant CTR, our model is a biased coin flip, so\n",
332+
"- the sampling distribution is binomial, and we need to encode this in the likelihood;\n",
333+
"- there is a single unknown parameter $p$, whose uncertainty we describe with a prior; we'll use a uniform prior here.\n",
334+
"\n",
335+
"These are the ingredients for the model so let's now build it:"
336+
]
337+
},
286338
{
287339
"cell_type": "code",
288340
"execution_count": null,
@@ -294,14 +346,43 @@
294346
" # Prior on p\n",
295347
" prob = pm.Uniform('p')\n",
296348
" # Binomial Likelihood\n",
297-
" y = pm.Binomial('y', n=N, p=prob, observed=n_successes_a)\n",
298-
"\n",
349+
" y = pm.Binomial('y', n=N, p=prob, observed=n_successes_a)"
350+
]
351+
},
352+
{
353+
"cell_type": "markdown",
354+
"metadata": {},
355+
"source": [
356+
"**Discussion:** \n",
357+
"- What do you think of the API for PyMC3? Does it reflect how we think about model building?"
358+
]
359+
},
360+
{
361+
"cell_type": "markdown",
362+
"metadata": {},
363+
"source": [
364+
"It's now time to sample from the posterior using PyMC3. You'll also plot the posterior:"
365+
]
366+
},
367+
{
368+
"cell_type": "code",
369+
"execution_count": null,
370+
"metadata": {},
371+
"outputs": [],
372+
"source": [
299373
"with model:\n",
300-
" samples = pm.sample(1000, njobs=1)\n",
374+
" samples = pm.sample(2000, njobs=1)\n",
301375
"\n",
302376
"pm.plot_posterior(samples);"
303377
]
304378
},
379+
{
380+
"cell_type": "markdown",
381+
"metadata": {},
382+
"source": [
383+
"**For discussion:** Interpret the posterior distribution. What would you tell the non-technical manager of your growth team about the CTR?"
384+
]
385+
},
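One concrete way to answer the manager is with a credible interval. Because this model is conjugate, you can cross-check PyMC3's output analytically — a sketch assuming `N = 100` impressions and `k = 17` clicks (illustrative values, not the notebook's data):

```python
from scipy import stats

N, k = 100, 17                          # assumed impressions / clicks
posterior = stats.beta(1 + k, 1 + N - k)

lo, hi = posterior.interval(0.94)       # central 94% credible interval
print(f"The CTR is most likely between {lo:.1%} and {hi:.1%}.")
```

A statement like "there's a 94% probability the true CTR lies in this range" is usually more useful to a non-technical audience than a point estimate alone.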
305386
{
306387
"cell_type": "markdown",
307388
"metadata": {},
@@ -313,7 +394,7 @@
313394
"cell_type": "markdown",
314395
"metadata": {},
315396
"source": [
316-
"In this exercise, you'll calculate the posterior mean beak depth of Galapagos finches."
397+
"In this exercise, you'll calculate the posterior mean beak depth of Galapagos finches of a given species. First you'll load the data and subset it by species:"
317398
]
318399
},
319400
{
@@ -328,16 +409,38 @@
328409
"df_scandens = df_12.loc[df_12['species'] == 'scandens']"
329410
]
330411
},
412+
{
413+
"cell_type": "markdown",
414+
"metadata": {},
415+
"source": [
416+
"To specify the full probability model, you need\n",
417+
"- a likelihood function for the data &\n",
418+
"- priors for all unknowns.\n",
419+
"\n",
420+
"What is the likelihood here? Plotting the measurements below shows that they look approximately Gaussian/normal, so you'll use a normal likelihood $y_i\\sim \\mathcal{N}(\\mu, \\sigma^2)$. The unknowns here are the mean $\\mu$ and the standard deviation $\\sigma$, and we'll use weakly informative priors on both:\n",
421+
"- a normal prior for $\\mu$ with mean $10$ and standard deviation $5$;\n",
422+
"- a uniform prior for $\\sigma$ bounded between $0$ and $10$.\n",
423+
"\n",
424+
"We could also discuss biological reasons for these priors, but you can test that the posteriors are relatively robust to the choice of prior here, due to the amount of data."
425+
]
426+
},
427+
{
428+
"cell_type": "code",
429+
"execution_count": null,
430+
"metadata": {},
431+
"outputs": [],
432+
"source": [
433+
"sns.distplot(df_fortis['blength']);"
434+
]
435+
},
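As an aside on prior robustness: if you fix $\sigma$, the normal prior on $\mu$ is conjugate and the update can be done by hand, which makes it easy to see the data swamping the prior. A sketch on simulated measurements (all values are made up for illustration, not the finch data):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(9.5, 1.0, size=200)   # simulated beak depths in mm, not real data

m0, s0 = 10.0, 5.0                   # prior on mu, as in the text
sigma = 1.0                          # treat sigma as known for this sketch

# Conjugate normal-normal update for mu
post_prec = 1 / s0**2 + len(y) / sigma**2
post_mean = (m0 / s0**2 + y.sum() / sigma**2) / post_prec
post_sd = post_prec ** -0.5

# With 200 points the posterior mean sits essentially on the sample mean
print(post_mean, post_sd)
```

The prior contributes precision $1/s_0^2 = 0.04$ against the data's $n/\sigma^2 = 200$, so the posterior barely moves from the sample mean — the sense in which the result is robust to the prior.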
331436
{
332437
"cell_type": "code",
333438
"execution_count": null,
334439
"metadata": {},
335440
"outputs": [],
336441
"source": [
337442
"with pm.Model() as model:\n",
338-
" \"\"\"\n",
339-
" The priors for each group.\n",
340-
" \"\"\"\n",
443+
" # Prior for mean & standard deviation\n",
341444
" μ_1 = pm.Normal('μ_1', mu=10, sd=5)\n",
342445
"    σ_1 = pm.Uniform('σ_1', lower=0, upper=10)\n",
343446
" # Gaussian Likelihood\n",
@@ -415,7 +518,7 @@
415518
"outputs": [],
416519
"source": [
417520
"with model:\n",
418-
" samples = pm.sample(1000, njobs=1)\n",
521+
" samples = pm.sample(2000, njobs=1)\n",
419522
"pm.plot_posterior(samples);"
420523
]
421524
},
