[FINALIZED] notebook 5 is ready for prime time

ericmjl · ericmjl · commit 8f0c4b7d4d20 · 2018-07-07T16:31:10.000-04:00
diff --git a/notebooks/05-instructor-two-group-comparison-finches.ipynb b/notebooks/05-instructor-two-group-comparison-finches.ipynb
@@ -12,6 +12,7 @@
     "import matplotlib.pyplot as plt\n",
     "import seaborn as sns\n",
     "import numpy as np\n",
+    "from utils import ECDF\n",
     "\n",
     "%load_ext autoreload\n",
     "%autoreload 2\n",
@@ -37,9 +38,7 @@
     "\n",
     "[data]: (https://datadryad.org/resource/doi:10.5061/dryad.g6g3h). \n",
     "\n",
-    "### Exercise\n",
-    "\n",
-    "Load the data."
+    "Let's get started and load the data."
    ]
   },
   {
@@ -91,6 +90,19 @@
     "unknown_filter = df['species'] == 'unknown'"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Recreate the estimation model for finch beak depths. A few things to note:\n",
+    "\n",
+    "- Practice using numpy-like fancy indexing.\n",
+    "- Difference of means & effect size are optional.\n",
+    "- Feel free to play around with other priors."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -237,13 +249,13 @@
     "# Extract just the fortis samples.\n",
     "fortis_samples = samples['likelihood'][:, df[fortis_filter].index].flatten()\n",
     "# Compute the ECDF for the fortis samples.\n",
-    "x_s, y_s = ecdf(fortis_samples)\n",
+    "x_s, y_s = ECDF(fortis_samples)\n",
     "ax_fortis.plot(x_s, y_s, label='samples')\n",
     "\n",
     "# Extract just the fortis measurements.\n",
     "fortis_data = df[fortis_filter]['beak_depth']\n",
     "# Compute the ECDF for the fortis samples\n",
-    "x, y = ecdf(fortis_data)\n",
+    "x, y = ECDF(fortis_data)\n",
     "ax_fortis.plot(x, y, label='data')\n",
     "\n",
     "ax_fortis.legend()\n",
@@ -252,13 +264,13 @@
     "# Extract just the scandens samples.\n",
     "scandens_samples = samples['likelihood'][:, df[scandens_filter].index].flatten()\n",
     "# Compute the ECDF for the scandens samples\n",
-    "x_s, y_s = ecdf(scandens_samples)\n",
+    "x_s, y_s = ECDF(scandens_samples)\n",
     "ax_scandens.plot(x_s, y_s, label='samples')\n",
     "\n",
     "# Extract just the scandens measurements.\n",
     "scandens_data = df[scandens_filter]['beak_depth']\n",
     "# Compute the ECDF for the scanens samples\n",
-    "x, y = ecdf(scandens_data)\n",
+    "x, y = ECDF(scandens_data)\n",
     "\n",
     "ax_scandens.plot(x, y, label='data')\n",
     "ax_scandens.legend()\n",
diff --git a/notebooks/05-student-two-group-comparison-finches.ipynb b/notebooks/05-student-two-group-comparison-finches.ipynb
@@ -0,0 +1,306 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import janitor as jn\n",
+    "import pymc3 as pm\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "import numpy as np\n",
+    "from utils import ECDF\n",
+    "\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2\n",
+    "%matplotlib inline\n",
+    "%config InlineBackend.figure_format = 'retina'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Darwin's Finches\n",
+    "\n",
+    "A research group has taken measurements of the descendants of the finches that Charles Darwin observed when he postulated the theory of evolution.\n",
+    "\n",
+    "We will be using Bayesian methods to analyze this data, specifically answering the question of how quantitatively different two species of birds' beaks are.\n",
+    "\n",
+    "## Data Credits\n",
+    "\n",
+    "The Darwin's finches datasets come from the paper, [40 years of evolution. Darwin's finches on Daphne Major Island][data]. \n",
+    "\n",
+    "One row of data has been added for pedagogical purposes.\n",
+    "\n",
+    "[data]: (https://datadryad.org/resource/doi:10.5061/dryad.g6g3h). \n",
+    "\n",
+    "Let's get started and load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from data import load_finches_2012\n",
+    "df = load_finches_2012()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "View a random sample of the data to get a feel for the structure of the dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code below\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note:** I have added one row of data, simulating the discovery of an \"unknown\" species of finch for which beak measurements have been taken.\n",
+    "\n",
+    "For pedagogical brevity, we will analyze only beak depth during the class. However, I would encourage you to perform a similar analysis for beak length as well."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# These are filters that we can use later on.\n",
+    "fortis_filter = df['species'] == 'fortis'\n",
+    "scandens_filter = df['species'] == 'scandens'\n",
+    "unknown_filter = df['species'] == 'unknown'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Recreate the estimation model for finch beak depths. A few things to note:\n",
+    "\n",
+    "- Practice using numpy-like fancy indexing.\n",
+    "- Difference of means & effect size are optional."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with pm.Model() as beak_depth_model:\n",
+    "\n",
+    "    # Your model defined here.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Perform MCMC sampling to estimate the posterior distribution of each parameter."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Diagnose whether the sampling has converged or not using trace plots."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Visualize the posterior distribution over the parameters using the forest plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "Visualize the posterior distribution of the means. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Your code below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Discuss\n",
+    "\n",
+    "- Is the posterior distribution of beaks for the unknown species reasonable?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exericse\n",
+    "\n",
+    "Perform a posterior predictive check to visually diagnose whether the model describes the data generating process well or not."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "samples = pm.sample_ppc(trace, model=beak_depth_model, samples=2000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Hint: Each column in the samples (key: \"likelihood\") corresponds to simulated measurements of each finch in the dataset. We can use fancy indexing along the columns (axis 1) to select out simulated measurements for each category, and then flatten the resultant array to get the full estimated distribution of values for each class."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig = plt.figure()\n",
+    "ax_fortis = fig.add_subplot(2, 1, 1)\n",
+    "ax_scandens = fig.add_subplot(2, 1, 2, sharex=ax_fortis)\n",
+    "\n",
+    "# Extract just the fortis samples.\n",
+    "\n",
+    "# Compute the ECDF for the fortis samples.\n",
+    "\n",
+    "ax_fortis.plot(x_s, y_s, label='samples')\n",
+    "\n",
+    "# Extract just the fortis measurements.\n",
+    "\n",
+    "# Compute the ECDF for the fortis measurements.\n",
+    "\n",
+    "ax_fortis.plot(x, y, label='data')\n",
+    "\n",
+    "ax_fortis.legend()\n",
+    "ax_fortis.set_title('fortis')\n",
+    "\n",
+    "# Extract just the scandens samples.\n",
+    "\n",
+    "# Compute the ECDF for the scandens samples\n",
+    "\n",
+    "ax_scandens.plot(x_s, y_s, label='samples')\n",
+    "\n",
+    "# Extract just the scandens measurements.\n",
+    "\n",
+    "# Compute the ECDF for the scanens measurements.\n",
+    "\n",
+    "\n",
+    "ax_scandens.plot(x, y, label='data')\n",
+    "ax_scandens.legend()\n",
+    "ax_scandens.set_title('scandens')\n",
+    "ax_scandens.set_xlabel('beak length')\n",
+    "\n",
+    "sns.despine()\n",
+    "plt.tight_layout()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "1. NumPy-like fancy indexing lets us write models in a concise fashion.\n",
+    "1. Posterior estimates can show up as being \"unreasonable\", \"absurd\", or at the minimum, counter-intuitive, if we do not impose the right set of assumptions on the model.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "bayesian-modelling-tutorial",
+   "language": "python",
+   "name": "bayesian-modelling-tutorial"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}