Commit 01faef4 (parent: 23ac1d0)
Commit message: [skip ci] Ai improvements

2 files changed: 150 additions & 44 deletions

File: 30_introduction_data_exploration.ipynb (121 additions & 44 deletions)
@@ -20,6 +20,8 @@
     " - [Exercise reading in data](#Exercise-reading-in-data)\n",
     " - [Playground](#Playground)\n",
     " - [Data exploration](#Data-exploration)\n",
+    " - [Guided Exploration Questions](#Guided-Exploration-Questions)\n",
+    " - [Exercise: Explore the Dataset](#Exercise:-Explore-the-Dataset)\n",
     " - [Building the plot from scratch](#Building-the-plot-from-scratch)\n",
     " - [Finding the limits](#Finding-the-limits)\n",
     " - [Cleaning missing values](#Cleaning-missing-values)\n",
@@ -113,7 +115,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-    "In the graph above, the size corresponds to the population of each country and the values of GDP per capital and life expectency along with the name of the country can be seen by hovering over the cursor on the bubbles.\n",
+    "In the graph above, the bubble size corresponds to the population of each country, and the values of GDP per capita and life expectancy, along with the name of the country, can be seen by hovering the cursor over the bubbles.\n",
     "Imagine the work you'd have to put in to build this without any libraries.\n",
     "\n",
     "This animated bubble chart can convey a great deal of information since it can accommodate up to *six variables* in total, namely:\n",
@@ -127,7 +129,7 @@
     "\n",
     "Using the function `bubbleplot` from the module [`bubbly` (bubble charts with plotly)](https://github.com/AashitaK/bubbly): see references for all source material.\n",
     "Our goal is to recreate this visualization but with a different dataset.\n",
-    "For this, we have already preloaded a data file in the folder `data/introtolibraries/World-happiness-report-updated_2024.csv` which is an open data record which can be found on kaggle.com."
+    "For this, we have already preloaded a data file at `data/data_exploration/World-happiness-report-updated_2024.csv`, an open data record that can be found on kaggle.com."
     ]
    },
    {
@@ -455,6 +457,83 @@
     " print(\"\\n\")"
     ]
    },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Guided Exploration Questions\n",
+     "\n",
+     "Now that you've seen some basic exploration functions, try answering the following questions about the happiness dataset using the playground cell above.\n",
+     "These are common questions you would ask when encountering a new dataset for the first time.\n",
+     "\n",
+     "1. **How many rows and columns** does the dataset have? (Hint: use `.shape`)\n",
+     "2. **How many unique countries** are in the dataset? (Hint: use `.nunique()` on the right column)\n",
+     "3. **Which column has the most missing values?** How many are missing? (Hint: use `.isna().sum()`)\n",
+     "4. **Are there any duplicate rows?** (Hint: use `.duplicated().sum()`)\n",
+     "5. **How many data points** does each year have? Is coverage consistent across years? (Hint: use `.groupby('year').size()`)\n",
+     "6. **Which 5 countries** had the highest `Life Ladder` (happiness) score in 2023? (Hint: filter by year, then use `.nlargest()`)\n",
+     "7. **What is the mean `Generosity`** across all years? Does it vary much per year? (Hint: use `.groupby('year')['Generosity'].mean()`)\n",
+     "\n",
+     "Feel free to experiment with other columns and functions from the reference table above. A good data exploration habit is to always check shape, missing values, duplicates, and distributions before diving into analysis."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "### Exercise: Explore the Dataset\n",
+     "\n",
+     "Now let's formalize some of these exploration steps into a function.\n",
+     "Complete the function below to return a dictionary with the following keys:\n",
+     "\n",
+     "1. `'n_rows'` — the number of rows in the DataFrame (as an `int`).\n",
+     "2. `'n_columns'` — the number of columns in the DataFrame (as an `int`).\n",
+     "3. `'n_countries'` — the number of unique countries (as an `int`). Use the `'Country name'` column.\n",
+     "4. `'n_duplicates'` — the number of duplicate rows (as an `int`).\n",
+     "5. `'column_with_most_nans'` — the name of the column with the most missing values (as a `str`).\n",
+     "6. `'happiest_country_2023'` — the name of the country with the highest `'Life Ladder'` score in year 2023 (as a `str`)."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "%reload_ext tutorial.tests.testsuite"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "%%ipytest\n",
+     "\n",
+     "import pandas as pd\n",
+     "import numpy as np\n",
+     "def solution_explore_dataset(happiness_df: pd.DataFrame) -> dict:\n",
+     "    \"\"\"Explores the happiness dataset and returns key summary statistics.\n",
+     "\n",
+     "    1. Count the number of rows and columns using .shape\n",
+     "    2. Count the number of unique countries in the 'Country name' column\n",
+     "    3. Count the number of duplicate rows\n",
+     "    4. Find the column with the most NaN values\n",
+     "    5. Find the happiest country in 2023 (highest 'Life Ladder')\n",
+     "\n",
+     "    Args:\n",
+     "        happiness_df : DataFrame containing the happiness data\n",
+     "\n",
+     "    Returns:\n",
+     "        dict with keys: 'n_rows', 'n_columns', 'n_countries', 'n_duplicates',\n",
+     "        'column_with_most_nans', 'happiest_country_2023'\n",
+     "    \"\"\"\n",
+     "    # Your code starts here\n",
+     "    return\n",
+     "    # Your code ends here"
+    ]
+   },
    {
     "cell_type": "markdown",
     "metadata": {
@@ -544,7 +623,7 @@
     "\n",
     "Pandas provides several flexible methods for handling missing data, represented as `NaN`.\n",
     "You can identify missing values using `.isna()` or `.isnull()`, and then choose a strategy: `.dropna()` removes rows or columns with missing values, while `.fillna()` replaces them. `.ffill()` propagates the last valid observation forward to fill in the missing values.\n",
-    "For example, `df.ffill` will replace a `NaN` with the value from the previous row which had a non-`Nan`value.\n",
+    "For example, `df.ffill()` will replace a `NaN` with the value from the previous row that had a non-`NaN` value.\n",
     "You can also fill it with a specific value (like the mean, median, or constant).\n",
     "For time series data, you might use interpolation with `.interpolate()` to fill gaps.\n",
     "The best approach depends on the nature of the data and the goal of your analysis.\n",
@@ -559,6 +638,33 @@
     "> Depending on your data and use case, one may be more appropriate than the other — or you might even combine both to fill gaps from both directions."
     ]
    },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "Data cleaning is crucial in data analysis, but several pitfalls exist.\n",
+     "Here's a summary of common mistakes and how to avoid them:\n",
+     "\n",
+     "1. Incorrectly Handling Missing Values.\n",
+     "   Replacing `NaN` values with the mean can be misleading, especially with skewed data.\n",
+     "   Consider using the median or more advanced imputation techniques, and understand the reason for missingness.\n",
+     "   Pandas tools like `fillna()`, `dropna()`, and `interpolate()` are essential here.\n",
+     "2. Removing Outliers Without Investigation.\n",
+     "   Avoid automatically deleting outliers.\n",
+     "   Visualize the data to determine if outliers are genuine extreme values or errors.\n",
+     "   If genuine, they may be important for analysis. Use boolean indexing with summary statistics to handle them in Pandas.\n",
+     "3. Ignoring Data Types.\n",
+     "   Ensure columns have the correct data type.\n",
+     "   Use `df.info()` to check and convert columns with `pd.to_numeric()`, `pd.to_datetime()`, or `astype()`.\n",
+     "4. Not Handling Duplicates Carefully.\n",
+     "   Investigate the source of duplicate rows before removing them.\n",
+     "   They may indicate data entry errors or represent significant repeated measurements.\n",
+     "   Pandas provides `duplicated()` and `drop_duplicates()` for this purpose.\n",
+     "5. Applying Transformations Incorrectly.\n",
+     "   Scaling data without considering outliers can lead to issues.\n",
+     "   If scaling is necessary, consider robust scalers (like `RobustScaler` from `scikit-learn`) that are less affected by outliers."
+    ]
+   },
    {
     "cell_type": "markdown",
     "metadata": {},
@@ -614,33 +720,6 @@
     "    # Your code ends here"
     ]
    },
-   {
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "Data cleaning is crucial in data analysis, but several pitfalls exist.\n",
-     "Here's a summary of common mistakes and how to avoid them:\n",
-     "\n",
-     "1. Incorrectly Handling Missing Values.\n",
-     "   Replacing `NaN` values with the mean can be misleading, especially with skewed data.\n",
-     "   Consider using the median or more advanced imputation techniques, and understand the reason for missingness.\n",
-     "   Pandas tools like `fillna()`, `dropna()`, and `interpolate()` are essential here.\n",
-     "2. Removing Outliers Without Investigation.\n",
-     "   Avoid automatically deleting outliers.\n",
-     "   Visualize the data to determine if outliers are genuine extreme values or errors.\n",
-     "   If genuine, they may be important for analysis. Use boolean indexing with summary statistics to handle them in Pandas.\n",
-     "3. Ignoring Data Types.\n",
-     "   Ensure columns have the correct data type.\n",
-     "   Use `df.info()` to check and convert columns with `pd.to_numeric()`, `pd.to_datetime()`, or `astype()`.\n",
-     "4. Not Handling Duplicates Carefully.\n",
-     "   Investigate the source of duplicate rows before removing them.\n",
-     "   They may indicate data entry errors or represent significant repeated measurements.\n",
-     "   Pandas provides `duplicated()` and `drop_duplicates()` for this purpose.\n",
-     "5. Applying Transformations Incorrectly.\n",
-     "   Scaling data without considering outliers can lead to issues.\n",
-     "   If scaling is necessary, consider robust scalers (like `RobustScaler` from `scikit-learn`) that are less affected by outliers."
-    ]
-   },
    {
     "cell_type": "markdown",
     "metadata": {},
@@ -651,8 +730,8 @@
     "To add a regional indicator, we'll need another dataset that maps countries to their respective regions.\n",
     "We can then merge this data with our main happiness dataframe based on the 'Country name' column.\n",
     "\n",
-    "Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/data_exploration/` directory with columns `'Country name'` and `'Region indicator'`.\n",
-    "Then we could merge the dataframes on the country name and get a regional indicator for all the countries.\n",
+    "We have a CSV file named `country_region_mapping.csv` in the `data/data_exploration/` directory with columns `'Country name'` and `'Regional indicator'`.\n",
+    "We can merge the dataframes on the country name and get a regional indicator for all the countries.\n",
     "\n",
     "Let's explore the `pd.merge` function for that.\n",
     "Merging tables comes from SQL table joins; if you are not familiar with those, for now keep in mind that we want a **left merge**, with the happiness table as the left table and the region mapping as the right table, and that we merge **on** a column they have in common.\n",
@@ -1085,22 +1164,20 @@
     "y_column = 'Life Ladder'\n",
     "description_column = 'Country name'\n",
     "time_column = 'year'\n",
-    "# Set the layout\n",
-    "figure = set_layout(x_title='Freedom to make life choices', y_title='Life Ladder',\n",
-    "                    title='Happiness Indicators', x_logscale=False, y_logscale=False, \n",
-    "                    show_slider=True, slider_scale=years, show_button=True, show_legend=False, \n",
-    "                    height=650)\n",
     "\n",
     "# Define the new variable\n",
     "bubble_size_column = 'Resized Log GDP per capita'\n",
     "category_column = 'Regional indicator'\n",
     "\n",
-    "\n",
-    "\n",
     "# Make the grid\n",
     "years = dataset[time_column].unique()\n",
     "years.sort()\n",
-    "    \n",
+    "\n",
+    "# Set the layout\n",
+    "figure = set_layout(x_title='Freedom to make life choices', y_title='Life Ladder',\n",
+    "                    title='Happiness Indicators', x_logscale=False, y_logscale=False, \n",
+    "                    show_slider=True, slider_scale=years, show_button=True, show_legend=False, \n",
+    "                    height=650)\n",
     "\n",
     "# Add the base frame\n",
     "year = min(years)\n",
@@ -1142,8 +1219,6 @@
     "\n",
     "\n",
     "# Add time frames\n",
-    "years = dataset[time_column].unique()\n",
-    "years.sort()\n",
     "figure['frames'] = [frame_by_year_with_size(dataset, year, x_column, y_column, description_column) for year in years]\n",
     "\n",
     "\n",
@@ -1226,10 +1301,12 @@
     "    year : Year to plot\n",
     "    x_column : Column name for x-axis\n",
     "    y_column : Column name for y-axis\n",
-    "    description_column : Column name for text\n",
+    "    description_column : Column name for hover text\n",
+    "    category_column : Column name for the category to split traces by\n",
+    "    bubble_size_column : Column name for the bubble size\n",
     "\n",
     "    Returns:\n",
-    "    - Dictionary containing the trace and frame information\n",
+    "    - Dictionary with 'data' (list of traces per category) and 'name' (year as string)\n",
     "    \"\"\"\n",
     "    # Your code starts here\n",
     "    return\n",

File: tutorial/tests/test_30_introduction_data_exploration.py (29 additions & 0 deletions)
@@ -35,6 +35,35 @@ def test_read_in_dataframe(input_arg, function_to_test):
     assert happiness_df.equals(result)
 
 
+def reference_explore_dataset(happiness_df: pd.DataFrame) -> dict:
+    n_rows, n_columns = happiness_df.shape
+    n_countries = happiness_df["Country name"].nunique()
+    n_duplicates = int(happiness_df.duplicated().sum())
+    column_with_most_nans = happiness_df.isna().sum().idxmax()
+    happiest_2023 = happiness_df[happiness_df["year"] == 2023].nlargest(1, "Life Ladder")["Country name"].iloc[0]
+    return {
+        "n_rows": n_rows,
+        "n_columns": n_columns,
+        "n_countries": n_countries,
+        "n_duplicates": n_duplicates,
+        "column_with_most_nans": column_with_most_nans,
+        "happiest_country_2023": happiest_2023,
+    }
+
+
+@pytest.mark.parametrize("input_arg", input_args)
+def test_explore_dataset(input_arg, function_to_test):
+    happiness_df = reference_read_in_dataframe(
+        "data/data_exploration/World-happiness-report-updated_2024.csv"
+    )
+    ref = reference_explore_dataset(happiness_df)
+    sol = function_to_test(happiness_df)
+    assert isinstance(sol, dict), "Your function should return a dict; did you forget the return statement?"
+    for key in ref:
+        assert key in sol, f"Missing key '{key}' in the returned dictionary"
+        assert sol[key] == ref[key], f"Value for '{key}' is {sol[key]}, expected {ref[key]}"
+
+
 def reference_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:
     # Define the range of possible years
     min_year = happiness_df["year"].min()