[skip ci] Fix for suggestions

Snowwpanda · Snowwpanda · commit 23ac1d05471e · 2026-04-14T22:11:06.000+02:00
- Updated the path for the happiness CSV file in the data exploration notebook to reflect the correct directory structure.
- Added hints for visualizing year values in matplotlib plots.
- Improved documentation for data cleaning functions, clarifying the merging process and filling strategies.
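- Added assertions to the test suite that check return types (`pd.DataFrame` or `dict`) and the expected columns or keys before comparing against the reference, so a missing `return` statement produces a readable failure message.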
- Set the legend's `itemsizing` to `constant` in the helper module's `set_layout` so that legend markers have a consistent size in visualizations.
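
For reviewers, here is a minimal sketch of the scaffold-and-merge cleaning approach that the updated exercise text describes. It is not the notebook's reference solution: the function name `clean_dataset_sketch` is hypothetical, the year range is derived from the data rather than fixed to 2005 through 2023, and all columns other than `'Country name'` and `'year'` are assumed to be numeric.

```python
import pandas as pd


def clean_dataset_sketch(happiness_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical illustration of the scaffold + left-merge + fill steps."""
    countries = happiness_df["Country name"].unique()
    # Assumption: use the min/max year found in the data as the full range.
    years = range(int(happiness_df["year"].min()), int(happiness_df["year"].max()) + 1)

    # 1. Scaffold with one row per (country, year) combination.
    scaffold = pd.DataFrame(
        [(country, year) for country in countries for year in years],
        columns=["Country name", "year"],
    )
    # Left-merge the original data onto the scaffold; missing rows become NaN.
    merged = scaffold.merge(happiness_df, on=["Country name", "year"], how="left")

    # 2. Fill missing values in the first year with 1 (value columns only).
    value_cols = [c for c in merged.columns if c not in ("Country name", "year")]
    is_first_year = merged["year"] == merged["year"].min()
    merged.loc[is_first_year, value_cols] = merged.loc[is_first_year, value_cols].fillna(1)

    # 3. Forward-fill the remaining years. Because every country's first year
    #    is now filled, values cannot leak from one country into the next.
    merged = merged.sort_values(by=["Country name", "year"]).ffill()
    return merged.reset_index(drop=True)


example = pd.DataFrame(
    {
        "Country name": ["A", "A", "B"],
        "year": [2005, 2007, 2006],
        "Life Ladder": [5.0, 6.0, 4.0],
    }
)
cleaned = clean_dataset_sketch(example)
print(cleaned)
```

In the toy input, country `A` is missing 2006 and country `B` is missing 2005 and 2007; after cleaning, each country has one row per year.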
diff --git a/30_introduction_data_exploration.ipynb b/30_introduction_data_exploration.ipynb
@@ -299,7 +299,7 @@
     "\n",
     "In the cell below you should write the code that solves the first exercise:\n",
     "\n",
-    "  -  Use the `path_to_happiness` which will be `data/plotly_intro/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n",
+    "  -  Use the `path_to_happiness` which will be `data/data_exploration/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n",
     "  -  Read in the CSV into a dataframe and output it as `pd.DataFrame`\n",
     "  -  Because of how the `.csv`file is formated you must ensure that the encoding is latin1 `encoding='latin1'`"
    ]
@@ -492,7 +492,13 @@
     "This can be done using a histogram or a line plot to visualize the frequency or trend of the time values over the range.\n",
     "\n",
     "We want to use `matplotlib.pyplot` for displaying the histogram because it has a useful function hist which does exactly that. \n",
-    "To do that we use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram."
+    "To do that we use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram.\n",
+    "\n",
+    "> **Hint:** If the x-axis displays year values as floats (e.g. 2010.0), you can force integer ticks with:\n",
+    "> ```python\n",
+    "> from matplotlib.ticker import MaxNLocator\n",
+    "> plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n",
+    "> ```"
    ]
   },
   {
@@ -522,6 +528,11 @@
     "plt.xlabel('Year')\n",
     "plt.ylabel('Frequency')\n",
     "plt.grid(True)\n",
+    "\n",
+    "# Hint: If the x-axis shows float values for years, you can force integer ticks:\n",
+    "# from matplotlib.ticker import MaxNLocator\n",
+    "# plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n",
+    "\n",
     "plt.show()"
    ]
   },
@@ -541,7 +552,11 @@
     "For this step, we will try to forwardfill the dataframe:\n",
     "```python\n",
     "    cleaned_happiness = cleaned_happiness.sort_values(by=['Country name', 'year']).ffill()\n",
-    "```"
+    "```\n",
+    "\n",
+    "> **Hint:** Pandas also provides `.bfill()` (backward fill), which propagates the *next* valid observation backward.\n",
+    "> While `.ffill()` fills a `NaN` with the value from the previous row, `.bfill()` fills it with the value from the next row.\n",
+    "> Depending on your data and use case, one may be more appropriate than the other — or you might even combine both to fill gaps from both directions."
    ]
   },
   {
@@ -554,8 +569,8 @@
     "Complete the function below to:\n",
     "\n",
     "1. Fill in missing years for every country (so we have an entry for every year between 2005 and 2023 and every country).\n",
-    "   Do this by initializing a DataFrame with `pd.DataFrame()` with a list.\n",
-    "   Then left-merge the happiness dataframe to it with `pd.merge()`.\n",
+    "   First, use a list comprehension to create a list of `(country, year)` tuples for all country/year combinations. Then pass that list to `pd.DataFrame()` with `columns=['Country name', 'year']` to build a complete scaffold DataFrame.\n",
+    "   Finally, use the `.merge()` method on this scaffold to left-merge the original happiness data onto it (matching on `'Country name'` and `'year'`). This ensures every country has a row for every year, with `NaN` where data was missing.\n",
     "2. Fill all missing values in the year 2005 with the value 1.\n",
     "   Use the `.fillna()` function.\n",
     "3. Forwardfill all the remaining years with the function `.ffill()`.\n",
@@ -584,7 +599,7 @@
     "def solution_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:\n",
     "    \"\"\"Cleans the dataset by adding missing year and country values\n",
     "\n",
-    "        1. Add in missing years for every country\n",
+    "        1. Create a DataFrame with all (country, year) combinations, then left-merge the happiness data onto it\n",
     "        2. Fill the minimum year with values of 1\n",
     "        3. Forward fill the rest of the years\n",
     "\n",
@@ -636,7 +651,7 @@
     "To add a regional indicator, we'll need another dataset that maps countries to their respective regions.\n",
     "We can then merge this data with our main happiness dataframe based on the 'Country name' column.\n",
     "\n",
-    "Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/plotly_intro/` directory with columns `'Country name'` and `'Region indicator'`.\n",
+    "Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/data_exploration/` directory with columns `'Country name'` and `'Region indicator'`.\n",
     "Then we could merge the dataframes on the country name and get a regional indicator for all the countries.\n",
     "\n",
     "Let's explore the `pd.merge` function for that.\n",
@@ -698,8 +713,8 @@
     "def solution_add_regional_indicator(cleaned_happiness_df: pd.DataFrame, region_df: pd.DataFrame) -> pd.DataFrame:\n",
     "    \"\"\"Adds a regional indicator to the dataset\n",
     "\n",
-    "        1. Merge the cleaned_happiness_df with region_df on the 'Country name' and 'year' columns\n",
-    "        2. Fill the missing values in the 'Region indicator' column with 'Unknown'\n",
+    "        1. Merge the cleaned_happiness_df with region_df on the 'Country name' column\n",
+    "        2. Fill the missing values in the 'Regional indicator' column with 'Unknown'\n",
     "\n",
     "    Args:\n",
     "        cleaned_happiness_df : DataFrame containing the happiness data\n",
@@ -773,7 +788,6 @@
     "x_column = 'Freedom to make life choices'\n",
     "y_column = 'Life Ladder'\n",
     "description_column = 'Country name'\n",
-    "# time_column = 'year'\n",
     "\n",
     "\n",
     "\n",
@@ -831,7 +845,6 @@
     "x_column = 'Freedom to make life choices'\n",
     "y_column = 'Life Ladder'\n",
     "description_column = 'Country name'\n",
-    "# time_column = 'year'\n",
     "figure = get_scatter_figure(dataset, x_column, y_column, description_column)\n",
     "\n",
     "def frame_by_year(dataset, year, x_column, y_column, description_column):\n",
@@ -1050,7 +1063,7 @@
     "# append to final_happiness_df\n",
     "dataset = pd.merge(complete_happiness_df, resized_log_gdp_df, on=['Country name', 'year'], how='left')\n",
     "# Check out year 2010\n",
-    "dataset[dataset['year'] == 2010].head(10)\n"
+    "dataset[dataset['year'] == 2010].head(10)"
    ]
   },
   {
@@ -1129,6 +1142,8 @@
     "\n",
     "\n",
     "# Add time frames\n",
+    "years = dataset[time_column].unique()\n",
+    "years.sort()\n",
     "figure['frames'] = [frame_by_year_with_size(dataset, year, x_column, y_column, description_column) for year in years]\n",
     "\n",
     "\n",
@@ -1267,13 +1282,13 @@
     "However, this might not be feasible if you need it in other places and is generally not the best solution. \n",
     "Last but not least you can try to fix it yourself.\n",
     "\n",
-    "So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data.plotly_intro`.\n",
+    "So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data/data_exploration`.\n",
     "It is quite a short library so quite managable.\n",
-    "Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/plotly_intro/bubbly.py` until the same code below compiles.\n",
+    "Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/data_exploration/bubbly.py` until the same code below compiles.\n",
     "\n",
     "Note: You will need to restart the kernel after applying changes to the packages.\n",
     "\n",
-    "(If you are interested in a solution, we have a fixed version under `tutorial.my_bubbly.py`, feel free to check the differences.)"
+    "(If you are interested in a solution, we have a fixed version under `tutorial/my_bubbly.py`, feel free to check the differences.)"
    ]
   },
   {
diff --git a/tutorial/data_exploration_helper.py b/tutorial/data_exploration_helper.py
@@ -380,6 +380,7 @@ def set_layout(
     figure["layout"]["title"] = title
     figure["layout"]["hovermode"] = "closest"
     figure["layout"]["showlegend"] = show_legend
+    figure["layout"]["legend"] = {"itemsizing": "constant"}
     figure["layout"]["margin"] = {"b": 50, "t": 50, "pad": 5}
 
     if width:
diff --git a/tutorial/tests/test_30_introduction_data_exploration.py b/tutorial/tests/test_30_introduction_data_exploration.py
@@ -29,8 +29,10 @@ def test_read_in_dataframe(input_arg, function_to_test):
     # Read in the data
     happiness_df = reference_read_in_dataframe(path_to_happiness)
 
+    result = function_to_test(path_to_happiness)
+    assert isinstance(result, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
     # Check if the two DataFrames are equal
-    assert happiness_df.equals(function_to_test(path_to_happiness))
+    assert happiness_df.equals(result)
 
 
 def reference_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:
@@ -71,6 +73,8 @@ def test_clean_dataset(input_arg, function_to_test):
 
     clean_ref = reference_clean_dataset(hapiness_df)
     clean_sol = function_to_test(hapiness_df)
+    assert isinstance(clean_sol, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
+    assert "Country name" in clean_sol.columns and "year" in clean_sol.columns, "The output should contain 'Country name' and 'year' columns"
     clean_ref_sorted = clean_ref.sort_values(by=["Country name", "year"]).reset_index(
         drop=True
     )
@@ -126,6 +130,7 @@ def test_add_regional_indicator(input_arg, function_to_test):
     clean_ref = reference_add_regional_indicator(cleaned_happiness_df, region_df)
     clean_sol = function_to_test(cleaned_happiness_df, region_df)
 
+    assert isinstance(clean_sol, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
     # Check if the two DataFrames are equal
     assert clean_ref.equals(clean_sol)
 
@@ -237,7 +242,9 @@ def test_frames_with_category(input_arg, function_to_test):
         bubble_size_column,
     )
 
-    # Check if the two DataFrames are equal
+    assert isinstance(clean_sol, dict), "Your function should return a dict, but it returned None. Did you forget the return statement?"
+    assert "data" in clean_sol and "name" in clean_sol, "The returned dict should have 'data' and 'name' keys"
+    # Check if the two are equal
     assert clean_ref == clean_sol
 
     from plotly.offline import iplot