Commit 23ac1d0

[skip ci] Fix for suggestions
- Updated the path for the happiness CSV file in the data exploration notebook to reflect the correct directory structure.
- Added hints for visualizing year values in matplotlib plots.
- Improved documentation for data cleaning functions, clarifying the merging process and filling strategies.
- Enhanced test cases to assert the return types and structure of DataFrames in the testing suite for better validation.
- Adjusted layout settings in the helper module to ensure consistent legend sizing in visualizations.
1 parent 23572e5 commit 23ac1d0

3 files changed

Lines changed: 40 additions & 17 deletions

30_introduction_data_exploration.ipynb

Lines changed: 30 additions & 15 deletions
@@ -299,7 +299,7 @@
 "\n",
 "In the cell below you should write the code that solves the first exercise:\n",
 "\n",
-" - Use the `path_to_happiness` which will be `data/plotly_intro/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n",
+" - Use the `path_to_happiness` which will be `data/data_exploration/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n",
 " - Read in the CSV into a dataframe and output it as `pd.DataFrame`\n",
 " - Because of how the `.csv`file is formated you must ensure that the encoding is latin1 `encoding='latin1'`"
 ]
@@ -492,7 +492,13 @@
 "This can be done using a histogram or a line plot to visualize the frequency or trend of the time values over the range.\n",
 "\n",
 "We want to use `matplotlib.pyplot` for displaying the histogram because it has a useful function hist which does exactly that. \n",
-"To do that we use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram."
+"To do that we use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram.\n",
+"\n",
+"> **Hint:** If the x-axis displays year values as floats (e.g. 2010.0), you can force integer ticks with:\n",
+"> ```python\n",
+"> from matplotlib.ticker import MaxNLocator\n",
+"> plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n",
+"> ```"
 ]
 },
 {
@@ -522,6 +528,11 @@
 "plt.xlabel('Year')\n",
 "plt.ylabel('Frequency')\n",
 "plt.grid(True)\n",
+"\n",
+"# Hint: If the x-axis shows float values for years, you can force integer ticks:\n",
+"# from matplotlib.ticker import MaxNLocator\n",
+"# plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n",
+"\n",
 "plt.show()"
 ]
 },
@@ -541,7 +552,11 @@
 "For this step, we will try to forwardfill the dataframe:\n",
 "```python\n",
 " cleaned_happiness = cleaned_happiness.sort_values(by=['Country name', 'year']).ffill()\n",
-"```"
+"```\n",
+"\n",
+"> **Hint:** Pandas also provides `.bfill()` (backward fill), which propagates the *next* valid observation backward.\n",
+"> While `.ffill()` fills a `NaN` with the value from the previous row, `.bfill()` fills it with the value from the next row.\n",
+"> Depending on your data and use case, one may be more appropriate than the other — or you might even combine both to fill gaps from both directions."
 ]
 },
 {
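
Note on the `.ffill()` / `.bfill()` hint added in this hunk: the difference between the two directions is easiest to see on a tiny series. The snippet below is a minimal sketch on made-up values, not part of the notebook; the variable names are illustrative only.

```python
import pandas as pd

# Toy series (hypothetical values) with gaps at the start, middle and end.
s = pd.Series([None, 1.0, None, 3.0, None])

forward = s.ffill()       # leading gap stays NaN: there is no previous value to copy
backward = s.bfill()      # trailing gap stays NaN: there is no next value to copy
both = s.ffill().bfill()  # forward first, then backward closes the leading gap

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

Combining both directions only leaves a `NaN` when the whole series is empty, which is why the order of the two calls matters mainly for interior gaps.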
@@ -554,8 +569,8 @@
 "Complete the function below to:\n",
 "\n",
 "1. Fill in missing years for every country (so we have an entry for every year between 2005 and 2023 and every country).\n",
-" Do this by initializing a DataFrame with `pd.DataFrame()` with a list.\n",
-" Then left-merge the happiness dataframe to it with `pd.merge()`.\n",
+" First, use a list comprehension to create a list of `(country, year)` tuples for all country/year combinations. Then pass that list to `pd.DataFrame()` with `columns=['Country name', 'year']` to build a complete scaffold DataFrame.\n",
+" Finally, use the `.merge()` method on this scaffold to left-merge the original happiness data onto it (matching on `'Country name'` and `'year'`). This ensures every country has a row for every year, with `NaN` where data was missing.\n",
 "2. Fill all missing values in the year 2005 with the value 1.\n",
 " Use the `.fillna()` function.\n",
 "3. Forwardfill all the remaining years with the function `.ffill()`.\n",
@@ -584,7 +599,7 @@
 "def solution_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:\n",
 " \"\"\"Cleans the dataset by adding missing year and country values\n",
 "\n",
-" 1. Add in missing years for every country\n",
+" 1. Create a DataFrame with all (country, year) combinations, then left-merge the happiness data onto it\n",
 " 2. Fill the minimum year with values of 1\n",
 " 3. Forward fill the rest of the years\n",
 "\n",
@@ -636,7 +651,7 @@
 "To add a regional indicator, we'll need another dataset that maps countries to their respective regions.\n",
 "We can then merge this data with our main happiness dataframe based on the 'Country name' column.\n",
 "\n",
-"Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/plotly_intro/` directory with columns `'Country name'` and `'Region indicator'`.\n",
+"Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/data_exploration/` directory with columns `'Country name'` and `'Region indicator'`.\n",
 "Then we could merge the dataframes on the country name and get a regional indicator for all the countries.\n",
 "\n",
 "Let's explore the `pd.merge` function for that.\n",
@@ -698,8 +713,8 @@
 "def solution_add_regional_indicator(cleaned_happiness_df: pd.DataFrame, region_df: pd.DataFrame) -> pd.DataFrame:\n",
 " \"\"\"Adds a regional indicator to the dataset\n",
 "\n",
-" 1. Merge the cleaned_happiness_df with region_df on the 'Country name' and 'year' columns\n",
-" 2. Fill the missing values in the 'Region indicator' column with 'Unknown'\n",
+" 1. Merge the cleaned_happiness_df with region_df on the 'Country name' column\n",
+" 2. Fill the missing values in the 'Regional indicator' column with 'Unknown'\n",
 "\n",
 " Args:\n",
 " cleaned_happiness_df : DataFrame containing the happiness data\n",
@@ -773,7 +788,6 @@
 "x_column = 'Freedom to make life choices'\n",
 "y_column = 'Life Ladder'\n",
 "description_column = 'Country name'\n",
-"# time_column = 'year'\n",
 "\n",
 "\n",
 "\n",
@@ -831,7 +845,6 @@
 "x_column = 'Freedom to make life choices'\n",
 "y_column = 'Life Ladder'\n",
 "description_column = 'Country name'\n",
-"# time_column = 'year'\n",
 "figure = get_scatter_figure(dataset, x_column, y_column, description_column)\n",
 "\n",
 "def frame_by_year(dataset, year, x_column, y_column, description_column):\n",
@@ -1050,7 +1063,7 @@
 "# append to final_happiness_df\n",
 "dataset = pd.merge(complete_happiness_df, resized_log_gdp_df, on=['Country name', 'year'], how='left')\n",
 "# Check out year 2010\n",
-"dataset[dataset['year'] == 2010].head(10)\n"
+"dataset[dataset['year'] == 2010].head(10)"
 ]
 },
 {
@@ -1129,6 +1142,8 @@
 "\n",
 "\n",
 "# Add time frames\n",
+"years = dataset[time_column].unique()\n",
+"years.sort()\n",
 "figure['frames'] = [frame_by_year_with_size(dataset, year, x_column, y_column, description_column) for year in years]\n",
 "\n",
 "\n",
@@ -1267,13 +1282,13 @@
 "However, this might not be feasible if you need it in other places and is generally not the best solution. \n",
 "Last but not least you can try to fix it yourself.\n",
 "\n",
-"So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data.plotly_intro`.\n",
+"So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data/data_exploration`.\n",
 "It is quite a short library so quite managable.\n",
-"Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/plotly_intro/bubbly.py` until the same code below compiles.\n",
+"Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/data_exploration/bubbly.py` until the same code below compiles.\n",
 "\n",
 "Note: You will need to restart the kernel after applying changes to the packages.\n",
 "\n",
-"(If you are interested in a solution, we have a fixed version under `tutorial.my_bubbly.py`, feel free to check the differences.)"
+"(If you are interested in a solution, we have a fixed version under `tutorial/my_bubbly.py`, feel free to check the differences.)"
 ]
 },
 {
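
Note on the reworded cleaning exercise in this file: the three steps (scaffold, fill the first year, forward fill) can be sketched on toy data as below. This is an illustration under assumptions, not the notebook's reference solution: the values are made up, the year range is shortened to 2005–2006, and only the `Life Ladder` column is filled.

```python
import pandas as pd

# Toy data (hypothetical values): country "B" has no row for 2006.
happiness = pd.DataFrame(
    {
        "Country name": ["A", "A", "B"],
        "year": [2005, 2006, 2005],
        "Life Ladder": [5.0, 5.5, 4.0],
    }
)

countries = happiness["Country name"].unique()
years = range(2005, 2007)  # shortened range for the example

# Step 1: scaffold with one row per (country, year), then left-merge the data onto it.
scaffold = pd.DataFrame(
    [(country, year) for country in countries for year in years],
    columns=["Country name", "year"],
)
complete = scaffold.merge(happiness, on=["Country name", "year"], how="left")
n_missing_after_merge = int(complete["Life Ladder"].isna().sum())

# Step 2: fill gaps in the first year with 1, so that the forward fill below
# never carries a value over from the previous country.
first_year = complete["year"] == 2005
complete.loc[first_year, "Life Ladder"] = complete.loc[first_year, "Life Ladder"].fillna(1)

# Step 3: forward fill the remaining gaps in (country, year) order.
complete = complete.sort_values(by=["Country name", "year"]).ffill()
print(complete)
```

The left merge is what guarantees one row per country and year: the scaffold determines the rows, and the original data only contributes values where a matching key exists.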

tutorial/data_exploration_helper.py

Lines changed: 1 addition & 0 deletions
@@ -380,6 +380,7 @@ def set_layout(
 figure["layout"]["title"] = title
 figure["layout"]["hovermode"] = "closest"
 figure["layout"]["showlegend"] = show_legend
+figure["layout"]["legend"] = {"itemsizing": "constant"}
 figure["layout"]["margin"] = {"b": 50, "t": 50, "pad": 5}
 
 if width:
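
Note on the added `legend` line: in Plotly, `legend.itemsizing` set to `"constant"` draws legend symbols at a fixed size instead of scaling them with each trace's marker size, which matters for bubble charts. The added line assigns a fresh dict, so any legend settings already present on the figure would be replaced. If that ever becomes a concern, a merging variant could look like the sketch below; `with_constant_legend` is a hypothetical helper name, not something in `data_exploration_helper.py`.

```python
def with_constant_legend(figure: dict) -> dict:
    """Set legend.itemsizing to "constant" without dropping other legend keys."""
    layout = figure.setdefault("layout", {})
    # setdefault keeps an existing legend dict and only adds the one key.
    layout.setdefault("legend", {})["itemsizing"] = "constant"
    return figure


# Usage on a plain figure dict that already carries a legend setting.
fig = {"data": [], "layout": {"legend": {"x": 0.1}}}
print(with_constant_legend(fig)["layout"]["legend"])
```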

tutorial/tests/test_30_introduction_data_exploration.py

Lines changed: 9 additions & 2 deletions
@@ -29,8 +29,10 @@ def test_read_in_dataframe(input_arg, function_to_test):
 # Read in the data
 happiness_df = reference_read_in_dataframe(path_to_happiness)
 
+result = function_to_test(path_to_happiness)
+assert isinstance(result, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
 # Check if the two DataFrames are equal
-assert happiness_df.equals(function_to_test(path_to_happiness))
+assert happiness_df.equals(result)
 
 
 def reference_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:
@@ -71,6 +73,8 @@ def test_clean_dataset(input_arg, function_to_test):
 
 clean_ref = reference_clean_dataset(hapiness_df)
 clean_sol = function_to_test(hapiness_df)
+assert isinstance(clean_sol, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
+assert "Country name" in clean_sol.columns and "year" in clean_sol.columns, "The output should contain 'Country name' and 'year' columns"
 clean_ref_sorted = clean_ref.sort_values(by=["Country name", "year"]).reset_index(
 drop=True
 )
@@ -126,6 +130,7 @@ def test_add_regional_indicator(input_arg, function_to_test):
 clean_ref = reference_add_regional_indicator(cleaned_happiness_df, region_df)
 clean_sol = function_to_test(cleaned_happiness_df, region_df)
 
+assert isinstance(clean_sol, pd.DataFrame), "Your function should return a pd.DataFrame, but it returned None. Did you forget the return statement?"
 # Check if the two DataFrames are equal
 assert clean_ref.equals(clean_sol)
 
@@ -237,7 +242,9 @@ def test_frames_with_category(input_arg, function_to_test):
 bubble_size_column,
 )
 
-# Check if the two DataFrames are equal
+assert isinstance(clean_sol, dict), "Your function should return a dict, but it returned None. Did you forget the return statement?"
+assert "data" in clean_sol and "name" in clean_sol, "The returned dict should have 'data' and 'name' keys"
+# Check if the two are equal
 assert clean_ref == clean_sol
 
 
 from plotly.offline import iplot
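
Note on the new `isinstance` assertions: the message always says the function "returned None", but the check also fails for any other wrong type, such as a list or a tuple. A message that reports the actual type could be built as in the sketch below; `describe_wrong_return` is a hypothetical helper, not part of the test suite.

```python
def describe_wrong_return(result, expected_type: type) -> str:
    """Build an assertion message that names the type that was actually returned."""
    # Only suggest a missing return statement when the result really is None.
    hint = " Did you forget the return statement?" if result is None else ""
    return (
        f"Your function should return a {expected_type.__name__}, "
        f"but it returned {type(result).__name__}.{hint}"
    )


# Usage inside a test, with a list standing in for a wrong return value.
clean_sol = []
message = describe_wrong_return(clean_sol, dict)
print(message)
```

In the tests above this would be used as `assert isinstance(clean_sol, dict), describe_wrong_return(clean_sol, dict)`.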
