|
299 | 299 | "\n", |
300 | 300 | "In the cell below you should write the code that solves the first exercise:\n", |
301 | 301 | "\n", |
302 | | - " - Use the `path_to_happiness` which will be `data/plotly_intro/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n", |
| 302 | + " - Use the `path_to_happiness` which will be `data/data_exploration/World-happiness-report-updated_2024.csv` which leads to a CSV file to read in\n", |
303 | 303 | " - Read in the CSV into a dataframe and output it as `pd.DataFrame`\n", |
304 | 304 | " - Because of how the `.csv` file is formatted, you must ensure that the encoding is latin1: `encoding='latin1'`" |
305 | 305 | ] |
|
492 | 492 | "This can be done using a histogram or a line plot to visualize the frequency or trend of the time values over the range.\n", |
493 | 493 | "\n", |
494 | 494 | "We want to use `matplotlib.pyplot` for displaying the histogram because it has a useful function hist which does exactly that. \n", |
495 | | - "To do that we use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram." |
| 495 | + "To do that, we import `matplotlib.pyplot as plt` and use its `.hist` function, which produces a histogram.\n", |
| 496 | + "\n", |
| 497 | + "> **Hint:** If the x-axis displays year values as floats (e.g. 2010.0), you can force integer ticks with:\n", |
| 498 | + "> ```python\n", |
| 499 | + "> from matplotlib.ticker import MaxNLocator\n", |
| 500 | + "> plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n", |
| 501 | + "> ```" |
496 | 502 | ] |
497 | 503 | }, |
498 | 504 | { |
|
522 | 528 | "plt.xlabel('Year')\n", |
523 | 529 | "plt.ylabel('Frequency')\n", |
524 | 530 | "plt.grid(True)\n", |
| 531 | + "\n", |
| 532 | + "# Hint: If the x-axis shows float values for years, you can force integer ticks:\n", |
| 533 | + "# from matplotlib.ticker import MaxNLocator\n", |
| 534 | + "# plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))\n", |
| 535 | + "\n", |
525 | 536 | "plt.show()" |
526 | 537 | ] |
527 | 538 | }, |
|
541 | 552 | "For this step, we will try to forwardfill the dataframe:\n", |
542 | 553 | "```python\n", |
543 | 554 | " cleaned_happiness = cleaned_happiness.sort_values(by=['Country name', 'year']).ffill()\n", |
544 | | - "```" |
| 555 | + "```\n", |
| 556 | + "\n", |
| 557 | + "> **Hint:** Pandas also provides `.bfill()` (backward fill), which propagates the *next* valid observation backward.\n", |
| 558 | + "> While `.ffill()` fills a `NaN` with the value from the previous row, `.bfill()` fills it with the value from the next row.\n", |
| 559 | + "> Depending on your data and use case, one may be more appropriate than the other — or you might even combine both to fill gaps from both directions." |
545 | 560 | ] |
546 | 561 | }, |
547 | 562 | { |
|
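The difference between `.ffill()` and `.bfill()` described in the hint above can be seen on a small example. This is only a sketch: the toy frame and its values are invented for illustration and are not taken from the happiness report. It also shows one thing worth checking in your own data: a plain fill on a frame sorted by country and year can carry a value across the boundary between two countries, while a fill grouped by `'Country name'` cannot.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from the happiness report).
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "year": [2005, 2006, 2005, 2006],
    "Life Ladder": [4.0, np.nan, np.nan, 7.0],
}).sort_values(by=["Country name", "year"])

# Plain fills look only at row order, so they ignore country boundaries.
plain_ffill = toy["Life Ladder"].ffill()  # NaN takes the previous valid value
plain_bfill = toy["Life Ladder"].bfill()  # NaN takes the next valid value

# Grouped fills stay within each country.
grouped_ffill = toy.groupby("Country name")["Life Ladder"].ffill()
grouped_bfill = toy.groupby("Country name")["Life Ladder"].bfill()

comparison = toy.assign(
    plain_ffill=plain_ffill,
    grouped_ffill=grouped_ffill,
    plain_bfill=plain_bfill,
    grouped_bfill=grouped_bfill,
)
print(comparison)
```

In the plain forward fill, country B's 2005 row receives country A's value, whereas the grouped version leaves it as `NaN`. Which behaviour is appropriate depends on the data and the use case.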
554 | 569 | "Complete the function below to:\n", |
555 | 570 | "\n", |
556 | 571 | "1. Fill in missing years for every country (so we have an entry for every year between 2005 and 2023 and every country).\n", |
557 | | - " Do this by initializing a DataFrame with `pd.DataFrame()` with a list.\n", |
558 | | - " Then left-merge the happiness dataframe to it with `pd.merge()`.\n", |
| 572 | + " First, use a list comprehension to create a list of `(country, year)` tuples for all country/year combinations. Then pass that list to `pd.DataFrame()` with `columns=['Country name', 'year']` to build a complete scaffold DataFrame.\n", |
| 573 | + " Finally, use the `.merge()` method on this scaffold to left-merge the original happiness data onto it (matching on `'Country name'` and `'year'`). This ensures every country has a row for every year, with `NaN` where data was missing.\n", |
559 | 574 | "2. Fill all missing values in the year 2005 with the value 1.\n", |
560 | 575 | " Use the `.fillna()` function.\n", |
561 | 576 | "3. Forwardfill all the remaining years with the function `.ffill()`.\n", |
|
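The three steps listed above can be sketched on a toy frame as follows. This is a minimal illustration under stated assumptions, not the notebook's reference solution: the data values, the short year range, and the helper names `scaffold`, `complete`, and `value_columns` are all made up for the example.

```python
import pandas as pd

# Toy stand-in for the happiness data (values invented for illustration).
happiness = pd.DataFrame({
    "Country name": ["A", "A", "B"],
    "year": [2005, 2007, 2006],
    "Life Ladder": [4.0, 4.5, 6.0],
})

countries = happiness["Country name"].unique()
years = range(2005, 2008)  # shortened for the sketch; the exercise spans 2005 to 2023

# Step 1: scaffold with every (country, year) combination, then left-merge the data onto it.
scaffold = pd.DataFrame(
    [(country, year) for country in countries for year in years],
    columns=["Country name", "year"],
)
complete = scaffold.merge(happiness, on=["Country name", "year"], how="left")

# Step 2: fill missing values in the first year with 1.
value_columns = [c for c in complete.columns if c not in ("Country name", "year")]
first_year = complete["year"] == 2005
complete.loc[first_year, value_columns] = complete.loc[first_year, value_columns].fillna(1)

# Step 3: forward-fill the remaining gaps. The scaffold is ordered by country,
# then year, and every 2005 row is now filled, so values cannot carry over
# from one country into the next.
complete = complete.ffill()
print(complete)
```

The left merge keeps the row order of the scaffold, which is what makes the plain `.ffill()` in the last step safe here.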
584 | 599 | "def solution_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:\n", |
585 | 600 | " \"\"\"Cleans the dataset by adding missing year and country values\n", |
586 | 601 | "\n", |
587 | | - " 1. Add in missing years for every country\n", |
| 602 | + " 1. Create a DataFrame with all (country, year) combinations, then left-merge the happiness data onto it\n", |
588 | 603 | " 2. Fill the minimum year with values of 1\n", |
589 | 604 | " 3. Forward fill the rest of the years\n", |
590 | 605 | "\n", |
|
636 | 651 | "To add a regional indicator, we'll need another dataset that maps countries to their respective regions.\n", |
637 | 652 | "We can then merge this data with our main happiness dataframe based on the 'Country name' column.\n", |
638 | 653 | "\n", |
639 | | - "Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/plotly_intro/` directory with columns `'Country name'` and `'Region indicator'`.\n", |
| 654 | + "Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/data_exploration/` directory with columns `'Country name'` and `'Region indicator'`.\n", |
640 | 655 | "Then we could merge the dataframes on the country name and get a regional indicator for all the countries.\n", |
641 | 656 | "\n", |
642 | 657 | "Let's explore the `pd.merge` function for that.\n", |
|
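A merge of this kind might look like the sketch below. The two toy frames are invented for illustration, and the region column is assumed to be called `'Regional indicator'`; adjust the name to whatever your mapping file actually uses.

```python
import pandas as pd

# Toy frames, invented for illustration only.
cleaned_happiness = pd.DataFrame({
    "Country name": ["A", "A", "B"],
    "year": [2005, 2006, 2005],
    "Life Ladder": [4.0, 4.2, 6.0],
})
# Assumed column name: 'Regional indicator'. Country B is deliberately missing.
regions = pd.DataFrame({
    "Country name": ["A"],
    "Regional indicator": ["Region 1"],
})

# A left merge keeps every happiness row, even when no region is known.
merged = pd.merge(cleaned_happiness, regions, on="Country name", how="left")

# Countries without a match get NaN, which is replaced by a placeholder.
merged["Regional indicator"] = merged["Regional indicator"].fillna("Unknown")
print(merged)
```

Merging on `'Country name'` alone is sufficient here because the mapping has one row per country rather than one row per country and year.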
698 | 713 | "def solution_add_regional_indicator(cleaned_happiness_df: pd.DataFrame, region_df: pd.DataFrame) -> pd.DataFrame:\n", |
699 | 714 | " \"\"\"Adds a regional indicator to the dataset\n", |
700 | 715 | "\n", |
701 | | - " 1. Merge the cleaned_happiness_df with region_df on the 'Country name' and 'year' columns\n", |
702 | | - " 2. Fill the missing values in the 'Region indicator' column with 'Unknown'\n", |
| 716 | + " 1. Merge the cleaned_happiness_df with region_df on the 'Country name' column\n", |
| 717 | + " 2. Fill the missing values in the 'Regional indicator' column with 'Unknown'\n", |
703 | 718 | "\n", |
704 | 719 | " Args:\n", |
705 | 720 | " cleaned_happiness_df : DataFrame containing the happiness data\n", |
|
773 | 788 | "x_column = 'Freedom to make life choices'\n", |
774 | 789 | "y_column = 'Life Ladder'\n", |
775 | 790 | "description_column = 'Country name'\n", |
776 | | - "# time_column = 'year'\n", |
777 | 791 | "\n", |
778 | 792 | "\n", |
779 | 793 | "\n", |
|
831 | 845 | "x_column = 'Freedom to make life choices'\n", |
832 | 846 | "y_column = 'Life Ladder'\n", |
833 | 847 | "description_column = 'Country name'\n", |
834 | | - "# time_column = 'year'\n", |
835 | 848 | "figure = get_scatter_figure(dataset, x_column, y_column, description_column)\n", |
836 | 849 | "\n", |
837 | 850 | "def frame_by_year(dataset, year, x_column, y_column, description_column):\n", |
|
1050 | 1063 | "# append to final_happiness_df\n", |
1051 | 1064 | "dataset = pd.merge(complete_happiness_df, resized_log_gdp_df, on=['Country name', 'year'], how='left')\n", |
1052 | 1065 | "# Check out year 2010\n", |
1053 | | - "dataset[dataset['year'] == 2010].head(10)\n" |
| 1066 | + "dataset[dataset['year'] == 2010].head(10)" |
1054 | 1067 | ] |
1055 | 1068 | }, |
1056 | 1069 | { |
|
1129 | 1142 | "\n", |
1130 | 1143 | "\n", |
1131 | 1144 | "# Add time frames\n", |
| 1145 | + "years = dataset[time_column].unique()\n", |
| 1146 | + "years.sort()\n", |
1132 | 1147 | "figure['frames'] = [frame_by_year_with_size(dataset, year, x_column, y_column, description_column) for year in years]\n", |
1133 | 1148 | "\n", |
1134 | 1149 | "\n", |
|
1267 | 1282 | "However, this might not be feasible if you need it in other places and is generally not the best solution. \n", |
1268 | 1283 | "Last but not least you can try to fix it yourself.\n", |
1269 | 1284 | "\n", |
1270 | | - "So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data.plotly_intro`.\n", |
| 1285 | + "So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data/data_exploration`.\n", |
1271 | 1286 | "It is quite a short library, so it is quite manageable.\n", |
1272 | | - "Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/plotly_intro/bubbly.py` until the same code below compiles.\n", |
| 1287 | + "Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/data_exploration/bubbly.py` until the same code below compiles.\n", |
1273 | 1288 | "\n", |
1274 | 1289 | "Note: You will need to restart the kernel after applying changes to the packages.\n", |
1275 | 1290 | "\n", |
1276 | | - "(If you are interested in a solution, we have a fixed version under `tutorial.my_bubbly.py`, feel free to check the differences.)" |
| 1291 | + "(If you are interested in a solution, we have a fixed version under `tutorial/my_bubbly.py`, feel free to check the differences.)" |
1277 | 1292 | ] |
1278 | 1293 | }, |
1279 | 1294 | { |
|