|
81 | 81 | "Python libraries are collections of pre-written code that provide reusable functions and tools for specific tasks.\n", |
82 | 82 | "They significantly extend the capabilities of Python, allowing you to perform complex operations without writing everything from scratch.\n", |
83 | 83 | "This saves time and effort, promotes code reusability, and helps you focus on the higher-level logic of your data analysis and visualization workflows.\n", |
| 84 | + "\n", |
84 | 85 | "Have a look at the following plot: imagine you'd have to code every detail from scratch! Instead, we rely on the help of libraries to do most of the heavy lifting for us.\n", |
85 | 86 | "However, there is still quite a bit to do before we can reproduce this plot exactly.\n", |
86 | 87 | "The plot below was generated with Plotly." |
|
250 | 251 | "#### Getting help and inspiration\n", |
251 | 252 | "\n", |
252 | 253 | "One of the most important skills is being able to look up, understand, and get help on any function of a library.\n", |
253 | | - "Usually, we start with some inspiration, as we gave above with the plot, there might be someone who posted something which you would like to reproduce but with a twist or you would like to change something.\n", |
| 254 | + "Typically, the creative process begins with an inspiration, such as a plot or an idea you've encountered elsewhere.\n", |
| 255 | + "You might wish to reproduce this concept with a unique twist or modify specific elements to suit your vision.\n", |
254 | 256 | "This is generally a good starting point.\n", |
255 | | - "However afterwards you won't have the documentation of all the functions so you need to have the skill to find documentation and understand the requirements for function, sometimes you even need to know more about the inner workings of a functions implementations.\n", |
256 | 257 | "\n", |
257 | | - "There are several ways to access documentation.\n", |
| 258 | + "After that, though, you won't have a full documentation guide for every function.\n", |
| 259 | + "You'll need to know how to track down the info you need, figure out what each function requires, and sometimes even dig into how they work under the hood.\n", |
| 260 | + "\n", |
| 261 | + "There are several ways to access a library's documentation.\n", |
258 | 262 | "One way, assuming it is a well-maintained package online, is to find the documentation website.\n", |
259 | | - "For pandas, this is a great place to find details on functions, and changes that may have been made with different versions and explore alternatives to a given function.\n", |
| 263 | + "For pandas, this is a great place to find details on functions, or changes that may have been made in different versions of the library, as well as alternatives to a given function.\n", |
260 | 264 | "\n", |
261 | 265 | "[Pandas Documentation](https://pandas.pydata.org/docs/index.html)\n", |
262 | 266 | "\n", |
|
270 | 274 | "If you are struggling with a specific function, you can print out its signature and docstring with `help(function)` or, in IPython and Jupyter, `function?`.\n", |
271 | 275 | "\n", |
272 | 276 | "When reading the documentation you might get overwhelmed.\n", |
273 | | - "Keep a lookout for the function parameters, many of which may be optional, and good documentations tend to have an example to get a feel for the function.\n", |
| 277 | + "Keep a lookout for the function parameters, many of which may be optional.\n", |
274 | 278 | "\n", |
275 | | - "For most of this tutorial we will give you the infos about a function that you need, however if you want to know more or need some extra information then use these tools to inform yourself." |
| 279 | + "For most of this tutorial we will give you the info you need about a function. However, if you want to know more or need some extra information, use these tools to inform yourself." |
276 | 280 | ] |
277 | 281 | }, |
278 | 282 | { |
|
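As a minimal sketch of the `help(function)` approach mentioned above, the snippet below inspects a function's signature and docstring using only the standard library. The `scale` function is a hypothetical stand-in for any library function; `inspect.signature` and `inspect.getdoc` expose roughly the same information that `help` prints.

```python
import inspect


def scale(values, factor=2.0):
    """Multiply every number in `values` by `factor`."""
    return [v * factor for v in values]


# Hypothetical example function; a library function can be inspected the same way.
signature = inspect.signature(scale)
docstring = inspect.getdoc(scale)

print(signature)
print(docstring)

# Optional parameters are the ones that have a default value.
optional = [
    name
    for name, param in signature.parameters.items()
    if param.default is not inspect.Parameter.empty
]
print(optional)
```

Listing the optional parameters this way can help when a long signature feels overwhelming: only the parameters without defaults have to be supplied.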
281 | 285 | "source": [ |
282 | 286 | "## First step: Data import and exploration\n", |
283 | 287 | "\n", |
284 | | - "Already getting your data from a file to a variable you can work with can be a headache.\n", |
| 288 | + "Getting your data from a file to a variable you can work with can be a headache.\n", |
285 | 289 | "How do I read the file, how do I choose delimiters and what encoding does the file have?\n", |
286 | 290 | "\n", |
287 | 291 | "We will use the `pd.read_csv` function from pandas to read \"The World Happiness Report\", a study on how people rate their happiness in different countries." |
|
352 | 356 | "metadata": {}, |
353 | 357 | "outputs": [], |
354 | 358 | "source": [ |
355 | | - "happyness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
356 | | - "happyness.describe()" |
| 359 | + "happiness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
| 360 | + "happiness.describe()\n", |
| 361 | + "happiness.groupby('year')['Generosity'].mean()" |
357 | 362 | ] |
358 | 363 | }, |
359 | 364 | { |
|
381 | 386 | "| `pd.DataFrame['column_name'].value_counts()` | Returns a Series containing counts of unique values in a column. | `df['Status'].value_counts()` |\n", |
382 | 387 | "| `pd.DataFrame.sort_values(by='column_name')` | Sorts the DataFrame by the values in a specified column. | `df.sort_values(by='Date')` |\n", |
383 | 388 | "| `pd.DataFrame.sort_index()` | Sorts the DataFrame by its index. | `df.sort_index()` |\n", |
384 | | - "| `pd.DataFrame.isna().sum()` | Returns the number of missing (NaN) values in each column. | `df.isna().sum()` |\n", |
| 389 | + "| `pd.DataFrame.isnull().sum()` | Returns the number of missing (NaN, None) values in each column. | `df.isnull().sum()` |\n", |
| 390 | + "| `pd.DataFrame.isna().sum()` | Same result as `isnull()`; in pandas, `isnull` is an alias of `isna`. | `df.isna().sum()` |\n", |
385 | 391 | "| `pd.DataFrame.duplicated().sum()` | Returns the number of duplicate rows in the DataFrame. | `df.duplicated().sum()` |\n", |
386 | 392 | "| `pd.DataFrame['column_name'].unique()` | Returns a NumPy array of the unique values in a column. | `df['Country'].unique()` |\n", |
387 | 393 | "| `pd.DataFrame.sample(n=5)` | Returns a random sample of items from the DataFrame (default is 1). | `df.sample(n=10)` |\n", |
|
414 | 420 | "source": [ |
415 | 421 | "import pandas as pd\n", |
416 | 422 | "\n", |
417 | | - "happyness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
| 423 | + "df = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
418 | 424 | "\n", |
419 | | - "# Assuming your dataframe is loaded into 'df'\n", |
420 | | - "df = happyness\n", |
421 | 425 | "print(\"--- First few rows of the dataframe ---\")\n", |
422 | 426 | "print(df.head(2))\n", |
423 | 427 | "print(\"\\n\")\n", |
|
448 | 452 | " print(f\"Minimum year: {min_year}\")\n", |
449 | 453 | " print(f\"Maximum year: {max_year}\")\n", |
450 | 454 | " print(f\"Year range: {max_year - min_year} years\")\n", |
451 | | - " print(\"\\n\")\n", |
452 | | - "\n" |
| 455 | + " print(\"\\n\")" |
453 | 456 | ] |
454 | 457 | }, |
455 | 458 | { |
|
483 | 486 | "## Finding the limits\n", |
484 | 487 | "\n", |
485 | 488 | "If we are plotting a function, it is important to know the order of magnitude of some of the data.\n", |
486 | | - "In our case for example we want to have an animated plot over some years and it helps to know for which years we actually have data.\n", |
487 | | - "In a dataframe we can e.g. use the `.min()` and `.max()` methods. \n", |
| 489 | + "For example, in our case we want to have an animated plot over some years and it helps to know for which years we actually have data.\n", |
| 490 | + "To do that in a dataframe we can use the `.min()` and `.max()` methods. \n", |
488 | 491 | "Optionally, to understand the distribution or \"order of magnitude\" of your time values, you might want to plot out the years and check the rough distribution to identify any anomalies or gaps in the data.\n", |
489 | 492 | "This can be done using a histogram or a line plot to visualize the frequency or trend of the time values over the range.\n", |
490 | 493 | "\n", |
491 | 494 | "We want to use `matplotlib.pyplot` for displaying the histogram because it has a useful function, `hist`, which does exactly that. \n", |
492 | | - "\n", |
493 | | - "We use the `matplotlib.pyplot as plt` library and there there is `.hist` function which will produce a histogram." |
| 495 | + "To do that we import `matplotlib.pyplot` as `plt` and use its `hist` function, which produces a histogram." |
494 | 496 | ] |
495 | 497 | }, |
496 | 498 | { |
|
502 | 504 | "import pandas as pd\n", |
503 | 505 | "import matplotlib.pyplot as plt\n", |
504 | 506 | "\n", |
505 | | - "happiness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
506 | | - "years = happiness['year'].unique()\n", |
| 507 | + "df = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')\n", |
| 508 | + "years = df['year'].unique()\n", |
507 | 509 | "print(f\"Unique years in the dataset: {sorted(years)}\")\n", |
508 | 510 | "\n", |
509 | | - "df = happiness\n", |
510 | | - "df['year'] = happiness['year'].astype(int) # Ensure the years are integers\n", |
| 511 | + "df['year'] = df['year'].astype(int) # Ensure the years are integers\n", |
511 | 512 | "\n", |
512 | 513 | "# Determine the minimum and maximum years\n", |
513 | 514 | "min_year = df['year'].min()\n", |
|
549 | 550 | "source": [ |
550 | 551 | "### Exercise: Complete Happiness\n", |
551 | 552 | "\n", |
552 | | - "In this exercise, we want to complete the dataframe with missing values.\n", |
553 | | - "Complete the function below to \n", |
| 553 | + "In this exercise, we want to complete the dataframe's missing values.\n", |
| 554 | + "Complete the function below to:\n", |
554 | 555 | "\n", |
555 | 556 | "1. Fill in missing years for every country (so we have an entry for every year between 2005 and 2023 and every country).\n", |
556 | 557 | " Do this by initializing a DataFrame with `pd.DataFrame()` with a list.\n", |
|
743 | 744 | "\n", |
744 | 745 | "The data can be seen as the plot data initially, and the frames are then the animation steps.\n", |
745 | 746 | "\n", |
746 | | - "Let's first try to create a simple scatter plot, for that we populate a figure dictionary data with a trace (`dict`) which contains an array of values for `x`, an array of values for `y`, a `mode` ('markers') and an array of strings for the text which is what appears when hovered over.\n", |
| 747 | + "Let's first try to create a simple scatter plot.\n", |
| 748 | + "For that, we populate the figure dictionary's data with a trace (`dict`) which contains an array of values for `x`, an array of values for `y`, a `mode` ('markers') and an array of strings for the text, which is what appears when a point is hovered over.\n", |
747 | 749 | "\n", |
748 | 750 | "```python\n", |
749 | 751 | "trace = {\n", |
|
867 | 869 | "source": [ |
868 | 870 | "### Adding slider bar for time scale\n", |
869 | 871 | "\n", |
870 | | - "The slider needs configuring, this would require a bit of reading up what exactly you need or if you have an example you can make use of the existing functions.\n", |
| 872 | + "The slider needs configuring.\n", |
| 873 | + "This would require a bit of reading up on what exactly you need or, if you have an example, you can make use of the existing functions.\n", |
871 | 874 | "The following is heavily inspired by the module [`bubbly`](https://github.com/AashitaK/bubbly).\n", |
872 | 875 | "This is simply a configuration and contains only the years data." |
873 | 876 | ] |
|
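To give a feel for what such a configuration looks like, here is a minimal sketch of a slider dictionary built from a list of years. The helper name `make_year_slider`, the example years, and the timing values are illustrative assumptions, not the tutorial's or bubbly's actual code; the keys follow Plotly's layout slider schema (`steps`, `method`, `args`, `label`).

```python
def make_year_slider(years, duration=300):
    """Build a Plotly-style slider dict with one step per year (hypothetical helper)."""
    steps = []
    for year in years:
        steps.append({
            # Each step animates to the frame whose name matches the year.
            "method": "animate",
            "label": str(year),
            "args": [
                [str(year)],
                {
                    "frame": {"duration": duration, "redraw": False},
                    "mode": "immediate",
                    "transition": {"duration": duration},
                },
            ],
        })
    return {
        "active": 0,
        "currentvalue": {"prefix": "Year: "},
        "pad": {"t": 50},
        "steps": steps,
    }


# Example years are placeholders, not taken from the dataset.
slider = make_year_slider([2005, 2006, 2007])
print([step["label"] for step in slider["steps"]])
```

The resulting dictionary would go into the figure's layout under `sliders`, for example `figure['layout']['sliders'] = [slider]`. Each step's first argument has to match the `name` of a frame for the animation to find it.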
1252 | 1255 | "However, if you are browsing through possible libraries to use you might also find less well-maintained libraries, ones that may only have a single author and ones that haven't been touched in a while. \n", |
1253 | 1256 | "\n", |
1254 | 1257 | "Here we give you a direct example: this tutorial was inspired by the [bubbly](https://github.com/AashitaK/bubbly) package.\n", |
1255 | | - "However, with an update from the pandas library it is no longer compatible with newer versions of pandas and will through an error (see codeblock below). \n", |
| 1258 | + "However, after an update to the pandas library, it is no longer compatible with newer versions of pandas and will throw an error (see code block below). \n", |
1256 | 1259 | "So what to do in that case?\n", |
1257 | 1260 | "\n", |
1258 | | - "There are many options, you can inform the author of this problem on GitHub.\n", |
1259 | | - "Of course, they may not have time to fix this.\n", |
| 1261 | + "There are many options.\n", |
| 1262 | + "For example, you can inform the author of this problem on GitHub.\n", |
| 1263 | + "But of course, they may not have time to fix this.\n", |
1260 | 1264 | "You can find a different library; however, it might not work exactly the way you want.\n", |
1261 | | - "You can downgrade your pandas library to be compatible, if you use pip show pandas you will see what version you have, it is possible to uninstall and reinstall a specific version.\n", |
1262 | | - "However, this might not be feasible if you need it in other places and is generally not a pretty solution. \n", |
| 1265 | + "You can downgrade your pandas library to be compatible.\n", |
| 1266 | + "If you use `pip show pandas` you will see what version you have, and it is possible to uninstall and reinstall a specific version.\n", |
| 1267 | + "However, this might not be feasible if you need it in other places and is generally not the best solution. \n", |
1263 | 1268 | "Last but not least you can try to fix it yourself.\n", |
1264 | 1269 | "\n", |
1265 | 1270 | "So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data/plotly_intro`.\n", |
1266 | 1271 | "It is quite a short library, so it should be manageable.\n", |
1267 | 1272 | "Try to figure out what the error is exactly and then fix the library locally by modifying only the file `data/plotly_intro/bubbly.py` until the code below runs without errors.\n", |
1268 | 1273 | "\n", |
1269 | | - "Note: You will need to restart the kernel after changes to the packages.\n", |
| 1274 | + "Note: You will need to restart the kernel after applying changes to the packages.\n", |
1270 | 1275 | "\n", |
1271 | | - "(If you are interested in a solution, we have a fixed version under tutorial.my_bubbly.py, feel free to check the differences.)" |
| 1276 | + "(If you are interested in a solution, we have a fixed version under `tutorial.my_bubbly.py`; feel free to check the differences.)" |
1272 | 1277 | ] |
1273 | 1278 | }, |
1274 | 1279 | { |
|
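Before deciding whether to downgrade, it can help to check the installed version from within Python. The sketch below uses the standard-library `importlib.metadata` module as an alternative to `pip show pandas`; the helper name `installed_version` is a hypothetical choice, and the version in the trailing comment is only a placeholder.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("pandas"))

# To reinstall a specific version, one could run in a terminal
# (the version number is a placeholder):
#   pip install "pandas==<version>"
```

Returning `None` instead of raising keeps the check usable in a notebook cell even when the package is not installed in the current environment.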