
Commit 47826a5 ("[skip ci] fix typos")
1 parent: f36ddee

1 file changed: 32_language_modeling_1.ipynb
Lines changed: 29 additions & 18 deletions
@@ -50,7 +50,8 @@
 "The purpose of this multi-part notebook is to give a gentle introduction to the PyTorch library, with a focus on language modeling.\n",
 "At high-level, we will build a progressively more complex **character-level language model** that can generate more text similar to the training data.\n",
 "\n",
-"The final result is not meant to be a \"production-ready\" language model, but rather a simple yet effective example of how to use PyTorch for language modeling. Along the way, we will learn the fundamental building blocks that lay the groundwork for more complex models, including the base models that powers the state-of-the-art LLMs and derived products, like our friendly and always helpful assistant ChatGPT.\n",
+"The final result is not meant to be a \"production-ready\" language model, but rather a simple yet effective example of how to use PyTorch for language modeling.\n",
+"Along the way, we will learn the fundamental building blocks that lay the groundwork for more complex models, including the base models that power the state-of-the-art LLMs and derived products, like our friendly and always helpful assistant ChatGPT.\n",
 "\n",
 "The final implementation will allow you to experiment with different models, starting from the most simple and basic one (a **bigram** model) to a more complex **RNN** and finally a **Transformer** model.\n"
 ]
@@ -454,7 +455,7 @@
 "source": [
 "The results are quite terrible, although they're reasonable given the simplicity of the model and the patterns we're trying to capture.\n",
 "\n",
-"The core problem is that a bigram model looks only the the frequency of a pair of tokens, but it has zero information of what's most likely to come before or after those two tokens.\n",
+"The core problem is that a bigram model looks only at the frequency of a pair of tokens, but it has zero information about what's most likely to come before or after those two tokens.\n",
 "You can imagine that the obvious next step is a **trigram** model, which looks at the frequency of a triplet of tokens.\n",
 "\n",
 "Let's now improve a bit our code: the first thing is to compute **all** the probabilities once, and then sample from them.\n",
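The count-normalize-sample procedure the notebook describes can be sketched in plain Python. This is an illustrative stand-in, not the notebook's code: the tiny `words` list and the helper `next_char_probs` are made up for the example (the notebook uses a full names dataset with `.` as the start/end marker).

```python
import random
from collections import Counter

# Toy corpus; the notebook uses a names dataset, with '.' marking start/end.
words = ["emma", "olivia", "ava"]

# Count the frequency of every character pair (bigram).
bigram_counts = Counter()
for w in words:
    chars = "." + w + "."
    for ch1, ch2 in zip(chars, chars[1:]):
        bigram_counts[(ch1, ch2)] += 1

def next_char_probs(current):
    """Normalize the counts into a conditional distribution P(next | current)."""
    pairs = {c2: n for (c1, c2), n in bigram_counts.items() if c1 == current}
    total = sum(pairs.values())
    return {c: n / total for c, n in pairs.items()}

# Sample a first letter: the character most likely to follow the start marker '.'.
probs = next_char_probs(".")
chars, weights = zip(*probs.items())
print(random.choices(chars, weights=weights, k=1)[0])
```

Precomputing all conditional distributions once (as the hunk suggests) just means calling `next_char_probs` for every character up front instead of per sample.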
@@ -545,9 +546,11 @@
545546
"source": [
546547
"We have built a bigram language model by counting letter combination frequencies, then normalizing and sampling with that probability base.\n",
547548
"\n",
548-
"We trained the model, we sampled from the model (iteratively, character-wise). But its still bad at coming up with names.\n",
549+
"We trained the model, we sampled from the model (iteratively, character-wise).\n",
550+
"But its still bad at coming up with names.\n",
549551
"\n",
550-
"But how bad? We know that the model's \"knowledge\" is represented by `P`, but how can we boil down the model's quality in one value?\n",
552+
"But how bad?\n",
553+
"We know that the model's \"knowledge\" is represented by `P`, but how can we boil down the model's quality in one value?\n",
551554
"\n",
552555
"First, let's look at the bigrams we created from the dataset: the bigrams to `emma` are `.e, em, mm, ma, a.`.\n",
553556
"**What probability does the model assign to each of those bigrams?**"
@@ -622,7 +625,8 @@
 "We calculated a negative log-likelihood, because this follows the convention of setting the goal to minimize the **loss function**, the function that drives the optimization (i.e., training) process.\n",
 "The lower the loss/negative log-likelihood, the better the model.\n",
 "\n",
-"We got $2.45$ for the model. The lower, the better.\n",
+"We got $2.45$ for the model.\n",
+"The lower, the better.\n",
 "We need to find the parameters that reduce this value.\n",
 "\n",
 "**Goal:** Maximize likelihood of the trained data w. r. t. model parameters in `P`\n",
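The average negative log-likelihood the hunk refers to can be sketched as follows. The probabilities below are made-up placeholders, not the values the notebook's `P` actually assigns to the bigrams of `.emma.`:

```python
import math

# Hypothetical model probabilities for the bigrams of ".emma."
# (placeholder values, not the ones computed in the notebook).
bigram_probs = {".e": 0.05, "em": 0.04, "mm": 0.03, "ma": 0.39, "a.": 0.20}

# Log-likelihood: sum of the log-probabilities of the observed bigrams.
log_likelihood = sum(math.log(p) for p in bigram_probs.values())

# Average negative log-likelihood: the loss to minimize during training.
nll = -log_likelihood / len(bigram_probs)
print(f"{nll:.2f}")
```

A perfect model would assign probability 1 to every observed bigram, giving a loss of 0; lower probabilities on the observed data push the loss up.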
@@ -850,12 +854,14 @@
850854
"Something like: $\\text{output} = \\text{activation}(\\text{weights} \\cdot \\text{input} + \\text{bias})$\n",
851855
"\n",
852856
"If we were to feed our characters as integer indexes, we would have a sequence of integer indexes as input.\n",
853-
"If `a` is 1 and `z` is 25, the weight applied to `z` will have 25 times more impact on the output than `a`. This creates an arbitrary and misleading mathematical relationship.\n",
857+
"If `a` is 1 and `z` is 25, the weight applied to `z` will have 25 times more impact on the output than `a`.\n",
858+
"This creates an arbitrary and misleading mathematical relationship.\n",
854859
"\n",
855860
"Moreover, during the training (optimization) phase, the updates to the weights will be proportional to their input values.\n",
856861
"Larger input values will cause larger updates to the weights, which can lead to unstable training.\n",
857862
"\n",
858-
"And lastly, the network has no reference to the potential value range. It doesn't know that the values are constrained to a specific set (like 0-25 for letters).\n",
863+
"And lastly, the network has no reference to the potential value range.\n",
864+
"It doesn't know that the values are constrained to a specific set (like 0-25 for letters).\n",
859865
"\n",
860866
"To address all these issues, we can use **one-hot encoding**.\n",
861867
"One-hot encoding each letter means creating a vector where only one position has a value of 1 (corresponding to that letter's position in the alphabet) and all other positions are 0.\n",
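A minimal sketch of the one-hot idea for a 27-symbol vocabulary (`.` plus `a`-`z`), written in plain Python for illustration; the notebook itself works with PyTorch tensors (e.g. via `torch.nn.functional.one_hot`), and the `stoi` mapping here is an assumption about the index layout:

```python
# Vocabulary: '.' as start/end marker plus the 26 lowercase letters.
vocab = "." + "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a 27-dim vector with a single 1.0 at the character's index."""
    vec = [0.0] * len(vocab)  # floats, matching the .float() conversion
    vec[stoi[ch]] = 1.0
    return vec

# Encode the 5 input characters of ".emma" (one row per character).
xenc = [one_hot(ch) for ch in ".emma"]
print(len(xenc), len(xenc[0]))  # 5 rows, 27 columns
```

Every row has the same magnitude, so no character gets an arbitrarily larger influence on the output than another, which is exactly the problem the paragraph above describes.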
@@ -921,7 +927,7 @@
 "metadata": {},
 "source": [
 "One problem is the `dtype` of the one-hot encoded tensor.\n",
-"It's `torch.int64` by default (inferred from our data), but we need `torch.float32` to have a the input suitable for the mathematical operations the network will perform.\n",
+"It is `torch.int64` by default (inferred from our data), but we need `torch.float32` to have an input suitable for the mathematical operations the network will perform.\n",
 "\n",
 "We can convert it using `.float()`:"
 ]
@@ -972,7 +978,8 @@
 "metadata": {},
 "source": [
 "`W` is a **single** neuron.\n",
-"Multiplying it by `xenc` makes it 'react' to the one-hot encoded input. The result is a $5\\times 1$ vector.\n",
+"Multiplying it by `xenc` makes it 'react' to the one-hot encoded input.\n",
+"The result is a $5\\times 1$ vector.\n",
 "\n",
 "`.emma` has $5$ characters, we have $1$ neuron.\n",
 "When this neuron processes the 5 characters of \".emma\", it produces 5 activation values, one for each character.\n",
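The single-neuron multiplication described here can be sketched without PyTorch as a `(5, 27) @ (27, 1)` product. The weight shape and the character indices (`.`=0, `a`=1, ..., `z`=26) are assumptions for the sketch, not taken from the notebook:

```python
import random

random.seed(0)

# One neuron: a column of 27 random weights (a stand-in for torch.randn(27, 1)).
W = [[random.gauss(0, 1)] for _ in range(27)]

# Five one-hot rows standing in for the characters of ".emma"
# (assumed indices: '.'=0, 'e'=5, 'm'=13, 'a'=1).
indices = [0, 5, 13, 13, 1]
xenc = [[1.0 if j == i else 0.0 for j in range(27)] for i in indices]

# Matrix multiply: each one-hot row simply selects one weight of W.
out = [[sum(x[k] * W[k][0] for k in range(27))] for x in xenc]

print(len(out), len(out[0]))  # a 5x1 result: one activation per character
```

Because the inputs are one-hot, each output is just the weight at that character's index, which is why the two `m` characters produce identical activations.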
@@ -1041,7 +1048,7 @@
10411048
"metadata": {},
10421049
"source": [
10431050
"We want the neurons per input (per character) to come up with a $27$-dimensional activation of values that could be transformed into a normal distribution on what character to choose next.\n",
1044-
"We've seen that with the Bigram's probability distribution, given info per character on what character ist most likely to follow.\n",
1051+
"We've seen that with the Bigram's probability distribution, given info per character on what character is most likely to follow.\n",
10451052
"\n",
10461053
"Right now, for every character we get $27$ numbers, positive and negative, but not following a normal distribution.\n",
10471054
"\n",
@@ -1091,7 +1098,7 @@
10911098
"source": [
10921099
"It might seem unusual, but after this transformation, we have a set of numbers that we can use just like the actual counts from the bigram model.\n",
10931100
"\n",
1094-
"All the values are non-negativethink of them as \"pseudo-counts.\"\n",
1101+
"All the values are non-negative: think of them as \"pseudo-counts.\"\n",
10951102
"Now, our goal is simply to adjust the weights `W` so that the network produces the correct character indices as output."
10961103
]
10971104
},
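The transformation discussed in this hunk (exponentiate the raw activations into non-negative "pseudo-counts", then normalize them into probabilities, i.e. a softmax) can be sketched as follows; the logits are arbitrary example values:

```python
import math

# Example raw activations (logits) for a 5-symbol vocabulary:
# arbitrary values, some positive, some negative.
logits = [1.2, -0.7, 0.3, 2.0, -1.5]

# Exponentiating makes every value non-negative: the "pseudo-counts".
counts = [math.exp(v) for v in logits]

# Normalizing the pseudo-counts yields a probability distribution.
total = sum(counts)
probs = [c / total for c in counts]

print(f"{sum(probs):.4f}")  # the probabilities sum to 1.0000
```

This mirrors the bigram model exactly: counts divided by their sum give probabilities, except here the "counts" come out of the network instead of the dataset.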
@@ -1127,7 +1134,8 @@
11271134
"id": "77",
11281135
"metadata": {},
11291136
"source": [
1130-
"We can now evaluate how well our neural network predicts the next character in a sequence. For each bigram (pair of consecutive characters), we:\n",
1137+
"We can now evaluate how well our neural network predicts the next character in a sequence.\n",
1138+
"For each bigram (pair of consecutive characters), we:\n",
11311139
"\n",
11321140
"- Feed the input character to the neural network.\n",
11331141
"- Get the predicted probability distribution for the next character.\n",
@@ -1195,9 +1203,11 @@
11951203
"- In each case, the probability assigned to the correct next character is relatively low, meaning the model is not yet confident in its predictions.\n",
11961204
"- The most likely character predicted by the model is often not the correct one.\n",
11971205
"- The negative log likelihood values are relatively high, indicating the model is “surprised” by the true next character.\n",
1198-
"- The final line reports the average negative log likelihood (loss) across all bigrams: 3.44. This is a key metric for training—the goal is to minimize this value by adjusting the model’s weights.\n",
1206+
"- The final line reports the average negative log likelihood (loss) across all bigrams: 3.44.\n",
1207+
"This is a key metric for training — the goal is to minimize this value by adjusting the model’s weights.\n",
11991208
"\n",
1200-
"The network is currently not very accurate at predicting the next character, as shown by the low probabilities for the correct answers and the high loss. This is expected at the start, before any training."
1209+
"The network is currently not very accurate at predicting the next character, as shown by the low probabilities for the correct answers and the high loss.\n",
1210+
"This is expected at the start, before any training."
12011211
]
12021212
},
12031213
{
@@ -1216,10 +1226,11 @@
12161226
"Remember that we started with `W` as a completely random matrix of floats.\n",
12171227
"Hoping that random initialization would yield a good solution is like hoping that a random collection of Lego bricks will build a house.\n",
12181228
"\n",
1219-
"Instead, we will actively improve the model’s predictions. Specifically, we will adjust the weights in the matrix `W` to increase the probability of correctly predicting the second character in each bigram.\n",
1229+
"Instead, we will actively improve the model’s predictions.\n",
1230+
"Specifically, we will adjust the weights in the matrix `W` to increase the probability of correctly predicting the second character in each bigram.\n",
12201231
"\n",
12211232
"This is done by computing how the loss changes with respect to each weight (i.e., calculating the gradients), and then updating the weights in a way to reduce the overall loss.\n",
1222-
"This processcalled **gradient-based optimization**enables the neural network to learn from its mistakes and become better at predicting the next character in the sequence."
1233+
"This processcalled **gradient-based optimization**enables the neural network to learn from its mistakes and become better at predicting the next character in the sequence."
12231234
]
12241235
},
12251236
{
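The update rule described in this hunk (compute how the loss changes with respect to a weight, then step against that gradient) can be illustrated on a one-parameter toy loss. In the notebook the same loop runs over the full matrix `W` with PyTorch's autograd; here the gradient is approximated numerically and the quadratic loss is a made-up stand-in for the NLL:

```python
def loss(w):
    """Toy loss with its minimum at w = 3.0 (a stand-in for the real NLL)."""
    return (w - 3.0) ** 2

w = 0.0      # random-ish starting point, like the random init of W
lr = 0.1     # learning rate
eps = 1e-6   # step size for the numeric gradient

for _ in range(100):
    # Numeric gradient: how the loss changes with respect to w.
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    # Update against the gradient to reduce the loss.
    w -= lr * grad

print(round(w, 3))  # converges toward 3.0, where the loss is minimal
```

Each iteration moves `w` a fraction of the way toward the minimum; with autograd the gradient is computed exactly and for every weight at once, but the step `w -= lr * grad` is the same idea.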
@@ -1415,7 +1426,7 @@
14151426
"id": "95",
14161427
"metadata": {},
14171428
"source": [
1418-
"A this step, we can optionally tell PyTorch to use a GPU if available."
1429+
"At this step, we can optionally tell PyTorch to use a GPU if available."
14191430
]
14201431
},
14211432
{
@@ -1532,7 +1543,7 @@
15321543
"While our current neural network doesn't outperform the simpler bigram approach, its architecture allows for natural extension to more complex patterns.\n",
15331544
"As we add layers and consider longer sequences of characters, the neural network framework will scale elegantly.\n",
15341545
"\n",
1535-
"Consider the fundamental scaling challenge: with a bigram model looking at just the previous character, we need to store $27^2 = 729$ probabilities (one for each possible character pair).\n",
1546+
"Consider the fundamental scaling challenge: with a bigram model looking at just the previous character, we need to store $27^2 = 729$ probabilities (one for each possible character pairs).\n",
15361547
"If we wanted to consider the previous 10 characters to make better predictions:\n",
15371548
"\n",
15381549
"- A traditional n-gram approach would require storing $27^{10} \\approx 205$ trillion different probability values—completely impractical in terms of memory and impossible to train with limited data.\n",
