<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Dimensionality Reduction: LLM's from scratch series]]></title><description><![CDATA[A multipart series that rebuilds Large Language Models from first principles — starting with the mathematical foundations (derivatives, gradients, backpropagation, optimization), then moving through the core building blocks (embeddings, attention, transformers, training objectives), and finally into practical LLM engineering (tokenization, scaling, evaluation, inference, tooling, and production concerns). The goal is to make every concept intuitive and rigorous, connecting the theory to how modern LLM systems actually work.]]></description><link>https://www.dimensionalityreduction.com/s/llms-from-scratch-series</link><image><url>https://substackcdn.com/image/fetch/$s_!hB8u!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245f5146-493e-4ef9-bcfb-83cdb3311ca0_1024x1024.png</url><title>Dimensionality Reduction: LLM&apos;s from scratch series</title><link>https://www.dimensionalityreduction.com/s/llms-from-scratch-series</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 22:28:35 GMT</lastBuildDate><atom:link href="https://www.dimensionalityreduction.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nuno Fonseca]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dimensionalityreduction@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dimensionalityreduction@substack.com]]></itunes:email><itunes:name><![CDATA[Nuno Fonseca]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nuno 
Fonseca]]></itunes:author><googleplay:owner><![CDATA[dimensionalityreduction@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dimensionalityreduction@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nuno Fonseca]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Large Language Models from scratch - Part 1]]></title><description><![CDATA[Or how to deconstruct the discovery of the century in (mostly) understandable parts.]]></description><link>https://www.dimensionalityreduction.com/p/large-language-models-from-scratch</link><guid isPermaLink="false">https://www.dimensionalityreduction.com/p/large-language-models-from-scratch</guid><dc:creator><![CDATA[Nuno Fonseca]]></dc:creator><pubDate>Fri, 23 Jan 2026 01:39:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!791n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!791n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!791n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!791n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!791n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!791n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!791n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1887794,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!791n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!791n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!791n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!791n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94afd57e-9b5b-4c3c-aa83-507c46ce4caf_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Large Language Models are everywhere now. OpenAI&#8217;s ChatGPT took the world by storm in late 2022, and since then the number of LLMs has exploded, whether proprietary, open-weights, or open-source. At the time of this writing, more than <strong>2.3M models</strong> are registered on Hugging Face, <strong>317k</strong> in the category of &#8216;<a href="https://huggingface.co/models?pipeline_tag=text-generation&amp;sort=trending">Text Generation</a>&#8217; alone. OpenAI reportedly has <a href="https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/#:~:text=Sam%20Altman%20says%20ChatGPT%20has%20hit%20800M%20weekly%20active%20users%20%7C%20TechCrunch">800M weekly active users</a>, and Anthropic reached <a href="https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation">300,000 business customers</a> and approximately 30 million monthly active users for its Claude AI assistant as of Q2 2025. Google&#8217;s Gemini app <a href="https://blog.google/products/gemini/gemini-3/#note-from-ceo">surpassed 650 million users per month</a>, and AI Overviews - the intelligent summary that appears at the top of a Google search - now reaches <a href="https://blog.google/products/gemini/gemini-3/#note-from-ceo">2 billion users every month</a>. We&#8217;re definitely witnessing a disruptive technology that shows no signs of decelerating.</p><p>But what exactly are Large Language Models? How do they work, what are their mathematical foundations, and how do they differ from each other? Most importantly: what are the steps to build one from scratch? That is the goal of this series of posts, which will be as comprehensible as possible. 
There are quite a few non-trivial concepts one needs to be comfortable with in order to grasp the core ideas behind LLMs, and what better way to understand the process than actually developing one from scratch? The idea is to simplify the theory as much as possible, reducing it to first principles, while still delivering an organic whole that encompasses state-of-the-art techniques not too far from production-grade LLMs. Of course, training a production-grade model requires substantial investment, so we will adapt everything to be manageable on consumer-grade hardware.</p><p>What&#8217;s on the menu in this part?</p><ol><li><p>Machine Learning Models</p></li><li><p>Neural Networks</p><ol><li><p>Neural Network Inference</p></li><li><p>Neural Network Training</p></li></ol></li></ol><p>Let&#8217;s begin.</p><p>What you see today when you interact with ChatGPT is a rather simple system: there&#8217;s a chat input where you type what you want the model to answer, you hit enter, and then you wait for the model&#8217;s response. 
Depending on the model you can also upload documents, images, or audio as context for the chat prompt; you can also select whether you want the model to &#8220;think&#8221; harder, do a &#8220;deep research&#8221;, search the web, etc. But at the end of the day you have a system that receives an input, and a black box that ruminates on that input and outputs an answer.</p><p>But what&#8217;s in that box?</p><p>For the purposes of our reflection we will consider that behind such a system there is only one model. As you will see, a production-grade system may rely on several models, not just one and not all of the same type, but we will focus our analysis specifically on one type of model: the Large Language Model.</p><h3><strong>Machine Learning Models</strong></h3><p>An LLM, as the name implies, is first and foremost a <strong>Language Model</strong> that is Large. When we say Language Model we mean a Machine Learning Language Model. A Machine Learning model is what you end up with when you train an algorithm on data, typically lots of it: essentially a complex function that learns patterns in the training data in order to make predictions or decisions on new, unseen data, without explicit programming for every scenario. In this regard we can say that the machine automatically learns and generalizes its understanding of the training dataset it was fed in the first place, and is able to answer, most of the time correctly, on data it has never seen.</p><p>Machine learning algorithms are typically divided into three types:</p><ul><li><p><strong>Supervised Learning</strong>: Learns from labeled data (input-output pairs) to predict outcomes. 
It&#8217;s normally used in two types of problems that differ mainly on whether the result is discrete or continuous: <strong>classification problems</strong> - where the response belongs to a set of classes (e.g., spam detection) - and <strong>regression problems</strong> - where the response is continuous (e.g., house price prediction).</p></li><li><p><strong>Unsupervised Learning</strong>: Finds patterns in unlabeled data, often grouping similar items (clustering).</p></li><li><p><strong>Reinforcement Learning</strong>: Learns through trial and error, receiving rewards or penalties (e.g., autonomous driving).</p></li></ul><p>Each type has its own set of machine learning algorithms. For classification problems the most popular ones are Support Vector Machines (SVM), Decision Trees, Ensemble Trees, Naive Bayes, <em>k</em>-Nearest Neighbors (KNN), and&#8230; Neural Networks.</p><p>For regression problems the list starts with Linear Regression, Polynomial Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, and&#8230; Neural Networks.</p><p>As you can see, there are lots of machine learning algorithms. An honorable mention goes to <strong>XGBoost</strong>, normally considered the Swiss Army knife of ML algorithms, with an excellent track record on both classification and regression problems. It&#8217;s an Ensemble Trees method that combines several decision trees - also called weak learners - and aggregates the result using a <strong>boosting technique</strong>: trees are built sequentially, with each new tree focusing on correcting the errors of the previous ones. There are at least two more techniques in the Ensemble Trees family: <strong>bagging</strong>, which trains multiple trees independently on different random subsets (with replacement) of the training data (<strong>Random Forest</strong> is a good example of this), and <strong>stacking/voting</strong>, which trains different types of models (or the same type with different parameters) and uses a meta-model or voting to combine their final predictions.</p><p>A deep dive into each and every one of these algorithms is outside the scope of our analysis, but we will go much deeper into just one of them, because it is at the root of LLMs: Neural Networks.</p><h3><strong>Neural Networks</strong></h3><p>Artificial Neural Networks (ANN) are fascinating. They&#8217;re directly inspired by the biological brain, which is composed of a network of neurons that communicate via electrochemical signals. Artificial neurons, or nodes, receive input signals, process them, and pass them on to other nodes. Biological neurons pass information, or &#8220;fire&#8221;, when the input signal exceeds a certain threshold, and that&#8217;s exactly what happens with artificial neurons: they use activation functions that decide whether they should &#8220;fire&#8221;. The concept of the artificial neuron goes back to the 1940s, when Warren McCulloch and Walter Pitts created the first mathematical model of one. The field has been advancing ever since, with the development of the Perceptron, the introduction of the backpropagation algorithm, and a whole series of different and ever more sophisticated ANN architectures.</p><p>Let&#8217;s start with a simple example. 
Consider the following ANN:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qj1W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qj1W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 424w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 848w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 1272w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qj1W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png" width="1456" height="1229" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1229,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qj1W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 424w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 848w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 1272w, https://substackcdn.com/image/fetch/$s_!Qj1W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f30889a-0e2e-4a12-950e-a76feb1eec51_1614x1362.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This network has an <strong>Input</strong> Layer, one <strong>Hidden</strong> Layer, and an <strong>Output</strong> Layer. Typically, all neural networks have one Input layer and one Output layer, plus one or more Hidden layers. As the name implies, the input layer holds as many neurons as needed for the input data of the problem at hand. Similarly, the output layer holds as many neurons as needed to hold a prediction result. The hidden layers represent the network&#8217;s core <em>feature extraction</em> capability (more on this in a bit), essentially being where all the computation work is done. The number of nodes in each hidden layer, and the number of hidden layers, is largely arbitrary. 
It really depends on the problem we want to solve, but generally the more hidden layers there are, the deeper the network is (hence the notion of deep learning, applied to deep neural networks), and each successive hidden layer learns increasingly abstract representations of the data that flows from one hidden layer to the next.</p><p>The way information flows from the input to the output layer - also called a <strong>forward pass</strong> - and from one node to another is by performing a <strong>linear calculation</strong> followed by an <strong>activation function</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iSut!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iSut!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 424w, https://substackcdn.com/image/fetch/$s_!iSut!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 848w, https://substackcdn.com/image/fetch/$s_!iSut!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 1272w, https://substackcdn.com/image/fetch/$s_!iSut!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!iSut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iSut!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 424w, https://substackcdn.com/image/fetch/$s_!iSut!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 848w, https://substackcdn.com/image/fetch/$s_!iSut!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 1272w, https://substackcdn.com/image/fetch/$s_!iSut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdea9fc-3e19-4c47-9bc9-50d0b1e88a0f_1498x842.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Each neuron performs a linear calculation over all its input connections, followed by an activation function applied to the result of that linear calculation.</p><p>First it sums up all the information coming into the neuron:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z = \\sum_{i=1}^{n} w_i x_i + b&quot;,&quot;id&quot;:&quot;KGGVQWDADQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_i&quot;,&quot;id&quot;:&quot;JZEJTXMNUV&quot;}" 
data-component-name="LatexBlockToDOM"></div><p><strong>Input / Feature</strong>: The data being fed into the neuron. We will get back to this <em>feature</em> notion a bit later.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_i&quot;,&quot;id&quot;:&quot;ZBYQEXXCZZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Weight</strong>: Determines the importance of each input feature. It acts as a slope, controlling how strongly the input influences the output.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b&quot;,&quot;id&quot;:&quot;TTDZNFHAXD&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Bias</strong>: A trainable parameter that shifts the activation function left or right, allowing the model to fit data that does not pass through the origin. It mostly adds flexibility in what would otherwise be just a sequence of multiplications.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z&quot;,&quot;id&quot;:&quot;CECLKZEPMZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Pre-activation / Linear Combination</strong>: The result of the linear transformation before applying the activation function.</p><p></p><p>Lastly we apply the <strong>activation function</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\alpha(z)&quot;,&quot;id&quot;:&quot;HGYQMNUNEL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha&quot;,&quot;id&quot;:&quot;WYHALWKGEF&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Activation function</strong>: sigmoid, ReLU, tanh, softmax, etc. 
These non-linear functions introduce non-linearity into what would otherwise be just a composition of linear equations; without them, all hidden layers would collapse into a single linear transformation, severely limiting the network&#8217;s ability to learn complex patterns in the data.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y&quot;,&quot;id&quot;:&quot;GXSWWMKOFR&quot;}" data-component-name="LatexBlockToDOM"></div><p>The final result that is <strong>passed to the next node</strong>.</p><h3><strong>Neural Network Inference</strong></h3><p>Let&#8217;s apply a simple problem to the example network from earlier; hopefully this will put these simple calculations into some context. Consider the scenario of a student who needs to pass a final exam, having attended the respective classes and having studied some hours per day. We consider that <strong>x1 is the number of hours the student studies per day</strong>, <strong>x2 is the percentage of class attendance</strong>, and we want to predict <strong>whether the student will pass or fail the exam</strong>. There are several ways to model even this simple problem: we could try to infer the actual continuous score of the exam, from 0-100%; or the discrete grade, from A to F; or simply whether the student will pass or fail, as a binary classification. These three different outcomes imply three different output layers, activation functions and, as we will see, loss functions. But we will stick to the simplest option and predict simply whether the student will pass or fail the exam.</p><p>Let&#8217;s consider, as an input example:</p><ul><li><p>x1: 5 hours</p></li><li><p>x2: 60%</p></li></ul><p>So the student studied <strong>5 hours a day</strong> and attended <strong>60% of classes</strong>. 
Let&#8217;s find out if, according to the network, he will pass or fail the exam.</p><p>An <strong>already</strong> <strong>trained network</strong> behaves like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kns8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kns8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 424w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 848w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kns8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png" width="1456" height="1279" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1279,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kns8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 424w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 848w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!Kns8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda310ffe-eae3-437a-b975-9e7c8d03f52b_1628x1430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We input both variables in the Input layer. Notice that both x1 and x2 are normalized values between [0, 1]. To normalize the x1 value we consider 10h as a maximum value, so 5h is 50%, or 0.5. Regarding attendance, 60% is already in the [0,1] interval. 
It&#8217;s important that all the values be normalized so both inputs contribute proportionally and no feature dominates the others.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZPK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZPK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 424w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 848w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png" width="1456" height="1280" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305859,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lZPK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 424w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 848w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!lZPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d964eca-51cb-4c47-bbaa-96f80dc5c6cd_1626x1430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Then for each connection between every input node and a hidden layer&#8217;s node there&#8217;s a <strong>weight</strong> value attributed to that connection. In the picture above you can see, for instance, between <strong>x1</strong> and <strong>h1</strong> the value of <strong>+0.82</strong>. And between <strong>x1</strong> and <strong>h3</strong> the value of <strong>-0.45</strong>. Each connection then has a <strong>learned weight</strong>:</p><pre><code>From Study Hours (x&#8321;):
  &#8594; h&#8321;: w = 0.82
  &#8594; h&#8322;: w = 0.71
  &#8594; h&#8323;: w = -0.45</code></pre><pre><code>From Attendance (x&#8322;):
  &#8594; h&#8321;: w = 0.65
  &#8594; h&#8322;: w = 0.89
&#8594; h&#8323;: w = 0.78</code></pre><p>These values were obtained during training and are static once the network is trained. Whatever the inputs, these values are always the same. Bias values are also learned and static; we show them next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HYNf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HYNf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 424w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 848w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HYNf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png" width="1456" height="1284" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1284,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:421680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HYNf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 424w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 848w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!HYNf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd213d48d-84fd-4ce5-b336-76cbbe9f66cd_1626x1434.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><pre><code>Biases: [-0.50, -0.60, -0.30]</code></pre><p>You can see the biases in the top right corner of each hidden layer&#8217;s node.</p><p>Next we perform the linear calculations referenced earlier: for each hidden layer&#8217;s node we compute the following over all incoming connections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z = \\sum_{i=1}^{n} w_i x_i + b&quot;,&quot;id&quot;:&quot;WMFDNCCODI&quot;}" data-component-name="LatexBlockToDOM"></div><p>And we obtain the following values, which you can see in the picture on the h1, h2 and h3 nodes:</p><pre><code>h&#8321;: z = (0.82 * 0.500) + (0.65 * 0.600) - 0.50 = 0.3000
h&#8322;: z = (0.71 * 0.500) + (0.89 * 0.600) - 0.60 = 0.2890
h&#8323;: z = (-0.45 * 0.500) + (0.78 * 0.600) - 0.30 = -0.057</code></pre><p>So now, after the linear calculations, you can see each result on its hidden layer node. Next we apply the <strong>ReLU</strong> activation function to each hidden layer node&#8217;s result, as we stated earlier:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\alpha(z)&quot;,&quot;id&quot;:&quot;VBYKGUFWAO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Which in ReLU&#8217;s case is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(z) = max(0, z)&quot;,&quot;id&quot;:&quot;IDIZYGOUDF&quot;}" data-component-name="LatexBlockToDOM"></div><p>ReLU is an acronym for <strong>Rectified Linear Unit</strong>. It simply resets negative values to zero and lets positive values through. We could&#8217;ve chosen other functions; there are various alternatives, and we will go deeper into this later.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!arb5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!arb5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 424w, https://substackcdn.com/image/fetch/$s_!arb5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 848w, 
https://substackcdn.com/image/fetch/$s_!arb5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 1272w, https://substackcdn.com/image/fetch/$s_!arb5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!arb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png" width="1456" height="1281" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1281,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:417114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!arb5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 424w, 
https://substackcdn.com/image/fetch/$s_!arb5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 848w, https://substackcdn.com/image/fetch/$s_!arb5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 1272w, https://substackcdn.com/image/fetch/$s_!arb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0c5f16-4980-42a4-b41e-415bf8b6c39f_1618x1424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, looking at the h1, h2 and h3 nodes, only node <strong>h3</strong> changed value, from the negative <strong>-0.057</strong> to <strong>0</strong>. This is because, as we said earlier, ReLU resets negative values to zero and lets positive values through; the latter is what happened to the <strong>h1</strong> and <strong>h2</strong> nodes, whose values stayed the same. So at this point each of h1, h2 and h3 holds the result of its linear calculation followed by the ReLU activation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XdaC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XdaC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 424w, https://substackcdn.com/image/fetch/$s_!XdaC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 848w, https://substackcdn.com/image/fetch/$s_!XdaC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!XdaC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XdaC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png" width="1456" height="1282" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1282,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:437449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XdaC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 424w, https://substackcdn.com/image/fetch/$s_!XdaC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 848w, https://substackcdn.com/image/fetch/$s_!XdaC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XdaC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9274843e-daac-4c19-add8-7102368a7fe3_1620x1426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next we continue, doing the same calculations per connection. The next connection&#8217;s weights obtained in training:</p><pre><code>h&#8321; &#8594; output: w = 0.85
h&#8322; &#8594; output: w = 0.72
h&#8323; &#8594; output: w = -0.55</code></pre><p>Bias we reveal next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZmRX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZmRX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 424w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 848w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZmRX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png" width="1456" height="1285" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1285,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:473577,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZmRX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 424w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 848w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!ZmRX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef958f3-34b1-45df-8832-4bf0d874f479_1616x1426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><pre><code>Output bias: b = -0.20</code></pre><p>Now we perform the same computation again, this time using the h1, h2 and h3 values computed earlier as our <strong>Input / Feature</strong> values:</p><pre><code>z = (0.85 * 0.300) + (0.72 * 0.289) + (-0.55 * 0) - 0.20 = 0.2631</code></pre><p>Now we apply another activation function; this time we choose the <strong>sigmoid function</strong>, which squashes values into probabilities between [0,1]:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n&#963;(z) = \\frac{1}{1 + e^{-z}}\n\n&quot;,&quot;id&quot;:&quot;NHOWCXNOXV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!nksZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nksZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 424w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 848w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nksZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png" width="1456" height="1277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1277,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:484498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nksZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 424w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 848w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 1272w, https://substackcdn.com/image/fetch/$s_!nksZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6fdc513-de1e-43c3-85d2-4ab45216573c_1626x1426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Applying the sigmoid function to the obtained value <strong>0.263</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(0.263) = \\frac{1}{1 + e^{-0.263}}\n\n= \\frac{1}{1 + 0.7688}\n              = \\frac{1}{1.7688}\n              \\approx 0.5655\n&quot;,&quot;id&quot;:&quot;EDUXUOVMKE&quot;}" data-component-name="LatexBlockToDOM"></div><p>We end up with the full prediction of the network for the first question, that was to know if the student passed or not given that the student studied 5 hours per day and attended 60% of classes: <strong>56.5%</strong>. So yes, the student would pass (barely).</p><p>You can see that choosing sigmoid as activating function in the output layer was not innocent. 
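</p><p>The output-neuron computation above can be reproduced in a few lines. A minimal sketch, assuming (as the ordering of the products suggests) that 0.85, 0.72 and -0.55 are the output weights and 0.300, 0.289 and 0 the hidden activations h1, h2 and h3:</p>

```python
import math

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hidden-layer activations computed earlier (the third is 0)
h = [0.300, 0.289, 0.0]
# Output-layer weights and bias from the example
w = [0.85, 0.72, -0.55]
b = -0.20

z = sum(wi * hi for wi, hi in zip(w, h)) + b
p = sigmoid(z)
print(round(z, 4), round(p, 3))  # 0.2631 0.565 -> >= 0.5, so "pass"
```

<p>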
Even though the prediction only required a binary classification - pass / fail - the network&#8217;s output values are <strong>always continuous</strong> - we will see why in a bit - so we treat anything equal to or above 0.5 as passing the exam, and anything below that as failing it.</p><p>As you can see, aside from some possibly unfamiliar functions, the core network&#8217;s calculations are simple to follow. But what you are seeing is just an already trained network doing inference. How was the network trained?</p><h3><strong>Neural Network Training</strong></h3><p>This will be a bit trickier, but we will go through it step by step, as we did for inference. It&#8217;s very important that we nail these building blocks with very basic networks, because from here things will get a lot more complex, so we need to be very clear on the fundamentals.</p><p>Let&#8217;s dive in.</p><p>For training a neural network, or any other kind of model, we generally need <strong>three datasets</strong>: <strong>training</strong>, <strong>validation</strong> and <strong>test</strong>. They are usually obtained by segmenting the total collection of observations of the reality we want to model, for instance: <strong>80%</strong> training, <strong>10%</strong> validation, <strong>10%</strong> test. For big data, <strong>90%</strong> training, <strong>5%</strong> validation, <strong>5%</strong> test is also common. 
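</p><p>Such a segmentation can be sketched as a shuffled 80/10/10 index split (the 100-element dataset here is just a stand-in):</p>

```python
import random

def split_dataset(data, train=0.8, val=0.1, seed=0):
    # Shuffle a copy so all three subsets are drawn from the same distribution
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Stand-in dataset of 100 observations
data = list(range(100))
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

<p>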
All three datasets must be sampled from the same distribution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P_{train}(x,y) \\approx P_{val}(x,y) \\approx P_{test}(x,y) \\approx P_{real}(x,y)&quot;,&quot;id&quot;:&quot;VVVQNEJPRA&quot;}" data-component-name="LatexBlockToDOM"></div><p>To put it another way: <strong>validation</strong> and <strong>test</strong> are <strong>&#8220;small versions of reality&#8221;</strong>;<br><strong>training</strong> is <strong>&#8220;what the model studies&#8221;</strong>. We must therefore strive to keep the diversity found globally in the studied reality in each and every one of the datasets; otherwise we&#8217;re modeling different &#8220;realities&#8221;, making the process unfruitful. The technique for this is called <em>stratified sampling</em>: a <strong>stratified sample</strong> is a way of splitting data so that <strong>each subset keeps the same structure as the full dataset</strong>. Instead of sampling randomly, we <strong>preserve the proportions of important groups</strong> (called <em>strata</em>).</p><p>The validation process is stateless: during validation the model is frozen and <strong>there are no weight updates</strong>; weights are updated only during training. Generally speaking, the validation dataset is used to tune the training algorithm&#8217;s hyperparameters and to decide on early stopping: you train the model on the training dataset with a specific hyperparameter configuration, then validate the resulting model against the validation dataset. From this comparison we calculate the <strong>validation loss</strong>, i.e., the average error the model makes on the <strong>validation dataset</strong>, which is used to estimate how well the model generalizes, because this dataset was not seen during training. 
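</p><p>Unlike a purely random split, a stratified split groups examples by label first, so every subset keeps the label proportions. A minimal pure-Python sketch (labels and counts are hypothetical):</p>

```python
import random
from collections import defaultdict

def stratified_split(examples, train=0.8, val=0.1, seed=0):
    """Split (x, y) pairs so each subset keeps the label proportions (strata)."""
    by_label = defaultdict(list)
    for x, y in examples:
        by_label[y].append((x, y))
    train_set, val_set, test_set = [], [], []
    rng = random.Random(seed)
    for label, group in by_label.items():
        rng.shuffle(group)
        n_train = int(len(group) * train)
        n_val = int(len(group) * val)
        train_set += group[:n_train]
        val_set += group[n_train:n_train + n_val]
        test_set += group[n_train + n_val:]
    return train_set, val_set, test_set

# Hypothetical 70 "pass" / 30 "fail" examples: each subset keeps the ~70/30 ratio
examples = [((i,), "pass") for i in range(70)] + [((i,), "fail") for i in range(30)]
tr, va, te = stratified_split(examples)
print(len(tr), len(va), len(te))  # 80 10 10
```

<p>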
The <strong>training loss</strong> is the same error, but is calculated over the <strong>training dataset</strong> and it is the quantity that is <strong>minimized during learning</strong>.</p><p>The training loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{train}(\\theta)\n= \\frac{1}{N_{train}} \\sum_{i=1}^{N_{train}} \n\\ell\\!\\left(y_i, f_\\theta(x_i)\\right)&quot;,&quot;id&quot;:&quot;NWTKRUVMPC&quot;}" data-component-name="LatexBlockToDOM"></div><p>The validation loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{val}(\\theta)\n= \\frac{1}{N_{val}} \\sum_{i=1}^{N_{val}} \n\\ell\\!\\left(y_i, f_\\theta(x_i)\\right)\n&quot;,&quot;id&quot;:&quot;BVEODVRIZJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ykFj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ykFj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 424w, https://substackcdn.com/image/fetch/$s_!ykFj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 848w, https://substackcdn.com/image/fetch/$s_!ykFj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ykFj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ykFj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png" width="404" height="167.40331491712706" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:300,&quot;width&quot;:724,&quot;resizeWidth&quot;:404,&quot;bytes&quot;:37712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ykFj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 424w, https://substackcdn.com/image/fetch/$s_!ykFj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 848w, 
https://substackcdn.com/image/fetch/$s_!ykFj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ykFj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d97d2da-dd0c-47d8-9661-fe80418e87e1_724x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>These are rather simple equations: we apply the loss function to <strong>the real observed value</strong> - the training target - and to <strong>the prediction of the network</strong> - that is, the output <strong>f(x)</strong> the network produces for an input <strong>x</strong> by performing the calculations you saw earlier in the inference example (we will see some loss functions in detail, particularly <strong>cross-entropy</strong> and <strong>mean squared error</strong> (MSE)). We then sum all those loss values and divide by the number of elements in the dataset to get the average.</p><p>We keep training, trying to reduce both the training and the validation losses, that is, trying to minimize the errors the model makes while modeling the reality we&#8217;re interested in. Training loss normally goes down as training progresses, but validation loss can go down or go up. If validation loss starts rising while training loss keeps falling, it&#8217;s a clear signal of <strong>overfitting</strong>: the model keeps reducing the loss over the training set but is no longer generalizing well to unseen data. It is overfitting the training set, modeling noise instead of signal. 
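</p><p>This divergence between the two loss curves is what training code typically watches for. A minimal sketch, with hypothetical per-epoch loss values:</p>

```python
def overfitting_epoch(train_losses, val_losses, patience=2):
    """Return the first epoch where validation loss has risen for
    `patience` consecutive epochs while training loss kept falling."""
    rising = 0
    for epoch in range(1, len(val_losses)):
        went_up = val_losses[epoch] > val_losses[epoch - 1]
        still_fitting = train_losses[epoch] < train_losses[epoch - 1]
        if went_up and still_fitting:
            rising += 1
            if rising >= patience:
                return epoch
        else:
            rising = 0
    return None

# Hypothetical curves: training loss keeps falling, validation turns around
train_curve = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33]
val_curve   = [0.92, 0.75, 0.62, 0.60, 0.64, 0.69]
print(overfitting_epoch(train_curve, val_curve))  # 5
```

<p>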
When this happens, the training process may stop early.</p><p>Once the model is trained and validated, it can finally be evaluated on the <strong>test</strong> dataset, a subset the model has never seen during the entire process, and conventional metrics are calculated: accuracy, precision, recall, F1-score, etc. The model training process ends here.</p><p>All of these concepts also apply to other machine learning algorithms - they&#8217;re not exclusive to neural networks - but they are necessary to have a full picture of the scope of work. Let&#8217;s now see how a neural network is actually trained, focusing on the core training mechanics: essentially, how weights get updated while processing a training dataset.</p><p>Going back to the same student problem we tackled before, consider this training dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j3ji!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j3ji!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 424w, https://substackcdn.com/image/fetch/$s_!j3ji!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 848w, https://substackcdn.com/image/fetch/$s_!j3ji!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 1272w, 
https://substackcdn.com/image/fetch/$s_!j3ji!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j3ji!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png" width="1456" height="505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j3ji!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 424w, https://substackcdn.com/image/fetch/$s_!j3ji!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 848w, 
https://substackcdn.com/image/fetch/$s_!j3ji!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 1272w, https://substackcdn.com/image/fetch/$s_!j3ji!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03eeb439-0f57-4f40-a866-5e7dfd744321_1464x508.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here we see quite a few observations of the reality we are trying to model. 
We see a student that doesn&#8217;t study much nor attends class (and inevitably fails the exam). We see an average student, an excellent student, and examples of both extremes: a student that studies but skips class and, vice-versa, one that attends but barely studies. Notice that all of the cases have observed targets: all have pass/fail outcomes. It&#8217;s as if this were a historical record of student behaviours and their respective exam outcomes. This is our observable reality. It&#8217;s by mining this list - by showing the network a comprehensive set of examples that describe, as completely as possible, the observable reality we want to model - that we effectively teach the network to discern the relation between hours of study, class attendance, and final exam outcome. With enough examples and training, the network will learn it automatically.</p><p>As we said earlier, to effectively train a model we also need validation and test datasets, but for now we will concentrate only on the core training algorithm and on how the training process uses the training dataset.</p><p>So the training process will iterate over every example of the training dataset multiple times. When a model has trained over the entire dataset - that is, has seen all the examples contained in the dataset once - we say it completed <strong>1 epoch</strong>. Training often involves <strong>multiple epochs</strong> because error minimization and weight updating is an iterative process. It&#8217;s as if the network were studying a book and read the whole book several times, not just once. 
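</p><p>The epoch bookkeeping can be sketched directly: one epoch means every example has been seen exactly once more (the dataset and the update step here are stand-ins):</p>

```python
from collections import Counter

def train(dataset, epochs):
    """Skeleton of the training loop: each epoch is one full pass over the data."""
    seen = Counter()
    for epoch in range(epochs):
        for example in dataset:
            # ... forward pass, loss, and weight update would happen here ...
            seen[example] += 1
    return seen

dataset = ["student_a", "student_b", "student_c"]
seen = train(dataset, epochs=4)
print(dict(seen))  # each of the 3 examples was seen exactly 4 times
```

<p>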
More epochs mean more learning, until the network starts memorizing, which is not good: it stops learning the underlying patterns and starts learning the data itself, which leads to overfitting and thus poor generalization.</p><p>Let&#8217;s start by picking the first example of the dataset, the poor student, who studied 2 hours/day and attended only 30% of classes, failing the exam. The normalized values are, respectively, x1 = 0.2 and x2 = 0.3:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2-4-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2-4-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 424w, https://substackcdn.com/image/fetch/$s_!2-4-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 848w, https://substackcdn.com/image/fetch/$s_!2-4-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!2-4-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2-4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png" width="1456" height="1279" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1279,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:262788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2-4-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 424w, https://substackcdn.com/image/fetch/$s_!2-4-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 848w, https://substackcdn.com/image/fetch/$s_!2-4-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2-4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7086a7ff-13a3-4e02-ba62-4db170cac8fa_1628x1430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now we do a <strong>forward pass</strong>, that is, we perform all the previous calculations we saw earlier as if we were inferencing, this time with <strong>random weights</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ViYM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ViYM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 424w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 848w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ViYM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png" width="1456" height="1282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1282,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:488496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ViYM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 424w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 848w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!ViYM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1968fc7-0e65-44f7-a21f-cd5e02269627_1626x1432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here the model predicted that a poor student would <strong>pass the final exam</strong>, because the network&#8217;s output was 0.533, that is &gt;= 0.5. 
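</p><p>The decision rule and the raw error for this example can be written out directly (the 0.533 prediction is taken from the forward pass above):</p>

```python
def decide(probability, threshold=0.5):
    # The sigmoid output is continuous; we binarize it at 0.5
    return "pass" if probability >= threshold else "fail"

prediction = 0.533   # network output for the poor student
target = 0.0         # observed outcome: the student failed

print(decide(prediction))             # pass (wrong!)
print(round(target - prediction, 3))  # -0.533, the error to be minimized
```

<p>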
But remember that the target score was 0, not 0.533, so the student, as observed, failed the final exam:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNOJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNOJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 424w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 848w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNOJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png" width="1456" height="1289" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1289,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:495525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FNOJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 424w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 848w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!FNOJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dff06ca-9c52-4a96-baa2-4015cf3f6a14_1620x1434.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
></svg></button></div></div></div></a></figure></div><p></p><p>So we have an error of -53.3%: the difference between the target value the network should&#8217;ve predicted, 0 (fail), and the predicted value, 53.3%. Now the idea is to <strong>update all the weights of the network in such a way that we minimize this error</strong>. 
In order to do that, we will compute the training loss as we saw earlier and use a cross-entropy function, in this case, <strong>binary cross-entropy</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}\n= -\\left[\ny \\log(\\hat{p}) + (1 - y)\\log(1 - \\hat{p})\n\\right]\n&quot;,&quot;id&quot;:&quot;ZKGPAACOAJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Binary cross-entropy loss is typically used in binary classification problems, but why use this function and not something else, and what is this thing really? Let&#8217;s take a look.</p><p>When we train a neural network for a <strong>binary classification problem</strong>, we are trying to answer a question that sounds extremely simple: <em>yes or no</em>. Did the student pass the exam or not? Is this email spam or not? Is this transaction fraudulent or legitimate?</p><p>Even though the final answer is binary, a neural network never works with hard &#8220;yes&#8221; or &#8220;no&#8221; decisions internally. Instead, it always produces <strong>continuous values</strong> (more on this in a bit). In the case of binary classification, the network usually outputs a number between 0 and 1. This number is best understood as a <strong>degree of belief</strong>. When the model outputs 0.9, it is saying &#8220;I believe there is a 90% chance this is a yes.&#8221; When it outputs 0.1, it is saying &#8220;I believe there is a 10% chance this is a yes.&#8221;</p><p>This is why the sigmoid function is so commonly used at the output layer: it guarantees that the network&#8217;s output stays between 0 and 1, which makes it naturally interpretable as a probability. However, there is an important distinction here. While the <strong>model outputs probabilities</strong>, the <strong>training data does not</strong>. The real world does not give us probabilities; it gives us <strong>outcomes</strong>. A student either passed or failed. 
An email is either spam or it isn&#8217;t. So the training labels are always discrete values: 0 or 1.</p><p>This type of situation is perfectly described by something called a <strong>Bernoulli experiment</strong>. A Bernoulli experiment is the simplest kind of random experiment imaginable. It has only two possible outcomes: success or failure, yes or no, 1 or 0. A Bernoulli distribution simply says that there is some probability <em>p</em> that the outcome is 1, and a probability <em>1 - p</em> that the outcome is 0. That&#8217;s all it is. In binary classification, the world generates Bernoulli outcomes, and the neural network&#8217;s job is to estimate the probability <em>p</em>.</p><p>At this point, we need a way to measure how good the model&#8217;s predictions are. This is where the concept of <strong>likelihood</strong> comes in. Likelihood answers a very intuitive question: <em>given what the model predicted, how plausible was what actually happened?</em> If a model predicts a 95% chance of passing and the student passes, that outcome feels very reasonable. If the model predicts a 2% chance of passing and the student passes, that outcome feels surprising. Likelihood is simply a way of quantifying that feeling of surprise.</p><p>During training, we want the model to make the observed outcomes as plausible as possible. In other words, we want to <strong>maximize the likelihood</strong> of the data under the model&#8217;s predictions. However, there is a practical issue: probabilities multiply when we look at many data points, and multiplying many small numbers quickly becomes numerically unstable. To solve this, we take the <strong>logarithm</strong> of the likelihood. <strong>Logarithms turn multiplications into additions</strong>, making the math more stable and easier to optimize. 
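</p>
<p>To see this concretely, here is a small Python illustration (mine, not from the original network): multiplying a thousand small probabilities underflows to zero, while summing their logs stays perfectly manageable:</p>

```python
import math

# 1000 independent observations, each assigned probability 0.01 by the model
probs = [0.01] * 1000

# Multiplying the raw probabilities underflows to exactly 0.0 in float64
likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)        # 0.0 -- the true value, 1e-2000, is far below float range

# Summing the logs keeps the same information as an ordinary number
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)    # about -4605.17 (= 1000 * log(0.01))
```

<p>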
There is another benefit: probabilities close to zero turn into very large negative numbers when we take the log, which means confident mistakes are heavily penalized.</p><p>Because most optimization algorithms are designed to <em>minimize</em> a quantity rather than maximize it, we simply negate the log-likelihood. This gives us what is called the <strong>negative log-likelihood</strong>. In plain language, negative log-likelihood measures how surprised the model is by reality. The more surprised it is, the larger the loss.</p><p>Now comes the key connection. When the outcomes follow a Bernoulli distribution &#8212; meaning the labels are 0 or 1 &#8212; the negative log-likelihood simplifies into a very specific mathematical expression. That expression is what we call <strong>binary cross-entropy</strong>. For a single data point, it looks like what we already showed:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}\n= -\\left[\ny \\log(\\hat{p}) + (1 - y)\\log(1 - \\hat{p})\n\\right]\n&quot;,&quot;id&quot;:&quot;GUETKBRPXV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This formula has a very intuitive interpretation. Let&#8217;s dissect it a bit; each symbol means:</p><ul><li><p><em>y</em> is the <strong>true label</strong> (either 0 or 1)</p><ul><li><p><em>y=1</em> means &#8220;YES / positive class&#8221;</p></li><li><p><em>y=0 </em>means &#8220;NO / negative class&#8221;</p></li></ul></li><li><p><em>p(hat)</em> is the model&#8217;s predicted <strong>probability that the label is 1</strong></p><ul><li><p>e.g. 
<em>p(hat)</em>&#8203;=0.90 means &#8220;90% chance it&#8217;s class 1&#8221;</p></li></ul></li></ul><p>Now let&#8217;s see how this formula behaves in the two possible cases.</p><p>If the true label is 1, that is, <em>y = 1,</em> the loss depends only on:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;-\\log(\\hat{p}) \n&quot;,&quot;id&quot;:&quot;ZYPFRKJYWP&quot;}" data-component-name="LatexBlockToDOM"></div><p>A high value of <em>p(hat)</em> means the log is close to 0, so -<em>log</em> is tiny, which is good. But if <em>p(hat) </em>is low, the log is a very large negative number, so -log is huge. <strong>You are punished if the model assigned a low probability to something that actually happened.</strong> </p><p>If the true label is 0, that is, <em>y = 0</em>, the loss depends on:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;-\\log(1 - \\hat{p})\n&quot;,&quot;id&quot;:&quot;BKQHTRVKGU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here the reverse happens. If <em>p(hat)</em> is low, the result will be tiny. If it&#8217;s high, the result will be huge, because 1 - <em>p(hat)</em> approaches zero and its log blows up toward negative infinity. You are punished if the model was confident in the wrong direction. <strong>Confident and correct predictions lead to very small losses, while confident and wrong predictions lead to very large losses</strong>. This is exactly the behavior we want when training a classifier.</p><p>A word about the word <strong>entropy</strong> in &#8220;cross-entropy&#8221;. It comes from information theory, where entropy measures uncertainty or unpredictability. Low entropy means outcomes are easy to predict; high entropy means they are unpredictable. The word <strong>cross</strong> is there because we are comparing two different probability distributions: the true distribution that generated the data, and the distribution predicted by the model. 
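</p>
<p>The two cases above are easy to verify numerically. A minimal sketch (the probabilities here are illustrative values of mine, not the network&#8217;s):</p>

```python
import math

def bce(y, p_hat):
    """Binary cross-entropy for one example with true label y in {0, 1}."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# True label is 1 (e.g. the student passed)
print(bce(1, 0.99))  # ~0.01  confident and correct -> tiny loss
print(bce(1, 0.50))  # ~0.69  unsure -> moderate loss
print(bce(1, 0.01))  # ~4.61  confident and wrong -> huge loss

# True label is 0 (the student failed): the behavior mirrors exactly
print(bce(0, 0.01))  # ~0.01
print(bce(0, 0.99))  # ~4.61
```

<p>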
Cross-entropy measures how inefficient it is to describe reality using the model&#8217;s probabilities instead of the true ones. Training a model by minimizing cross-entropy is equivalent to making the model&#8217;s predicted distribution match reality as closely as possible.</p><p>This is why binary cross-entropy is the standard loss function for binary classification. It matches the nature of the data, it trains the model to produce meaningful probabilities, it strongly penalizes confident mistakes, and it works perfectly with sigmoid outputs. In simple terms, binary cross-entropy teaches a model to be <strong>honest about uncertainty</strong> when the world only gives yes-or-no answers.</p><p>Now, remember when we said the outputs of the network are always continuous? Well, there&#8217;s another property that loss functions must satisfy: they must all be <strong>differentiable</strong>. But why?</p><p>Let&#8217;s first see what &#8220;differentiable&#8221; means: a function is differentiable if you can compute its derivative (slope) everywhere. Formally, this needs to exist:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{d}{dx} f(x)&quot;,&quot;id&quot;:&quot;MEQWGWTLCN&quot;}" data-component-name="LatexBlockToDOM"></div><p>That means:</p><ul><li><p>The function is <strong>smooth</strong></p></li><li><p>No sharp corners</p></li><li><p>No jumps</p></li><li><p>No points where the slope is undefined</p></li></ul><p>A brief note on derivatives. A <strong>derivative</strong> measures how fast a function changes at a particular point. Intuitively, it tells us how sensitive the output of a function is to small changes in its input. Geometrically, the derivative at a point corresponds to the <strong>slope of the tangent line</strong> to the curve at that point. The tangent line is the straight line that just touches the curve at a single point and best approximates the function near that point. 
While the curve may bend and change direction, the tangent line captures its instantaneous direction and steepness exactly at that location.</p><p>In the context of neural networks, the loss function depends on many parameters, and derivatives tell us how sensitive the loss is to each weight or bias. By computing these derivatives, we understand how small adjustments to parameters affect the model&#8217;s error, which is precisely the information needed to guide learning. In this sense, derivatives provide the local compass that allows optimization algorithms to navigate the complex landscape of neural network training.</p><p>Back to the &#8220;differentiable&#8221; concept.</p><p>The loss function needs to be differentiable because training employs the most common learning algorithm: <strong>gradient descent</strong>. It works by computing the derivative of the loss with respect to (<strong>wrt</strong>) each model parameter (or model weight; they&#8217;re the same thing) and then nudging those parameters in the direction that reduces the loss. If the loss function were not differentiable, there would be no reliable way to know whether a small change to a weight improves or worsens the model. 
Learning would simply not be possible.</p><p>Formally, gradient descent is defined as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta \\leftarrow \\theta - \\eta \\nabla_\\theta \\mathcal{L}&quot;,&quot;id&quot;:&quot;WTERLCTPXL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DsiM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DsiM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 424w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 848w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 1272w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DsiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png" width="1434" height="448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63332,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DsiM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 424w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 848w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 1272w, https://substackcdn.com/image/fetch/$s_!DsiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f5ea3e-3954-4aaf-b5f3-290ae04bad54_1434x448.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
></svg></button></div></div></div></a></figure></div><p>Great. But what is the &#8220;gradient&#8221;? The gradient is a vector of partial derivatives that tells you how much the loss changes when each weight changes. 
Formally:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta \\mathcal{L}\n=\n\\begin{bmatrix}\n\\frac{\\partial \\mathcal{L}}{\\partial w_1} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_2} \\\\\n\\vdots \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial b}\n\\end{bmatrix}&quot;,&quot;id&quot;:&quot;QTSCQEWRKQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each partial derivative tells you: &#8220;If I move a tiny bit in this coordinate direction, does the function go up or down?&#8221;</p><p>When you combine them into a vector, you get the <strong>direction that increases the function the fastest</strong>.</p><p>But in the gradient descent equation, notice the minus sign. The gradient points uphill, that is, the direction of the steepest increase:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta \\mathcal{L}&quot;,&quot;id&quot;:&quot;WUFFANRIPX&quot;}" data-component-name="LatexBlockToDOM"></div><p>But we want to <strong>go downhill</strong> (reduce the loss).</p><p>So we move <strong>in the opposite direction</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;- \\eta \\nabla_\\theta \\mathcal{L}&quot;,&quot;id&quot;:&quot;LHOGHWDOOD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>To give you a clearer mental image, imagine you&#8217;re on a foggy mountain and want to reach the valley.</p><ul><li><p>You can&#8217;t see the whole landscape</p></li><li><p>You only feel the slope under your feet</p></li></ul><p>So you:</p><ol><li><p>Feel which way goes downhill (<strong>gradient</strong>)</p></li><li><p>Take a small step that way (<strong>learning rate</strong>, <em><strong>&#951;</strong></em>)</p></li><li><p>Repeat</p></li></ol><p>Eventually you reach the bottom. 
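</p>
<p>Those three steps translate almost literally into code. A toy sketch (my own, not from the article) that descends the one-dimensional bowl L(&#952;) = &#952;&#178;, whose gradient is 2&#952;:</p>

```python
# Minimize L(theta) = theta**2, whose gradient is 2 * theta
theta = 5.0  # start somewhere up the hillside
eta = 0.1    # learning rate: how big each step is

for step in range(100):
    grad = 2 * theta            # 1. feel the slope under your feet
    theta = theta - eta * grad  # 2. take a small step downhill
                                # 3. repeat

print(theta)  # very close to 0.0: the bottom of the bowl
```

<p>Each update multiplies &#952; by 0.8, so the iterate slides geometrically toward the minimum at 0.</p>
<p>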
And this works because the loss function is:</p><ul><li><p>Smooth</p></li><li><p>Differentiable</p></li><li><p>Shaped, at least locally, like a bowl</p></li></ul><p>We call the process of calculating the gradient <strong>backpropagation</strong>, and we call <strong>gradient descent</strong> the rule by which we actually update the weights using the gradient vector.</p><p>A neural network is a function with thousands or millions of parameters. The loss is therefore a function of many variables. Training the network means computing how sensitive the loss is to each parameter, that is, finding the partial derivatives of the loss function with respect to each weight and forming the gradient vector, which tells us how to move in parameter space to reduce the loss.</p><p>The way backpropagation helps achieve this is by applying a rule called the <strong>chain rule</strong>, which intuitively works like this: if something depends on something else, which depends on something else, then the rate of change is the product of all intermediate changes. Or in other words, the chain rule says that when a quantity depends on another through a sequence of transformations, <strong>its derivative is the product of the derivatives of each transformation</strong>. Neural networks are nothing more than long chains of functions, so backpropagation is simply the chain rule applied repeatedly.</p><p>But how do you actually calculate the gradient vector? 
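</p>
<p>Before the worked example, here is the chain rule on a tiny two-step chain, differentiated analytically and then checked with a finite difference (an illustrative sketch; the function and numbers are mine):</p>

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# A two-step chain: x -> z = 3 * x -> y = sigmoid(z)
x = 0.7
y = sigmoid(3 * x)

# Chain rule: dy/dx = dy/dz * dz/dx = sigmoid'(z) * 3, with sigmoid'(z) = y * (1 - y)
analytic = y * (1 - y) * 3

# Numerical check with a central finite difference
eps = 1e-6
numeric = (sigmoid(3 * (x + eps)) - sigmoid(3 * (x - eps))) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places
```

<p>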
Let&#8217;s continue.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!509F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!509F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 424w, https://substackcdn.com/image/fetch/$s_!509F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 848w, https://substackcdn.com/image/fetch/$s_!509F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!509F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!509F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png" width="1456" height="1280" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:577169,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!509F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 424w, https://substackcdn.com/image/fetch/$s_!509F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 848w, https://substackcdn.com/image/fetch/$s_!509F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!509F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4cd645-7f07-4b84-b320-618ba325e6bc_1668x1466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
></svg></button></div></div></div></a></figure></div><p>As we were saying, the target is <strong>0</strong> and the predicted value is 0.533. So we substitute those two numbers, <strong>y </strong>and <strong>p(hat)</strong>, into the binary cross-entropy equation for the case where the true label is 0:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}\n= -\\log(1 - \\hat{p})\n= -\\log(1 - 0.533)\n\\approx -\\log(0.467)\n\\approx 0.761\n&quot;,&quot;id&quot;:&quot;UCVZWFZONK&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is the loss value. 
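</p>
<p>You can reproduce this arithmetic in a couple of lines (a quick check of mine, not part of the original network code):</p>

```python
import math

y = 0          # true label: the student actually failed
p_hat = 0.533  # the network's predicted probability of passing

# Binary cross-entropy; with y = 0 only the second term survives
loss = -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))
print(round(loss, 3))  # 0.761
```

<p>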
For sigmoid output with BCE, the derivative wrt the <strong>logit</strong> (that is, the raw score before sigmoid activation) is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_y \\equiv \\frac{\\partial \\mathcal{L}}{\\partial z_y}\n= \\hat{p} - y\n&quot;,&quot;id&quot;:&quot;VNRQFSAEVW&quot;}" data-component-name="LatexBlockToDOM"></div><p>So:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_y = 0.533 - 0 = 0.533&quot;,&quot;id&quot;:&quot;RJGERBEEIJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This number is the &#8220;error signal&#8221; at the output.</p><p>What we do now is calculate, using the chain rule, the derivatives along the whole forward dependency chain: from the weights feeding the hidden nodes, through the hidden nodes and their activations, the output weights, and the prediction, all the way to the loss function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w \\;\\rightarrow\\; z_h \\;\\rightarrow\\; h \\;\\rightarrow\\; z_y \\;\\rightarrow\\; \\hat{p} \\;\\rightarrow\\; \\mathcal{L}\n&quot;,&quot;id&quot;:&quot;BSJFSLPRJY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Formally:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w}\n=\n\\frac{\\partial \\mathcal{L}}{\\partial \\hat{p}}\n\\cdot\n\\frac{\\partial \\hat{p}}{\\partial z_y}\n\\cdot\n\\frac{\\partial z_y}{\\partial h}\n\\cdot\n\\frac{\\partial h}{\\partial z_h}\n\\cdot\n\\frac{\\partial z_h}{\\partial w}&quot;,&quot;id&quot;:&quot;NNIOFWCAXM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So let&#8217;s begin by calculating the gradients for the weights that connect the hidden layer&#8217;s nodes <em>h1</em>, <em>h2</em> and <em>h3,</em> and the output.</p><p>For each output weight, the derivative is:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{h_j \\to y}}\n= \\delta_y \\cdot h_j&quot;,&quot;id&quot;:&quot;OGGWDJNBRH&quot;}" data-component-name="LatexBlockToDOM"></div><p>And for the output <em>bias</em>, the derivative is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial b_y}\n= \\delta_y\n&quot;,&quot;id&quot;:&quot;HNDFTFVGCO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So the calculated gradients for all the hidden &#8594; output nodes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{h_1 \\to y}}\n= 0.533 \\cdot 0.064\n= 0.034112\n&quot;,&quot;id&quot;:&quot;MJGOOKYMMX&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{h_2 \\to y}}\n= 0.533 \\cdot 0.000\n= 0\n&quot;,&quot;id&quot;:&quot;UUQDYTIJLB&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{h_3 \\to y}}\n= 0.533 \\cdot 0.074\n= 0.039442\n&quot;,&quot;id&quot;:&quot;PLLTMKETLH&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>We now calculate the gradients for hidden nodes. Don&#8217;t forget that hidden nodes have a ReLU activation function. 
The first derivative therefore includes it:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_{h_j}\n\\equiv\n\\frac{\\partial \\mathcal{L}}{\\partial z_{h_j}}\n=\n\\delta_y \\cdot w_{h_j \\to y} \\cdot \\mathrm{ReLU}'(z_{h_j})\n&quot;,&quot;id&quot;:&quot;XJMVELNVSG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The ReLU derivative being:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathrm{ReLU}'(z) =\n\\begin{cases}\n1, &amp; z > 0 \\\\\n0, &amp; z \\le 0\n\\end{cases}\n&quot;,&quot;id&quot;:&quot;PMKCMBSODR&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>The ReLU derivative for each hidden-layer neuron evaluates as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_{h_1} = 0.064 > 0\n\\;\\Rightarrow\\;\n\\mathrm{ReLU}'(z_{h_1}) = 1\n&quot;,&quot;id&quot;:&quot;IOJDQOGRIQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_{h_2} = -0.061 \\le 0\n\\;\\Rightarrow\\;\n\\mathrm{ReLU}'(z_{h_2}) = 0\n&quot;,&quot;id&quot;:&quot;KRYYNQUOZJ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_{h_3} = 0.074 > 0\n\\;\\Rightarrow\\;\n\\mathrm{ReLU}'(z_{h_3}) = 1\n&quot;,&quot;id&quot;:&quot;WSFHUOHKFU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Gradient values for the hidden nodes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_{h_1}\n= 0.533 \\cdot 0.20 \\cdot 1\n= 0.1066\n&quot;,&quot;id&quot;:&quot;IRTYJTQAWU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_{h_2}\n= 0.533 \\cdot (-0.15) \\cdot 0\n= 0&quot;,&quot;id&quot;:&quot;ICCFSBYHCW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_{h_3}\n= 0.533 \\cdot 0.25 \\cdot 1\n= 0.13325\n&quot;,&quot;id&quot;:&quot;UMAVUBKEAX&quot;}" data-component-name="LatexBlockToDOM"></div><p>We now finally calculate the gradients for all the weights that connect the input nodes to the hidden layer&#8217;s nodes.</p><p>The derivative for the input weights is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{x_i \\to h_j}}\n=\n\\delta_{h_j} \\cdot x_i\n&quot;,&quot;id&quot;:&quot;QLLWAVASHH&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>And for the hidden nodes&#8217; biases:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial b_{h_j}}\n=\n\\delta_{h_j}&quot;,&quot;id&quot;:&quot;NTNEPNQMKS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So the gradients for the weights connecting to the hidden neuron <em>h1</em> are calculated like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{x_1 \\to h_1}}\n= 0.1066 \\cdot 0.2\n= 0.02132&quot;,&quot;id&quot;:&quot;CIMMUUKWLZ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{x_2 \\to h_1}}\n= 0.1066 \\cdot 0.3\n= 0.03198\n&quot;,&quot;id&quot;:&quot;DKBZGSIKWF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial b_{h_1}}\n= 0.1066\n&quot;,&quot;id&quot;:&quot;CQJRAZCZBS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>And the gradients for the weights connecting to the hidden neuron <em>h2</em> are calculated like this:</p><div class="latex-rendered"
data-attrs="{&quot;persistentExpression&quot;:&quot;\\delta_{h_2} = 0\n\\quad\\Rightarrow\\quad\n\\text{all gradients for } h_2 \\text{ are zero}\n&quot;,&quot;id&quot;:&quot;RHIMCGPHOO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Finally, the gradients for the weights connecting to the hidden neuron <em>h3</em> are calculated like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{x_1 \\to h_3}}\n= 0.13325 \\cdot 0.2\n= 0.02665\n&quot;,&quot;id&quot;:&quot;HUDZASOGJX&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial w_{x_2 \\to h_3}}\n= 0.13325 \\cdot 0.3\n= 0.039975\n&quot;,&quot;id&quot;:&quot;GSRHLKLCAI&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial b_{h_3}}\n= 0.13325&quot;,&quot;id&quot;:&quot;LFLHBYZQVG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Having calculated all the gradients, we can now perform the gradient descent update on all weights.</p><p>Recall that the gradient descent update rule is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta \\leftarrow \\theta - \\eta \\nabla_\\theta \\mathcal{L}&quot;,&quot;id&quot;:&quot;PGAEQQLUDU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So each weight updates like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w \\leftarrow w - \\eta \\frac{\\partial \\mathcal{L}}{\\partial w}\n&quot;,&quot;id&quot;:&quot;WMNFJUXZJS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>With <strong>a learning rate of 0.5</strong>, here are the hidden nodes &#8594; output weight updates:</p><div class="latex-rendered"
data-attrs="{&quot;persistentExpression&quot;:&quot;w_{h_1 \\to y}\n= 0.20 - 0.5 \\cdot 0.034112\n= 0.20 - 0.017056\n= 0.182944\n&quot;,&quot;id&quot;:&quot;UIRWWVUHOG&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{h_2 \\to y} = -0.15\n&quot;,&quot;id&quot;:&quot;NZZJWCCOAZ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{h_3 \\to y}\n= 0.25 - 0.5 \\cdot 0.039442\n= 0.25 - 0.019721\n= 0.230279\n&quot;,&quot;id&quot;:&quot;GJLSCUGFNW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b_y = 0.10 - 0.5 \\cdot 0.533\n= 0.10 - 0.2665\n= -0.1665&quot;,&quot;id&quot;:&quot;XTSWTVPQZL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Now, for the hidden node <em>h1</em> updates:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_1 \\to h_1}\n= 0.15 - 0.5 \\cdot 0.02132\n= 0.15 - 0.01066\n= 0.13934\n&quot;,&quot;id&quot;:&quot;NXVRGQHJWK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_2 \\to h_1}\n= -0.22 - 0.5 \\cdot 0.03198\n= -0.22 - 0.01599\n= -0.23599\n&quot;,&quot;id&quot;:&quot;YDPIIZPKRO&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b_{h_1}\n= 0.10 - 0.5 \\cdot 0.1066\n= 0.10 - 0.0533\n= 0.0467\n\n&quot;,&quot;id&quot;:&quot;LZUJCXBVCD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The hidden node <em>h2</em> receives no updates, since all of its gradients are zero:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_1 \\to h_2} = -0.18&quot;,&quot;id&quot;:&quot;EZIAZKMVRC&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered"
data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_2 \\to h_2} = 0.25\n&quot;,&quot;id&quot;:&quot;GENMMMIEGU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b_{h_2} = -0.10\n&quot;,&quot;id&quot;:&quot;HXZXWYGVYX&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>And finally, the hidden node <em>h3</em> updates:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_1 \\to h_3}\n= 0.30 - 0.5 \\cdot 0.02665\n= 0.30 - 0.013325\n= 0.286675\n\n&quot;,&quot;id&quot;:&quot;TZWONIAMAI&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_{x_2 \\to h_3}\n= -0.12 - 0.5 \\cdot 0.039975\n= -0.12 - 0.0199875\n= -0.1399875\n\n&quot;,&quot;id&quot;:&quot;OJBRVCPCUG&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b_{h_3}\n= 0.05 - 0.5 \\cdot 0.13325\n= 0.05 - 0.066625\n= -0.016625\n\n&quot;,&quot;id&quot;:&quot;NEDXAWAHLK&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In fact, the gradient vector is simply all of these individual gradients stacked together.
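</p><p>As a cross-check, here is a minimal NumPy sketch of this whole backward pass and update. This is an illustration, not code from the post: the variable names are mine, and the value delta_y = 0.533 computed in the earlier steps is taken as given.</p>

```python
import numpy as np

# Values from the worked example: first sample x = (0.2, 0.3),
# the initial weights and biases, and delta_y = dL/dz_y = 0.533.
x = np.array([0.2, 0.3])
W1 = np.array([[0.15, -0.22],    # input -> h1
               [-0.18, 0.25],    # input -> h2
               [0.30, -0.12]])   # input -> h3
b1 = np.array([0.10, -0.10, 0.05])
w2 = np.array([0.20, -0.15, 0.25])   # hidden -> output
b2 = 0.10
delta_y = 0.533

z_h = W1 @ x + b1                    # pre-activations [0.064, -0.061, 0.074]
delta_h = delta_y * w2 * (z_h > 0)   # delta_y * w_{h->y} * ReLU'(z_h)

grad_w2 = delta_y * np.maximum(z_h, 0.0)   # dL/dw_{h->y} = delta_y * a_h
grad_W1 = np.outer(delta_h, x)             # dL/dw_{x->h} = delta_h * x_i

eta = 0.5                  # learning rate
W1 -= eta * grad_W1        # gradient descent step on every parameter
b1 -= eta * delta_h        # dL/db_h = delta_h
w2 -= eta * grad_w2
b2 -= eta * delta_y        # dL/db_y = delta_y
```

<p>Running this reproduces the hand-computed updates: the h1 &#8594; y weight becomes 0.182944, and h2&#8217;s incoming weights stay untouched.</p><p>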
Formally:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_{\\theta}\\mathcal{L}\n=\n\\begin{bmatrix}\n\\frac{\\partial \\mathcal{L}}{\\partial w_{h_1 \\to y}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{h_2 \\to y}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{h_3 \\to y}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial b_y} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_1 \\to h_1}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_2 \\to h_1}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial b_{h_1}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_1 \\to h_2}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_2 \\to h_2}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial b_{h_2}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_1 \\to h_3}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial w_{x_2 \\to h_3}} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial b_{h_3}}\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;LJVHKPLEIY&quot;}" data-component-name="LatexBlockToDOM"></div><p>And with filled in values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_{\\theta}\\mathcal{L}\n=\n\\begin{bmatrix}\n0.034112 \\\\\n0 \\\\\n0.039442 \\\\\n0.533 \\\\\n0.02132 \\\\\n0.03198 \\\\\n0.1066 \\\\\n0 \\\\\n0 \\\\\n0 \\\\\n0.02665 \\\\\n0.039975 \\\\\n0.13325\n\\end{bmatrix}&quot;,&quot;id&quot;:&quot;EFRKCAWZYG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So the gradient descent in vector form is like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta_{\\text{new}}\n=\n\\theta_{\\text{old}}\n-\n\\eta\n\\begin{bmatrix}\n0.034112 \\\\\n0 \\\\\n0.039442 \\\\\n0.533 \\\\\n0.02132 \\\\\n0.03198 \\\\\n0.1066 \\\\\n0 \\\\\n0 \\\\\n0 \\\\\n0.02665 \\\\\n0.039975 \\\\\n0.13325\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;GEXKLOSVSV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Whew, and finally after all these 
calculations, the weights are all updated with their respective new values. Now, when the training process picks up a second sample - like the one below, an &#8220;average student&#8221; with <strong>x1 = 0.5</strong> and <strong>x2 = 0.6</strong> - the forward pass of this second sample already runs with the weights updated by the first.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fi4n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fi4n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 424w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 848w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fi4n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png" width="1456" height="1285"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1285,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:515233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.dimensionalityreduction.com/i/182271606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fi4n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 424w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 848w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!Fi4n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4994e148-912e-4383-a5eb-e6e6e10a4d96_1656x1462.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you look closely and compare with the weights at the beginning, they&#8217;ve changed! Each weight was updated in proportion to how much it contributed to the error. Active neurons update their weights, inactive neurons remain unchanged, and the model moves step by step toward lower loss.</p><p>From here the network would continue training, iterating over all the available samples, sometimes more than once - that is, for more than <strong>1 epoch</strong>.</p><p>In theory, gradient descent computes the gradient using the entire training dataset. In practice, this is computationally expensive. <strong>Stochastic Gradient Descent</strong> (SGD) approximates the true gradient by computing it on a small random subset of the data.
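</p><p>As a sketch of one such training loop - using a hypothetical 1-D linear model and made-up data, with all names my own - a mini-batch SGD run looks like this:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: a million points drawn from y = 2x + 1 plus noise.
X = rng.normal(size=1_000_000)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=1_000_000)

w, b = 0.0, 0.0      # parameters of a 1-D linear model
eta = 0.1            # learning rate
batch_size = 128

for step in range(2000):
    # Draw a random mini-batch (the "stochastic" part)...
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # ...average the per-sample gradients of the squared error...
    grad_w = 2.0 * np.mean(err * xb)
    grad_b = 2.0 * np.mean(err)
    # ...and only then update the weights.
    w -= eta * grad_w
    b -= eta * grad_b
```

<p>After a couple of thousand noisy steps, <em>w</em> and <em>b</em> end up close to 2 and 1, even though each step only ever touches 128 of the million samples.</p><p>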
If you have a million data points in your training dataset, you first pick a batch size - say, 128. At each step the algorithm draws 128 samples at random from the entire dataset (that&#8217;s why it&#8217;s called stochastic) and averages their gradients; only then does it update the weights. This introduces noise into the optimization process, but dramatically speeds up training and often improves generalization. Most modern neural networks are trained using mini-batch SGD or adaptive variants such as Adam.</p><p>Well, and that&#8217;s it for Part 1. This post about LLMs from scratch doesn&#8217;t contain many LLM specifics yet, but these basic ANN building blocks are crucial for understanding what lies ahead. Next time we will take a deep dive into the kinds of ANNs that LLMs are built on and how they are effectively trained.</p><p>Until next time!</p>]]></content:encoded></item></channel></rss>