Skip to content

09. Artificial Intelligence, Machine Learning, Deep Learning

Nilufar Ismayilova, Rumiyya Alili, Ismayil Shahaliyev
Nov 29, 2025    Jan 31, 2026

Artificial Intelligence (AI) refers to the field concerned with building computer systems that exhibit intelligent behavior. An AI system is one that can reason, plan, make decisions, solve problems, or act purposefully in an environment in pursuit of goals. The defining feature of AI is what the system does, not how it is implemented. AI systems do not need to learn from data: classical rule-based expert systems, symbolic logic engines, search and planning algorithms, and game-playing systems based on handcrafted rules are all examples of AI. Learning may be used in AI, but it is not required.

How deep learning, machine learning and artificial intelligence are related.
How deep learning, machine learning and artificial intelligence are related. By Original file: Avimanyu786SVG version: Tukijaaliwa - File:AI-ML-DL.png, CC BY-SA 4.0, Link

Machine Learning is a methodological approach in which systems learn patterns or functions from data instead of being programmed with explicit rules. The central idea of machine learning is performance improvement through experience. Given examples, the system infers statistical regularities that allow it to make predictions or decisions on new data. Many machine learning systems perform narrow tasks such as regression, classification, clustering, or recommendation without reasoning, planning, or goal-directed behavior.

Deep Learning is a specific class of machine learning methods based on multi-layer neural networks. Deep learning is especially effective for perceptual tasks such as image recognition, speech processing, and natural language understanding because it can automatically learn features directly from raw inputs. Deep learning is neither synonymous with intelligence nor with learning in general; it is a particular computational technique for large-scale function approximation within machine learning.

Note

Imagine an online clothing store that wants to show each customer personalized outfit suggestions. Instead of manually programming rules such as "if user clicks blue shirts, recommend jeans", the system can collect thousands of users' behaviors: items they viewed, purchased, returned, or saved to wishlist. Machine learning models then discover patterns, for example, customers who prefer minimalistic styles often buy neutral-colored items. Deep learning models can even analyze product images and understand patterns like "striped," "oversized," or "casual." As a result, the AI system learns to automatically recommend items that match each person's style, even if no human explicitly defined the rules.

Tip

You can read more about deep learning on the website dedicated to the CSCI 4701: Deep Learning course taught at ADA University.

Linear Regression Algorithm

Training a machine learning model
Training a machine learning model.

Linear Regression is one of the simplest machine learning algorithms. We assume that there is a linear relationship between the observed data and the prediction we wish to make. In that case, we can model the relationship between two variables by finding a straight line that best fits the data. The goal is to predict an output \(ŷ\) based on an input \(x\) using the function: $$ \hat{y} = wx + b $$

Where \(x\) is the observed known input, \(w\) is the weight (slope) and \(b\) is the bias (intercept) of the function. The task of machine learning is to find the values of \(w\) and \(b\) that make the function's outputs match the ground-truth value (label) as closely as possible. Learnable values of a function are called model parameters. For comparison, GPT-3 model has 175 billion parameters, which were learned by training on a very large corpus of text collected from the World Wide Web and other publicly available sources.

Note

Suppose a company wants to predict a student's final exam score based on the number of hours they studied. It collects real data from previous students:

Study Hours (x) 5 10 13 20
Final Score (y) 45 55 82 94

Let's say we provide random values for \(w\) and \(b\) to be \(4.3\) and \(12\). Then our model will make predictions according to the function: $$ \hat{y} = 4.3x + 12 $$ Then a student who studies \(19\) hours is predicted to score: $$ 4.3(19) + 12 = 93.7 $$ The function our model uses is close to reality: we haven't observed any student studying for \(19\) hours before (we don't have it in our table), yet our model predicted a close performance similar to the observed student, who studied for \(20\) hours with an observed ground-truth label of \(94\) exam score.

Root Mean Squared Error (RMSE)

When training a machine learning model, we need a way to measure how well it is performing. This measurement is called the loss or error, and it tells us how far our predictions are from the actual values. The goal of training is to minimize this loss. However, calculating error is not as simple as it might seem. Consider this problem with this example dataset:

Study Hours (x) Actual Score [label] (y) Predicted Score (ŷ) Error (y − ŷ)
5 50 45 5
10 55 55 0
20 90 95 −5

We calculated individual errors of our predictions by simply subtracting predicted scores from ground-truth values to see how much our model's output diverges from reality. But how can we measure average quality of our model's predictive power? A simple approach is to find arithmetic mean: $$ \frac{5 + 0 + (-5)}{3} = 0 $$ The mean error is zero, suggesting our model is perfect. But clearly, the model made errors on two predictions.The problem is that positive and negative errors cancel each other out.

Note

How would you avoid negative values cancelling each other out during averaging?

To avoid this cancellation problem, we can take the absolute value of each error before averaging. Mean Absolute Error (MAE) will have the following solution: $$ \mathrm{MAE} = \frac{\lvert 5 \rvert + \lvert 0 \rvert + \lvert -5 \rvert}{3} = \frac{10}{3} = 3.33 $$

This is better because now we don't hide errors through cancellation. However, MAE treats all errors equally. A 10-point error and a 2-point error contribute proportionally to the total loss. To penalize large errors more heavily, we can square each error before averaging. Mean Squared Error (MSE) will have the following solution: $$ \text{MSE} = \frac{5^2 + 0^2 + (-5)^2}{3} = \frac{25 + 0 + 25}{3} = 16.67 $$

By squaring, large errors become disproportionately larger: an error of 2 becomes 4 (doubled), an error of 10 becomes 100 (10× larger). This encourages the model to avoid making big mistakes. A problem with MSE, however, is that the units are squared. If we are predicting "scores", MSE gives us "scores²", which is not intuitive. To bring the error back to the original units, we take the square root: Root Mean Squared Error (RMSE) will have the following solution: $$ RMSE = \sqrt MSE = \sqrt 16.67 = 4.08 $$ Now our error is back in the original units, making it easier to interpret.

Artificial Neural Networks

Artificial Neural Networks (ANNs) are inspired by how biological neurons in the human brain process information. Just like real neurons fire (get activated) and send electrical signals to the connected neurons, artificial neurons pass numerical values forward through the network.

Biological Neuron.
Biological Neuron. By User:Dhp1080 - "Anatomy and Physiology" by the US National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) Program ., CC BY-SA 3.0, Link

To understand neural networks, it is helpful to start with a single neuron. An artificial neuron is a simple mathematical function. It takes one or more input values, multiplies each input by a corresponding weight, adds a bias, and produces an output. In its basic form, this is exactly a linear function with n inputs (e.g. study hours, amount of sleep, etc). $$ z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b $$ This linear combination is then passed through a nonlinear function called an activation function. For example, using the logistic (sigmoid) function, the final prediction of the neuron is: $$ \hat{y} = \frac{1}{1 + e^{-z}} $$ Note that we only did one additional step. Without the sigmoid function, our predictions would be based on a linear formula. We simply assume that our data is complicated or our data has a non-linear relationship between its inputs and output. It can also be noted that as the output of the linear function \(z\) approaches positive infinity, our prediction will approach 1, and in case of negative infinity, \(0\). Hence, the sigmoid function is limited to be in the range of [0,1] making our prediction probabilistic.

Artificial Neuron.
Artificial Neuron. By Funcs - Own work, CC0, Link

A neural network is simply a collection (network) of such neurons. In its simplest form, multilayer perceptron is when an output of one neuron is an input of a neuron in the next layer. In the figure, the described neural network has three layers with nine neurons in total, where the output layer neurons make two predictions (e.g. probability of seeing a "cat", as well as a "dog" in the provided input image).

Neural Network.
Neural Network. By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, Link

Note

Suppose we want to identify handwritten numbers (0–9) from images. The network has 10 output neurons, one for each digit. The neuron with the highest probability is the network's prediction.

Digit 0 1 2 3 4 5 6 7 8 9
Probability 0.01 0.02 0.01 0.00 0.01 0.10 0.82 0.01 0.01 0.01

The model predicts the image is a 6 with 82% confidence.

Matrix Multiplication

Matrix multiplication is one of the most important mathematical operations in machine/deep learning because it is the basic way neural networks process and transform information. A matrix is simply a rectangular grid of numbers, and multiplying two matrices means combining the rows of the first matrix with the columns of the second to produce new values. In a neural network, every layer uses matrices to store the weights that connect one layer of neurons to the next. When data flows through the network, the input values are represented as a vector, and this vector is multiplied by the weight matrix of the layer. The result of this multiplication is a new vector that represents the layer's output.

All of these transformations, often millions of them in a single deep network, depend on matrix multiplication. Because neural networks require so many large matrix operations, hardware that can perform parallel computation is extremely valuable. This is why Graphics Processing Units (GPU) are widely used in machine/deep learning. Unlike CPUs, which process information sequentially, GPUs can perform thousands of small multiplications at the same time, making them ideal for the heavy matrix workloads found in modern AI systems.

Note

Suppose an image classifier needs to determine whether an image contains a cat or a dog. Each image is converted into a vector (list of numbers). If the input image has three features (simplified for explanation): $$ X = \begin{bmatrix} 0.2 \\ 0.9 \\ 0.4 \end{bmatrix} $$ And the neural network layer has 3 neurons with weights: $$ W = \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.9 & 0.3 & 0.5 \\ 0.1 & 0.8 & 0.6 \end{bmatrix} $$

Then the output is simply: $$ W \times X = \begin{bmatrix} 0.5(0.2) + 0.2(0.9) + 0.1(0.4) \\ 0.9(0.2) + 0.3(0.9) + 0.5(0.4) \\ 0.1(0.2) + 0.8(0.9) + 0.6(0.4) \end{bmatrix} $$

The first neuron output: \(0.1 + 0.18 + 0.04 = 0.32\)
The second neuron output: \(0.18 + 0.27 + 0.20 = 0.65\)
The third neuron output: \(0.02 + 0.72 + 0.24 = 0.98\)

So the final output vector is: $$ Y = \begin{bmatrix} 0.32 \\ 0.65 \\ 0.98 \end{bmatrix} $$

These resulting numbers feed into activation functions and ultimately decide the probability it is a cat or a dog. This multiplication happens millions of times per second, which is why GPUs are essential.

Tip

For further information, see the content dedicated to the linear algebra.

Large Language Models (LLM)

Large Language Models (LLMs), such as GPT-based systems, are advanced deep learning models trained on extremely large collections of text. Through this training, they learn statistical patterns in how words, sentences, etc. relate to each other. Because of this, they can generate coherent text, answer questions, summarize long documents, translate between languages, and even write computer code. These abilities come from recognizing patterns across billions of examples, rather than being explicitly programmed with rules.

LLMs are built using a special neural network architecture called the Transformer. The key component of this architecture is the attention mechanism, which enables the model to process all parts of the input in parallel. Instead of reading text sequentially from left to right, the model considers the entire sentences at once and computes how strongly each word is related to every other word. Because all words are processed simultaneously, transformers can efficiently model long-range dependencies without relying on step-by-step recurrence. This parallel structure makes training much faster on modern hardware and allows the model to capture relationships between distant words that would be difficult to retain in sequential models. As a result, large language models can represent context more accurately and generate coherent, context-aware text over long sequences.

Note

Consider a customer service chatbot for an internet provider. Customers commonly ask: "My Wi-Fi is slow.", "Why is my internet disconnecting?", "How do I change my password?"

The company trains an LLM on: - thousands of past customer chat logs - network troubleshooting guides - modem/router manuals - internal support responses

When a customer types: "My internet randomly disconnects every hour", the sentence is first converted into a sequence of numerical vectors called token embeddings. Each token (such as internet, disconnects, hour) is represented as a high-dimensional vector (list of numbers).

These vectors are then processed in parallel by multiple attention layers. In each layer, the model computes how strongly every token relates to every other token using matrix operations. For example, the token disconnects may receive high attention weights from internet and hour, indicating a meaningful statistical relationship learned during training.

Finally, based on these transformed representations, the model computes a probability distribution over the next possible words and generates a response token by token. The output appears coherent because the model has learned, from large-scale text data, which word sequences are statistically likely to follow similar inputs.

Nowadays, AI is often treated as synonymous with neural network-based models (deep learning). This identification is misleading. Deep learning is a powerful tool within AI, not a definition of AI itself. Long before neural networks dominated the field, many AI systems were built using symbolic reasoning, search, and optimization methods that involve no learning from data at all.

Examples include constraint satisfaction problems (CSPs), where the goal is to find assignments that satisfy a set of logical constraints; minimax search, used in game-playing agents to choose optimal moves by exploring possible future states; and reinforcement learning methods (such as Q-learning), which learn action values through interaction with an environment without relying on deep neural networks. These approaches remain central to AI and are still widely used in planning, scheduling, verification, and control.

CSP Backtracking Algorithm

Constraint Satisfaction Problems (CSPs) require assigning values to a set of variables in a way that satisfies all given rules. Each variable has a domain of possible values, and constraints specify which combinations of assignments are allowed. Backtracking is a systematic method for exploring all possible assignments. It works by choosing a variable, assigning it a value, and immediately checking whether this partial assignment violates any constraints. If the assignment is still valid, the algorithm moves to the next variable. If a violation appears at any point, the algorithm reverses its last step, removes the invalid value, and tries a different one.

This process continues until either a complete valid assignment is found or all options have been exhausted. Backtracking is efficient because it avoids exploring entire branches of the search space as soon as a conflict is detected, instead of waiting until the end.

Note

Consider the classic Sudoku puzzle. Each empty cell must contain a number from 1-9 such that:

  • No duplicate in the same row
  • No duplicate in the same column
  • No duplicate in the 3×3 grid

Backtracking works like this:

  1. Start at the first empty cell.
  2. Try placing "1" → check rules.
  3. If no rule is violated → move to next empty cell.
  4. If a violation occurs → remove the number (backtrack) and try "2", "3", ..., "9".
  5. If all numbers fail → backtrack to previous cell.

This continues until the puzzle is complete. This same method is used in timetabling at universities, resource allocation, map coloring, and validating exam seating plans.

Minimax Algorithm

Minimax is an algorithm used in two-player, turn-based games where players take actions one after another, such as tic-tac-toe, chess, or checkers. The main idea is that one player is trying to choose moves that maximize their chance of winning, while the opponent is assumed to choose moves that minimize that chance. To decide the best move, the algorithm looks ahead at all possible future game states by simulating every move the player could make, every response the opponent could make, and so on. Each final outcome (win, loss, draw) is given a numerical score. Wins receive positive scores, losses receive negative scores, and draws get a neutral score.

Once all these possible future outcomes are evaluated, the algorithm works backward through the game tree. At layers where it is our turn, Minimax selects the move that leads to the highest possible score. At layers where the opponent moves, the algorithm assumes the opponent will choose the option that gives us the lowest score. Following this alternating pattern of maximizing and minimizing ensures that the chosen move is the one that guarantees the best possible outcome even if the opponent plays perfectly.

Note

Assume it is the AI's turn and the board is following for tic-tac-toe:

X O X
O X
O

The AI (playing "X") considers every possible move. If it plays in the bottom-left corner, what are all the opponent's replies? If it plays in the middle-right cell, does it allow the opponent to win? If it plays incorrectly, does it allow a future fork? Minimax will:

  • Assign +1 if the move leads to a guaranteed win
  • Assign 0 if it leads to a draw
  • Assign -1 if it leads to a forced loss

The AI compares the scores of all possible futures. If the opponent can force a win, Minimax will avoid that move. This ensures the AI never loses, even against a perfect human player.

Chess engines, Connect-4, and even some negotiation bots in economics use this principle. The highest rated and open-source chess engine Stockfish uses an advanced minimax search (now also combined with a neural network after its loss to AlphaZero).

Reinforcement Learning

Reinforcement Learning (RL) is a branch of AI in which a system, called an agent, learns how to behave by interacting with an environment rather than by being shown the correct answers. The agent chooses an action, observes what happens as a result, and receives feedback in the form of a numerical reward or penalty. Good actions lead to positive rewards, while harmful or ineffective actions lead to negative ones. Over time, by repeatedly trying different actions and observing their consequences, the agent gradually learns which behaviors produce the highest long-term benefit. This learning process does not rely on explicit instructions. Instead, the agent discovers the best strategy on its own by balancing exploration of new actions with exploitation of actions that have worked well in the past.

Key Concepts:

  • Agent - the learner (a robot, a program, a game player)
  • Environment - the world the agent interacts with
  • Action - something the agent chooses to do
  • Reward - numerical feedback (positive or negative)
  • Policy - the strategy the agent learns

Note

Imagine training a robot to walk. At first, the robot moves randomly because it does not know what actions are helpful. It might take a step and fall, which produces a penalty, or it might shift its weight successfully and stay upright, which produces a reward. As these interactions continue, the robot begins to understand which sequences of movements keep it balanced and which cause it to fall. Eventually, after enough trial and error, the robot develops a stable walking pattern simply because that pattern consistently leads to higher rewards than others. In this way, RL enables systems to learn complex behaviors directly from experience rather than from predefined rules.

Additional Material