Machine Learning

Pages in this section

  • Accuracy
    Last edited: 2026-02-05

    Accuracy

    Suppose we have some model $\hat{f}$ predicting $f$. For some test data $T$ we define

  • Activation function
    Last edited: 2026-02-05

    Activation function

    An activation function $a: \mathbb{R} \rightarrow \mathbb{R}$ gets applied in a perceptron after the weighted sum of the inputs. It is the non-linear term. Classic activation functions are

  • Bayes optimal classifier
    Last edited: 2026-02-05

    Bayesian classification

    Suppose we have a classification problem for some function $f: A \rightarrow B$. We have a hypothesis space $H$ and training data $T$. We want to work out what the best label is for $a \in A$. We can use maximum a posteriori probability to calculate the Bayes optimal classifier

  • Binary step
    Last edited: 2026-02-05

    Binary step

    The binary step is an activation function that outputs one of two values depending on whether its input exceeds a threshold parameter $\theta$.

  • Calculate polynomial regression coefficients for MSE
    Last edited: 2024-01-17

    To demonstrate the calculation of the coefficients for polynomial regression with MSE suppose we have 1-dimensional input $A$ and training data $(x_i, y_i)$. For an order $k$ polynomial we are looking for coefficients $c_j$ for $0 \leq j \leq k$ that roughly do the following

    $$ \left( \begin{array}{ccccc} 1 & x_1 & x_1^2 & \ldots & x_1^k\\ 1 & x_2 & x_2^2 & \ldots & x_2^k\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_n & x_n^2 & \ldots & x_n^k \end{array} \right) \left( \begin{array}{c} c_0\\ c_1\\ \vdots\\ c_k \end{array} \right) \approx \left( \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array} \right), \ \ \ \ \mbox{we call this} \ \ \ \ X C \approx Y $$

    In a not entirely mathematically sound way, we can make the following rearrangement
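    Presumably the rearrangement meant is the usual normal-equations solution $C = (X^T X)^{-1} X^T Y$. A minimal numpy sketch of the whole computation (the function name `polyfit_mse` is illustrative; `lstsq` is used rather than an explicit inverse for numerical stability):

```python
import numpy as np

def polyfit_mse(xs, ys, k):
    """Least-squares coefficients c_0..c_k of an order-k polynomial.

    Builds the matrix X above and solves X C ~ Y in the least-squares
    sense (equivalent to C = (X^T X)^{-1} X^T Y when X^T X is invertible).
    """
    X = np.vander(xs, N=k + 1, increasing=True)  # columns 1, x, x^2, ..., x^k
    C, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return C

# Noise-free data from y = 1 + 2x is recovered exactly.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 1.0 + 2.0 * xs
c = polyfit_mse(xs, ys, k=1)
```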

  • Classification problems
    Last edited: 2026-02-05

    Classification problems

    Classification problems are a subset of supervised learning problems where the codomain $B$ is a finite set.

  • Clustering Problem
    Last edited: 2026-02-05

    Clustering problem

    Suppose we are in the unsupervised learning setting and we have a sense of distance on $A$, $d(\cdot, \cdot)$ (we assume $d(x,y) = d(y,x)$ for $x,y \in A$). We want to find $B$ a set of categories and a function $f: A \to B$ such that $f$ maps items that are close to one another to the same category.

  • Concept
    Last edited: 2026-02-05

    Concept

    Suppose we have a set $A$, then a concept about $A$ is a function $f: A \rightarrow \{0,1\}$. For example, the concept of tall people would map people who are considered tall to true and the rest to false.

  • Concept class
    Last edited: 2026-02-05

    Concept class

    Given some underlying set $A$ a concept class is a set of concepts

  • Consistent clustering
    Last edited: 2026-02-05

    Consistent clustering

    Suppose we are in the clustering problem set up. A clustering algorithm $C$ is considered consistent if for distance $d$ on $T$ we get clustering $C(d) = f$ then any distance $d'$ such that

  • Consistent learner
    Last edited: 2026-02-05

    Consistent learner

    Suppose we are in the modelling framework with some training data $T$ and a hypothesis space $H$. A consistent learner $h \in H$ is one where $h(a) = b$ for all $(a,b) \in T$.

  • Credit assignment problem
    Last edited: 2026-02-05

    Credit assignment problem

    The credit assignment problem in Reinforcement learning refers to the challenge of properly attributing credit or responsibility to actions that led to a particular outcome.

  • Cross validation
    Last edited: 2026-02-05

    Cross validation

    Suppose we are training a model on some training data $T$, which we partition into folds $T_i$ for $1 \leq i \leq k$. Cross validation is the practice of training the model on all but one fold $T_j$, then assessing it on $T_j$ using our objective function. The practice usually involves doing this for every fold and picking the model with the least error.
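    The procedure can be sketched as follows; `train` and `error` are placeholder hooks for whatever model and objective function are in use:

```python
def k_fold_cv_error(data, train, error, k=5):
    """Mean held-out error over the k folds of `data`.

    data  : list of (x, y) observations T
    train : function from a training list to a model h
    error : objective function (model, held-out fold) -> float
    """
    folds = [data[i::k] for i in range(k)]  # simple interleaved partition
    total = 0.0
    for j in range(k):
        rest = [p for i, f in enumerate(folds) if i != j for p in f]
        total += error(train(rest), folds[j])
    return total / k

# Toy use: the "model" just predicts the mean training label and the
# objective is mean squared error on the held-out fold.
data = [(x, 2.0) for x in range(10)]
train = lambda t: (lambda x, m=sum(y for _, y in t) / len(t): m)
error = lambda h, fold: sum((h(x) - y) ** 2 for x, y in fold) / len(fold)
err = k_fold_cv_error(data, train, error, k=5)
```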

  • Crossover (genetic algorithms)
    Last edited: 2026-02-05

    Crossover (genetic algorithms)

    Suppose we have two points $a_1, a_2 \in A$. A crossover $C: A \times A \rightarrow A^k$ is a function that combines these points to produce offspring.

  • Degrees of freedom
    Last edited: 2026-02-05

    Degrees of freedom

    For a model the degrees of freedom is the number of parameters you are able to change.

  • Dependency Trees (Bayesian Network)
    Last edited: 2026-02-05

    Dependency Trees (Bayesian Network)

    A Bayesian network $(G, X)$ is a dependency tree if $G$ is a directed tree . That is, every node has at most one parent. Therefore, there is a function $\pi: V \rightarrow V \cup \{\emptyset\}$ which defines every vertex’s parent or lack thereof, which gives

  • Dimensionality reduction
    Last edited: 2026-02-05

    Dimensionality reduction

    Suppose we are in the modelling framework . Dimensionality reduction is the process of taking our input space $A$ and reducing the number of features within it. It consists of a map $p: \mathbb{F}^N \rightarrow \mathbb{F}^M$ where you usually want $M < N$.

  • Discounted rewards
    Last edited: 2026-02-05

    Discounted rewards

    Discounted rewards is a way to evaluate a Markov decision process with infinite steps. Let $r_i$ for $i \in \mathbb{N}$ be a sequence of rewards for this decision process. Let $\gamma \in [0,1)$ be some constant. Then we set the value of these choices as
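    The stub cuts off before the sum; assuming the standard form $\sum_{i} \gamma^i r_i$, a minimal sketch:

```python
def discounted_value(rewards, gamma):
    """Sum of gamma^i * r_i over a (possibly truncated) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With a constant reward r the infinite sum converges to r / (1 - gamma),
# so a long finite prefix lands very close to that limit.
v = discounted_value([1.0] * 1000, gamma=0.9)  # limit is 1 / (1 - 0.9) = 10
```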

  • Eager learner
    Last edited: 2026-02-05

    Eager learner

    A model is an eager learner if the learning process takes a long time but evaluation time is quick.

  • Ensemble learning
    Last edited: 2026-02-05

    Ensemble learning

    In machine learning, ensemble learning is the approach of training different models on the same problem and combining their results to produce a new model.

  • epsilon-exhausted version space
    Last edited: 2026-02-05

    epsilon-exhausted version space

    Suppose we are in the modelling framework with some training data $T \subset A \times B$, a hypothesis space $H \subset Func(A,B)$ and a probability distribution $\mathbb{D}$ on $A$. For some $0 \leq \epsilon \leq 0.5$ the version space $VS_H(T)$ is $\epsilon$-exhausted if and only if for all $h \in VS_H(T)$ the true error

  • Epsilon-greedy exploration
    Last edited: 2026-02-05

    Epsilon-greedy exploration

    $\epsilon$-greedy exploration is a way of choosing actions in Q-learning . For a sequence $\epsilon: \mathbb{N} \rightarrow [0,1]$ at time step $t \in \mathbb{N}$ you choose action

  • Error function (modelling)
    Last edited: 2026-02-05

    Error function (modelling)

    In the modelling framework the error function determines how good the current model $\hat{f}$ is at fitting the training data $T$. It is a function that maps from our parameter space to some evaluation space, usually $\mathbb{R}$. If we let $o$ be the objective function, it normally takes the form

  • Error rate (modelling)
    Last edited: 2026-02-05

    Error rate (modelling)

    Suppose we are in the modelling framework and have probability distribution $\mathbb{D}: A \rightarrow [0,1]$ on our domain . We define the error rate to be

  • Expectation Maximisation
    Last edited: 2024-03-09

    Expectation maximisation is an algorithm which can be used on problems such as the clustering problem . It is used in situations where you assume your data follows some underlying distribution $\mathbb{D}$, which in this instance is given by parameters $\theta$. We assume $\theta \in H$, our hypothesis space .

    Suppose we draw $m$ samples from $\mathbb{D}_{\theta}$ which has an observed component $X = \{x_1, \ldots, x_m\}$ and an unobserved component $Z = \{z_1, \ldots, z_m\}$. Therefore the samples of $\mathbb{D}_{\theta}$ are given by $Y = X \cup Z$.

  • Explore exploit dilemma
    Last edited: 2026-02-05

    Explore exploit dilemma

    In Machine Learning there is a dilemma when training a model. You can explore the state space more, which risks wasting time in places that are irrelevant, or you can exploit what you already know, which risks getting stuck in a rut and never finding new ways to do things. This same dilemma applies far beyond machine learning into literally everyone’s daily lives.

  • F1 score
    Last edited: 2026-02-05

    F1 score

    For some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define the F1 score to be the harmonic mean of precision and recall . That is
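    As a concrete check, a small sketch computing precision, recall and their harmonic mean from labelled pairs:

```python
def f1_score(pairs):
    """F1 from (predicted, actual) label pairs over {1, -1}.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    and F1 is their harmonic mean.
    """
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == -1)
    fn = sum(1 for p, a in pairs if p == -1 and a == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative:
# precision = recall = 2/3, so F1 = 2/3 as well.
score = f1_score([(1, 1), (1, 1), (1, -1), (-1, 1), (-1, -1)])
```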

  • Filtering (feature selection)
    Last edited: 2026-02-05

    Filtering (feature selection)

    Suppose we have some learning problem where the input domain has a large number of features. Filtering is the process of selecting features for some learning algorithm where we do not explicitly know what the learning algorithm is.

  • Fold (cross validation)
    Last edited: 2026-02-05

    Fold (cross validation)

    Given some training data $T$ for cross validation we partition $T$ into $T_i$ for $1 \leq i \leq k$. A fold of $T$ is one of these partitions.

  • Gaussian kernel (SVM)
    Last edited: 2026-02-05

    Gaussian kernel

    The gaussian kernel is a function that can be used in the Kernel trick for SVMs . It has one variable $\sigma \in \mathbb{R}$ with $\sigma \not = 0$. It is defined as

  • Genetic algorithm (meta)
    Last edited: 2024-02-24

    Suppose we have an optimisation problem which has two properties associated to its domain $A$:

    • There is a method to make a small change to any $a \in A$, called a mutation $M: A \rightarrow \mathcal{P}(A)$, where any element of $M(a)$ is a small change to $a$.
    • It has some crossover function $C: A \times A \rightarrow A^k$.

    Then for a collection of points $P \subset A$ called a population, the algorithm assesses the fitness of each point $f(a)$ for $a \in P$.

  • Georgia Tech
    Last edited: 2026-01-28

    # Overview

    I am enrolled in the online Master of Science in Computer Science (OMSCS ) at Georgia Tech . I am pursuing this degree to properly understand machine learning and distributed computing.

  • Gini index
    Last edited: 2026-02-05

    Gini index

    Suppose we have a discrete random variable $X$ that can take values in $\{1, 2, \ldots, k\}$. We define the Gini index of $X$ to be

  • Gradient descent
    Last edited: 2026-02-05

    This algorithm uses differentiation to minimise error terms for models. Suppose we have a model with some parameters $w \in \mathbb{R}^k$. Further suppose the error function $E(w)$ is differentiable with respect to $w$. Then we set some learning rate $\eta$ and we perturb the weights $w$ by $w \leftarrow w - \eta \frac{\partial E}{\partial w}$ until $w$ settles in some local minimum.
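    The update rule can be sketched on a one-dimensional error function (the function name and toy objective are illustrative):

```python
def gradient_descent(grad, w0, eta=0.1, steps=200):
    """Repeatedly apply the update w <- w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Minimise E(w) = (w - 3)^2, whose derivative is 2 * (w - 3);
# the iterates settle at the minimum w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0)
```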

  • Gridworld
    Last edited: 2026-02-05

    Gridworld

    A gridworld is any set of states that can be modelled as $S \subset \mathbb{Z}^n$, where states can only transition to adjacent states in the grid system.

  • Haussler Theorem
    Last edited: 2026-01-28

    # Statement

    Haussler Theorem

    Suppose we are in the modelling framework with finite hypothesis space $H$ and probability distribution $\mathbb{D}$ on $A$. Set $0 \leq \epsilon \leq 1$. Let our training data $T$ be $m$ i.i.d. samples from $\mathbb{D}$ with associated correct answers. For any hypothesis $h \in H$ which is consistent with $T$ we have a bound on the true error

  • Hyperbolic tangent (tanh)
    Last edited: 2026-02-05

    Hyperbolic tangent

    The hyperbolic tangent function maps $\tanh: \mathbb{R} \rightarrow (-1,1)$ by

  • Impossibility Theorem
    Last edited: 2026-01-28

    # Statement

    Lemma

    Any clustering algorithm cannot be all of the following:

  • Incremental learning
    Last edited: 2026-02-05

    Incremental learning

    Suppose we have a sequence of learning rates $\alpha: \mathbb{N} \rightarrow \mathbb{R}$. Then a value $V$ incrementally learns some random variable $X$ if for $t \in \mathbb{N}$ with samples $x_t$ from $X$ we set

  • Independent component analysis
    Last edited: 2024-03-10

    Independent component analysis is a form of linear dimension reduction . The goal of independent component analysis is to form a linear map to features which are independent of one another.

    Strictly if you previous features were $X_1, X_2, \ldots, X_n$ and you map to $Y_1, Y_2, \ldots, Y_m$ then we want the following statements about Mutual information :

  • Inductive bias
    Last edited: 2026-02-05

    Inductive bias

    This is any preference in a machine learning algorithm to pick one model over another. Examples of this are:

  • Information entropy
    Last edited: 2026-02-05

    Information Entropy

    Suppose we have a discrete random variable $X$ that can take values in $A$. We define the information entropy of $X$ to be
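    The stub cuts off before the formula; assuming the standard definition $H(X) = -\sum_{a \in A} \mathbb{P}[X=a] \log_2 \mathbb{P}[X=a]$, a small sketch:

```python
import math

def entropy(dist):
    """Shannon entropy -sum_a p(a) * log2 p(a) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

h = entropy([0.5, 0.5])  # a fair coin carries exactly 1 bit
```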

  • Instance-based learning
    Last edited: 2026-02-05

    Instance-based learning

    Instance-based learning is a branch of machine learning where, instead of generalising from the training data immediately and never referring back to it, the model stores the instances of the training data and uses them when asked for a value.

  • Introduction to statistical learning with Applications in Python
    Last edited: 2024-01-11

    This is a book by multiple authors about Machine Learning that I was reading with a bunch of my friends. You can find it in my repo or for free on their website.

  • Irreducible error
    Last edited: 2026-02-05

    Irreducible error

    In any form of modelling, inherent irreducible error arises due to limitations in our knowledge, unobservable hidden variables, and measurement constraints. For a function $F: A \rightarrow B$, irreducible error is typically represented as a random variable $\epsilon$ in the target space $B$, with $\mathbb{E}[\epsilon] = 0$ —expressing that, on average, the irreducible error lacks preference in $B$.

  • Irrelevant feature
    Last edited: 2026-02-05

    Irrelevant feature

    For a learning problem a feature $x_i$ of the input space is irrelevant if it is not a Strongly relevant feature or a Weakly relevant feature .

  • Joint Entropy
    Last edited: 2026-02-05

    Joint Entropy

    Suppose we have two random variables $X$ and $Y$ over different domains $A$ and $B$. The joint entropy is defined by

  • k-means clustering
    Last edited: 2024-03-09

    Suppose we are in the clustering problem set up. Suppose we are provided $k$ the number of clusters. Further assume $A$ is a space with some method of averaging $Avg:2^A \rightarrow A$ which satisfies

    $$ \sum_{x \in X} d(x, Avg(X)) = \min_{a \in A} \sum_{x \in X} d(x,a), \mbox{ for all } X \subset A. $$

    In the Euclidean example we will have

    $$ Avg(X) = \frac{1}{\vert X \vert} \sum_{x \in X} x. $$

    To start the algorithm we pick $k$ random points in our training data $T$. We set these to be the current centres $center^0_i$ for $1 \leq i \leq k$. For iteration $t \in \mathbb{N}$ assume we have centres $center_i^{t-1} \in A$ (they don’t need to be points in $T$). Now we group the training data
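    The assign-then-average loop can be sketched in one dimension, where $d(x,y) = \vert x - y \vert$ and $Avg$ is the arithmetic mean (all names illustrative):

```python
def k_means(points, centres, steps=10):
    """One-dimensional k-means, with d(x, y) = |x - y| and Avg the mean."""
    for _ in range(steps):
        clusters = [[] for _ in centres]
        for x in points:  # assign each point to its nearest current centre
            i = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[i].append(x)
        # move each centre to the average of its cluster (if non-empty)
        centres = [sum(c) / len(c) if c else m
                   for c, m in zip(clusters, centres)]
    return centres

# Two well-separated groups; the centres settle on the two group means.
centres = k_means([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], centres=[0.0, 12.0])
```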

  • Kernel trick
    Last edited: 2026-02-05

    Kernel trick

    Suppose we are in the modelling framework with training data $T \subset A \times B$. When using SVMs we want to find a hyperplane that linearly separates the data - though this might not be possible for the current embedding of $T$ in $A$. Though it might be possible for a map

  • Lazy learner
    Last edited: 2026-02-05

    Lazy learner

    A model is a lazy learner if it puts off compute until execution time.

  • Length of a probability
    Last edited: 2026-02-05

    Length of a probability

    For an event $A$ if $\mathbb{P}[A] = p$ then we say the length of $A$ is

  • Linear dimensionality reduction
    Last edited: 2026-02-05

    Linear dimensionality reduction

    This is dimension reduction where we limit the types of maps we do to be linear maps .

  • Linear regression
    Last edited: 2026-02-05

    Linear regression

    Linear regression is the special case of polynomial regression where the degree of the polynomial is 1. Therefore we are restricted to the modelling paradigm of

  • Machine Learning
    Last edited: 2026-02-05

    Machine Learning

    A computer programme is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

  • Machine Learning by Tom Mitchell
    Last edited: 2024-01-10

    This is a book by Tom Mitchell about Machine Learning that was the course textbook for CS7641 Machine Learning . You can find it in the repo below or for free on his website.

  • Margin for a linear separator
    Last edited: 2026-02-05

    Margin for a linear separator

    Suppose we have a set of linearly separable points $X = X_1 \cup X_2 \subset \mathbb{R}^n$ such that $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ separate them. We define the margin of the hyperplane $H = (w, b)$ with respect to $X$ to be

  • Markov decision process
    Last edited: 2026-02-05

    Markov decision process

    A Markov decision process is defined by the following data:

  • Maximum a posteriori probability estimate (MAP)
    Last edited: 2026-02-05

    Maximum a posteriori probability estimate (MAP)

    Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. Further more suppose we have prior belief about the likelihood of each hypothesis represented by a probability distribution over $H$. The maximum a posteriori probability estimate is

  • Maximum likelihood estimation (MLE)
    Last edited: 2026-02-05

    Maximum likelihood estimation (MLE)

    Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. The maximum likelihood estimation is

  • Mean squared error (MSE)
    Last edited: 2026-02-05

    Mean Squared Error (MSE)

    The mean squared error is the square of the $l_2$ norm for two points in $B^k$, i.e.
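    The stub cuts off before the formula; assuming the usual averaged form $\frac{1}{k}\sum_{i=1}^{k} (\hat{y}_i - y_i)^2$, a one-line sketch:

```python
def mse(predicted, actual):
    """Average of squared coordinate-wise differences over the k points."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

e = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # squared errors 0, 0, 4 -> 4/3
```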

  • Mistake bound
    Last edited: 2026-02-05

    Mistake bound

    During an infinite run of a learning task, the mistake bound is the maximum number of mistakes the algorithm could make, in terms of the input size.

  • Model-based reinforcement learning
    Last edited: 2026-02-05

    Model-based reinforcement learning

    Model-based reinforcement learning is a method of performing reinforcement learning where you use the provided transitions to build your model of the problem, then use a planner such as Value iteration (MDP) or Policy Iteration (MDP) to produce a policy.

  • Modelling bias
    Last edited: 2026-02-05

    Modelling bias

    Definition here

  • Modelling framework
    Last edited: 2026-02-05

    Modelling framework

    Suppose we have some random variable we want to predict $Y$ (over a space $B$) and some set of features or predictors in $A$ to make predictions of $Y$; these are sampled from a random variable $X$. You assume there is some relationship between $Y$ and $X$ given by

  • Modelling paradigm
    Last edited: 2026-02-05

    Modelling paradigm

    In the modelling framework, normally the space of all functions $Func(A,B)$ is much too large. Therefore we choose some modelling paradigm $M \subset Func(A,B)$ to pick our candidates from. For example, in linear regression , we choose to pick linear functions in some predefined set of inputs.

  • Mutation (genetic algorithms)
    Last edited: 2026-02-05

    Mutation (genetic algorithms)

    In genetic algorithms a mutation is a small local change.

  • Naive Bayes classifier
    Last edited: 2024-02-24

    Suppose we have a classification problem and we are in the modelling framework . We are given features $A = \oplus_{i=1}^n A_i$ where each feature $A_i$ has finite values (also split up the random variable $X$ into $X = \oplus_{i=1}^n X_i$ for our input). Let our classification also have finite domain $B$.

    We can use our testing data to estimate $\mathbb{P}[Y=b]$ for each $b \in B$ and $\mathbb{P}[X_i = a_i \vert Y = b]$ for each $a_i \in A_i$ and $b \in B$.

  • Objective function
    Last edited: 2026-02-05

    Objective function

    An objective function $o: S \rightarrow E$ is simply a method to assess how “well” some function is doing. The evaluation space $E$ should ideally be totally ordered .

  • Optimisation problem
    Last edited: 2026-02-05

    Optimisation problem

    For an optimisation problem we are given an input space $A$ and a fitness function $f: A \rightarrow B$ where $B$ is ordered (usually $\mathbb{R}$). We want to find

  • Optimistic exploration
    Last edited: 2026-02-05

    Optimistic exploration

    Optimistic exploration is a way of choosing actions in Q-learning . Here we set the initial values of our $\hat{Q}$ to all be very high and we always pick the action that maximises $\hat{Q}$. Then in uncertainty it will explore actions it does not know about.

  • Overfitting
    Last edited: 2026-02-05

    Overfitting

    When training a model on training data , if your model too closely imitates that data without being able to generalise to the test data - you have overfitting.

  • PAC learnable bound with VC-dimension
    Last edited: 2026-01-28

    # Statement

    Lemma

    For a learner with a hypothesis space $H$ with VC dimension $VC(H)$, if we draw

  • PAC-learnable if and only if finite VC dimension
    Last edited: 2026-01-28

    # Statement

    Lemma

    A hypothesis space $H$ is PAC learnable if and only if its VC dimension is finite.

  • Perceptron (neural network)
    Last edited: 2026-02-05

    Perceptron (neural network)

    A perceptron is a function $p: \mathbb{R}^n \rightarrow \mathbb{R}$ defined by two steps - first a weighted linear sum, the second an activation function . To define this we need weights $w_i$ for $1 \leq i \leq n$ and an activation function $a: \mathbb{R} \rightarrow \mathbb{R}$. Where
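    The two steps can be sketched directly; the binary step with threshold $\theta = 1.5$ and the AND-gate weights are illustrative choices:

```python
def perceptron(weights, activation):
    """Return p(x) = a(sum_i w_i * x_i) for the given activation a."""
    def p(xs):
        return activation(sum(w * x for w, x in zip(weights, xs)))
    return p

# A perceptron computing logical AND: unit weights and a binary step
# activation with threshold theta = 1.5.
step = lambda s: 1.0 if s >= 1.5 else 0.0
and_gate = perceptron([1.0, 1.0], step)
outputs = [and_gate([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
```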

  • Polynomial kernel (SVMs)
    Last edited: 2026-02-05

    Polynomial kernel

    The polynomial kernel is a function that can be used in the Kernel trick for SVMs . It has two variables the degree $d \in \mathbb{N}$ and constant $c \in \mathbb{R}$. It is defined as

  • Polynomial regression
    Last edited: 2026-02-05

    Polynomial regression

    In the modelling framework saying we are doing polynomial regression is saying that we are picking the modelling paradigm of functions of the form

  • Polysemy
    Last edited: 2026-02-05

    Polysemy

    One word having multiple meanings in different context.

  • Pre-pruning decision trees
    Last edited: 2026-02-05

    Pre-pruning decision trees

    When building a decision tree to avoid overfitting you may want to stop the branching process even if it could increase the accuracy on the training data . One method to do this is to put in requirements on the final form of the decision tree before you start training. The requirements might be:

  • Precision
    Last edited: 2026-02-05

    Precision

    For some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define

  • Preference bias
    Last edited: 2026-02-05

    Preference bias

    Preference bias arises in the learning process of an algorithm: it comes from how the objective function chooses to go from one step to the next, which generates preferences for some models over others.

  • Principal component analysis
    Last edited: 2026-02-05

    Principal component analysis is a linear dimension reduction algorithm. Conceptually, principal component analysis finds the axes along which the projected data would have maximal variance. It does this by finding the eigenvectors and eigenvalues of the covariance matrix , and uses these eigenvectors as a new basis of the data’s feature space.
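    A sketch of the idea, assuming numpy's `eigh` for the eigendecomposition (the function name `pca` and the toy data are illustrative):

```python
import numpy as np

def pca(X, m):
    """Project rows of X onto the m eigenvectors of the covariance
    matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :m]          # keep the top-m eigenvectors
    return Xc @ basis

# Points lying almost on the line y = x: one component captures nearly
# all of the variance.
X = np.array([[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]])
Z = pca(X, m=1)
```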

  • Probably approximately correct learnable (PAC)
    Last edited: 2026-02-05

    Probably approximately correct learnable (PAC)

    A concept class $C$ is probably approximately correct learnable by learner $L$ using hypothesis space $H$ with error $0 \leq \epsilon \leq 0.5$ and probability $0 \leq 1 - \delta \leq 1$ if and only if learner $L$, when run on concept $c \in C$, with probability $1-\delta$ outputs hypothesis $h \in H$ such that the true error

  • Q-function (RL)
    Last edited: 2026-02-05

    Q-function (RL)

    Suppose we are in a Markov decision process and using discounted rewards define the Q-function $Q : S \times A \rightarrow \mathbb{R}$ as the solution to the following equations

  • Recall
    Last edited: 2026-02-05

    Recall

    For some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define

  • Regression problems
    Last edited: 2026-02-05

    Regression problems

    Regression problems are a subset of supervised learning problems where the codomain $B$ is infinite.

  • Reinforcement learning
    Last edited: 2026-02-05

    Reinforcement learning

    This is a branch of Machine Learning applied in a setting where a learner has a number of actions to take in a given state; each action leads to a different reward, and we want to maximise the total reward. You are provided with previous transitions and want to build a policy in the sense of a Markov decision process .

  • Restart hill climbing
    Last edited: 2024-02-24

    # Random restart hill climbing

    Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.

    Then we want to run Hill climbing but try to get around the issue with only finding local optimum by at the end of a run restarting in a new place.

  • Restriction Bias
    Last edited: 2026-02-05

    Restriction bias

    Restriction bias occurs from choosing a modelling paradigm $M \subset Func(A,B)$: since your candidate cannot be certain functions in the space $Func(A,B)$, you are biased towards the ones in $M$.

  • Rich clustering
    Last edited: 2026-02-05

    Rich clustering

    Suppose we are in the clustering problem set up. A clustering algorithm $C$ is considered rich if for any assignment $f: T \rightarrow B$ there is a distance $d$ on $T$ such that the clustering algorithm will give you that assignment $C(d) = f$.

  • Sample complexity
    Last edited: 2026-02-05

    Sample complexity

    Similar to run-time complexity, sample complexity is the number of samples required for a model to converge to the optimum hypothesis, expressed in terms of the problem size.

  • Scale-invariant clustering
    Last edited: 2026-02-05

    Scale-invariant clustering

    Suppose we are in the clustering problem set up. A clustering algorithm $C$ is considered scale-invariant if for a distance $d$ on $T$, scaling the distance by some $\alpha$ (i.e. $d_{\alpha}(a, b) = \alpha \, d(a,b)$) still produces the same clustering $C(d) = C(d_{\alpha})$.

  • Sigmoid function
    Last edited: 2026-02-05

    Sigmoid function

    The sigmoid function $\sigma: \mathbb{R} \rightarrow (0,1)$ is an attempt to continuously approximate the binary step activation function

  • Sigmoid kernel (SVM)
    Last edited: 2026-02-05

    Sigmoid kernel

    The sigmoid kernel is a function that can be used in the Kernel trick for SVMs . It has two variables $a, b \in \mathbb{R}_{\geq 0}$. It is defined as

  • Simulated Annealing
    Last edited: 2024-02-24

    Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.

    Then we want to emulate Hill climbing but probabilistically allow for random walking. We do this by randomly deciding to take moves from $x$ to $x_t$ based on the temperature of the algorithm at that time, $T$; the acceptance probability is given by

    $$P(x,x_t,T) = \begin{cases} 1 & \mbox{if } f(x_t) \geq f(x)\\ \exp{\frac{f(x_t) - f(x)}{T}} & \mbox{otherwise.} \end{cases}$$

    Then we just modify what is done in Hill climbing with this and gradually decrease the temperature and we have simulated annealing.
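    The acceptance rule and the annealing loop can be sketched as follows (the toy objective, neighbourhood, and cooling schedule are illustrative):

```python
import math
import random

def acceptance_probability(f_x, f_xt, T):
    """P(x, x_t, T) as above: improvements are always accepted, downhill
    moves with probability exp((f(x_t) - f(x)) / T)."""
    return 1.0 if f_xt >= f_x else math.exp((f_xt - f_x) / T)

def simulated_annealing(f, neighbours, x0, temps):
    """Hill climbing that sometimes accepts downhill moves, following
    the decreasing temperature schedule `temps`."""
    x = x0
    for T in temps:
        x_t = random.choice(neighbours(x))
        if random.random() < acceptance_probability(f(x), f(x_t), T):
            x = x_t
    return x

# Maximise f(x) = -(x - 5)^2 over the integers, with +/-1 neighbourhood
# moves and a geometrically cooling schedule.
random.seed(0)
best = simulated_annealing(
    lambda x: -(x - 5) ** 2,
    lambda x: [x - 1, x + 1],
    x0=0,
    temps=[10.0 * (0.95 ** i) for i in range(500)],
)
```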

  • Simulated annealing ending probability
    Last edited: 2026-01-28

    # Statement

    Lemma

    In the Simulated Annealing algorithm with some unspecified assumptions, we have

  • Single linkage clustering
    Last edited: 2024-03-09

    Suppose we are in the clustering problem set up. Here we are given the number of desired clusters $k$, and the sense of distance $d$.

    The idea will be to make every element in our training data $T$ its own cluster. Then we repeatedly take the union of the two clusters which are considered the closest. For single linkage clustering we measure closeness by the minimum distance between points in the clusters, i.e.

  • Soft clustering
    Last edited: 2024-03-09

    Suppose we are in the clustering problem set up. Soft clustering will use Expectation Maximisation with the Gaussian distribution . Here we fix a $\sigma^2$ and then we have

    $$ \mathbb{P}[X=x_i \vert \mu = \mu_j] \propto \exp\left[- \frac{1}{2 \sigma^2} (x_i - \mu_j)^2 \right]. $$

    We will use a mean to optimise $\mu_j$ at each step.

    # Correctness

  • Step function methods
    Last edited: 2026-02-05

    Step function methods

    Suppose we have a function $f: \mathbb{R} \rightarrow \mathbb{R}$. A step function method uses step functions to model it: we choose intervals of the domain and define a different function on each region.

  • Strongly relevant feature
    Last edited: 2026-02-05

    Strongly relevant feature

    For a learning problem a feature $x_i$ of the input space is strongly relevant if removing it degrades the Bayes optimal classifier.

  • Supervised learning
    Last edited: 2026-02-05

    Supervised learning

    In supervised learning we are given an input domain $A$ and an output codomain $B$ with some observations $O \subset A \times B$ of a function $f^{\ast} : A \rightarrow B$. The goal is to approximate $f^{\ast}$ the best we can given these observations.

  • Synonymy
    Last edited: 2026-02-05

    Synonymy

    Multiple words having the same meaning in a particular context.

  • Testing data
    Last edited: 2026-02-05

    Testing data

    Within the modelling framework , testing data is a set of observations $T \subset A \times B$ which are normally samples from $X$ and $Y$. You would use this to assess models after they have been trained. This should be kept distinct from your Training data .

  • The curse of dimensionality
    Last edited: 2024-01-22

    In the modelling framework the curse of dimensionality states as the dimension of $A$ our feature space grows, the amount of training data we need to generalise accurately grows exponentially.

  • Tit for Tat
    Last edited: 2026-02-05

    Tit for Tat

    In a symmetric two player game, the tit for tat strategy just copies the opponent’s move from the previous round. In the first round it will do something random.

  • Training data
    Last edited: 2026-02-05

    Training data

    Within the modelling framework , training data is a set of observations $T \subset A \times B$ which are normally samples from $X$ and $Y$. You would use this to train models. This should be kept distinct from your testing data .

  • Training error
    Last edited: 2026-02-05

    Training error

    Suppose we are in the modelling framework with some training data $T$ and hypothesis space $H$. For some candidate hypothesis $h \in H$ the training error

  • Transforming discrete input for regression
    Last edited: 2024-01-17

    Suppose we are in the modelling framework and have been given a discrete input variable $d \in A'$ for a regression problem . If $d$ has $n$ possible values $v_1, v_2, \ldots, v_n$ we introduce the $(n-1)$-dimensional space $\tilde{A} = \mathbb{R}^{n-1}$. Then we define the transformation $t: A' \rightarrow \tilde{A} = \mathbb{R}^{n-1}$ by

    $$t(v_i) = \begin{cases} e_i & \mbox{if } 1 \leq i \leq n-1\\ 0 & \mbox{otherwise} \end{cases}$$

    where $e_1, e_2, \ldots e_{n-1}$ are the basis vectors of $\tilde{A}$.
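    The transformation can be sketched as follows (the value list and helper name are illustrative):

```python
def discrete_to_vector(values):
    """t(v_i) = e_i for 1 <= i <= n-1 and t(v_n) = 0, as vectors in R^(n-1)."""
    n = len(values)
    def t(v):
        i = values.index(v)  # zero-based, so this e_i is e_{i+1} in the note
        return [1.0 if j == i else 0.0 for j in range(n - 1)]
    return t

# n = 3 values map into R^2: two basis vectors and the zero vector.
t = discrete_to_vector(["red", "green", "blue"])
vectors = [t("red"), t("green"), t("blue")]
```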

  • Transitions (MDP)
    Last edited: 2026-02-05

    Transitions (MDP)

    When presented with a learning problem you are not provided with the Markov decision process; instead you will be provided with transitions, which are the outcomes of previous actions - i.e. a tuple $(s,a,r,s') \in S \times A \times \mathbb{R} \times S$ where $s \in S$ is the start state, $a \in A$ is the action taken, $r \in \mathbb{R}$ is the reward received and $s' \in S$ is the state you ended up in. These can either be pre-computed, or, given a state, the learner can choose an action to find out what the transition is.

  • True error
    Last edited: 2026-02-05

    True error

    Suppose we are in the modelling framework with feature space $A$ with some probability distribution $\mathbb{D}$ on it, a hypothesis space $H$ and the true concept $f: A \rightarrow B$. For some candidate hypothesis $h \in H$, the true error is:

  • Underfitting
    Last edited: 2026-02-05

    Underfitting

    When training a model on training data , if your model does not imitate that data closely enough as you haven’t provided enough degrees of freedom - you have underfitting.

  • Unsupervised learning
    Last edited: 2026-02-05

    Unsupervised learning

    In unsupervised learning we are given an input domain $A$ with some training data $T \subset A$ which are not labelled (you may or may not be provided with an output codomain $B$) . The goal is to make some $f: A \rightarrow B$ (and $B$) that groups items that are “related” to one another given $T$.

  • Useful feature
    Last edited: 2026-02-05

    Useful feature

    For a learning problem, learning algorithm and measure of error, a feature $x_i$ of the input space is useful if including it can decrease the error.

  • Value iteration (MDP)
    Last edited: 2024-04-06

    This is a method of solving Markov decision processes . It uses discounted rewards to get a local as well as global optimum solution.

    We ideally would like to work out what the best policy is

    $$ \pi^{\ast} = \mbox{arg}\max_{\pi} \mathbb{E}\left [ \sum_{i=1}^{\infty} \gamma^i R(s_t) \ \Bigg \vert \ \pi \right ]. $$

    To help us work out what $\pi^{\ast}$ should be, let's define the utility of a state
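    The stub cuts off before the update rule; assuming the usual Bellman-style iteration $U(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} \mathbb{P}[s' \vert s, a] \, U(s')$, a minimal sketch on a two-state toy MDP (all names illustrative):

```python
def value_iteration(states, actions, P, R, gamma, iters=100):
    """Iterate U(s) <- R(s) + gamma * max_a sum_s' P[s'|s,a] * U(s').

    P[(s, a)] is a dict mapping successor states to probabilities and
    R maps each state to its reward.
    """
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * max(sum(p * U[s2]
                                       for s2, p in P[(s, a)].items())
                                   for a in actions)
             for s in states}
    return U

# Two states: "a" moves to the rewarding absorbing state "b".
states, actions = ["a", "b"], ["go"]
P = {("a", "go"): {"b": 1.0}, ("b", "go"): {"b": 1.0}}
R = {"a": 0.0, "b": 1.0}
U = value_iteration(states, actions, P, R, gamma=0.5)
# Fixed point: U(b) = 1 + 0.5 * U(b) = 2 and U(a) = 0.5 * U(b) = 1.
```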

  • Vapnik-Chervonenkis dimension
    Last edited: 2026-02-05

    Vapnik-Chervonenkis dimension

    Suppose we are in the modelling framework with feature space $A$, domain $B$, and we have hypothesis space $H$. The VC-dimension of $H$ is the size of the largest set $S \subset A$ such that for all $l: S \rightarrow B$ there exists $h_l \in H$ such that $h_l(s) = l(s)$ for all $s \in S$.

  • Version space
    Last edited: 2026-02-05

    Version space

    Suppose we are in the modelling framework with some training data $T$ and a hypothesis space $H$. The version space for $T$ is

  • Weak learner
    Last edited: 2026-02-05

    Weak learner

    Suppose we are in the modelling framework a model $\hat{f}$ is called a weak learner if for all probability distributions $\mathbb{D}: A \rightarrow [0,1]$ there exists a sufficiently small $\epsilon$ such that

  • Weakly relevant feature
    Last edited: 2026-02-05

    Weakly relevant feature

    For a learning problem a feature $x_i$ of the input space is weakly relevant if

  • Wrapping (feature selection)
    Last edited: 2026-02-05

    Wrapping (feature selection)

    Suppose we have some learning problem where the input domain has a large number of features. Wrapping is the process of selecting features for a learning algorithm by using that algorithm to evaluate how it performs on different subsets of features.