Machine-Learning
Pages in this section
- Accuracy
Last edited: 2026-02-05AccuracySuppose we have some model $\hat{f}$ predicting $f$. For some test data $T$ we define
- Activation function
Last edited: 2026-02-05Activation functionAn activation function $a: \mathbb{R} \rightarrow \mathbb{R}$ gets applied in a perceptron after the weighted sum of the inputs. It is the non-linear term. Classic activation functions are
- Bayes optimal classifier
Last edited: 2026-02-05Bayesian classificationSuppose we have a classification problem for some function $f: A \rightarrow B$. We have a hypothesis space $H$ and training data $T$. We want to work out what the best label is for $a \in A$. We can use maximum a posteriori probability to calculate the Bayes optimal classifier
- Binary step
Last edited: 2026-02-05Binary stepThe binary step is an activation function that determines what the value is in reference to a parameter $\theta$.
- Calculate polynomial regression coefficients for MSE
Last edited: 2024-01-17To demonstrate the calculation of the coefficients for polynomial regression with MSE suppose we have 1-dimensional input $A$ and training data $(x_i, y_i)$. For an order $k$ polynomial we are looking for coefficients $c_j$ for $0 \leq j \leq k$ that roughly do the following
$$ \begin{pmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^k\\ 1 & x_2 & x_2^2 & \ldots & x_2^k\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_n & x_n^2 & \ldots & x_n^k \end{pmatrix} \begin{pmatrix} c_0\\ c_1\\ \vdots\\ c_k \end{pmatrix} \approx \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix}, \ \ \ \ \mbox{we call this} \ \ \ \ X C \approx Y $$In a completely not mathematically sound way we can do the following rearrangement
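A sketch of this fit in Python with NumPy (the helper name `poly_coefficients` is my own; `np.linalg.lstsq` solves $XC \approx Y$ directly, without the explicit rearrangement):

```python
import numpy as np

def poly_coefficients(xs, ys, k):
    """Fit order-k polynomial coefficients minimising MSE via least squares.

    Builds the Vandermonde matrix X described above and solves X C ~= Y.
    """
    X = np.vander(np.asarray(xs, dtype=float), N=k + 1, increasing=True)
    # lstsq minimises ||X C - Y||^2 without explicitly inverting X^T X.
    C, *_ = np.linalg.lstsq(X, np.asarray(ys, dtype=float), rcond=None)
    return C

# An order-1 fit on points from y = 1 + 2x recovers the coefficients (1, 2).
coeffs = poly_coefficients([0, 1, 2, 3], [1, 3, 5, 7], k=1)
```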
- Classification problems
Last edited: 2026-02-05Classification problemsClassification problems are a subset of supervised learning problems where the codomain $B$ is a finite set.
- Clustering Problem
Last edited: 2026-02-05Clustering problemSuppose we are in the unsupervised learning setting and we have a sense of distance on $A$, $d(\cdot, \cdot)$ (we assume $d(x,y) = d(y,x)$ for $x,y \in A$). We want to find $B$ a set of categories and a function $f: A \to B$ such that $f$ maps items that are close to one another to the same category.
- Concept
Last edited: 2026-02-05ConceptSuppose we have a set $A$ then a concept about $A$ is a function $f: A \rightarrow \{0,1\}$. I.e. tall people would map people who were considered tall to true and the rest to false.
- Concept class
Last edited: 2026-02-05Concept classGiven some underlying set $A$ a concept class is a set of concepts
- Consistent clustering
Last edited: 2026-02-05Consistent clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered consistent if for distance $d$ on $T$ we get clustering $C(d) = f$ then any distance $d'$ such that
- Consistent learner
Last edited: 2026-02-05Consistent learnerSuppose we are in the modelling framework with some training data $T$ and a hypothesis space $H$. A consistent learner $h \in H$ is one where $h(a) = b$ for all $(a,b) \in T$.
- Credit assignment problem
Last edited: 2026-02-05Credit assignment problemThe credit assignment problem in Reinforcement learning refers to the challenge of properly attributing credit or responsibility to actions that led to a particular outcome.
- Cross validation
Last edited: 2026-02-05Cross validationSuppose we are training a model on some training data $T$, and we partition $T$ into folds $T_i$ for $1 \leq i \leq k$. Then cross validation is the practice of training the model on all but one fold $T_j$, then assessing it on $T_j$ using our objective function . The practice usually involves doing this for all possible folds and picking the model with least error.
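The fold loop can be sketched as follows; `train` and `error` are placeholder callables standing in for the model-fitting step and the objective function:

```python
def k_fold_errors(data, k, train, error):
    """Cross validation sketch: hold out each fold T_j in turn.

    `train` fits a model on a list of (a, b) pairs; `error` scores a model
    on held-out pairs. Both are assumed to be supplied by the caller.
    """
    folds = [data[i::k] for i in range(k)]  # partition T into k folds
    errors = []
    for j in range(k):
        held_out = folds[j]
        rest = [pair for i, f in enumerate(folds) if i != j for pair in f]
        model = train(rest)
        errors.append(error(model, held_out))
    return errors

# Toy usage: "train" memorises the mean label, "error" is mean absolute error.
data = [(x, 2 * x) for x in range(10)]
train = lambda t: sum(b for _, b in t) / len(t)
error = lambda m, t: sum(abs(b - m) for _, b in t) / len(t)
errs = k_fold_errors(data, k=5, train=train, error=error)
```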
- Crossover (genetic algorithms)
Last edited: 2026-02-05Crossover (genetic algorithms)Suppose we have two points $a_1, a_2 \in A$. A crossover $C: A \times A \rightarrow A^k$ is a function that combines these points to produce offspring.
- Degrees of freedom
Last edited: 2026-02-05Degrees of freedomFor a model the degrees of freedom is the number of parameters you are able to change.
- Dependency Trees (Bayesian Network)
Last edited: 2026-02-05Dependency Trees (Bayesian Network)A Bayesian network $(G, X)$ is a dependency tree if $G$ is a directed tree . That is, every node has at most one parent. Therefore, there is a function $\pi: V \rightarrow V \cup \{\emptyset\}$ which defines every vertex’s parent or lack thereof, which gives
- Dimensionality reduction
Last edited: 2026-02-05Dimensionality reductionSuppose we are in the modelling framework . Dimensionality reduction is the process of taking our input space $A$ and reducing the number of features within it. It consists of a map $p: \mathbb{F}^N \rightarrow \mathbb{F}^M$ where you usually want $M < N$.
- Discounted rewards
Last edited: 2026-02-05Discounted rewardsDiscounted rewards is a way to evaluate a Markov decision process with infinite steps. Let $r_i$ for $i \in \mathbb{N}$ be a sequence of rewards for this decision process. Let $\gamma \in [0,1)$ be some constant. Then we set the value of these choices as
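Assuming the value is the usual discounted sum $\sum_i \gamma^i r_i$ (the formula itself is cut off above), a minimal sketch:

```python
def discounted_value(rewards, gamma):
    """Sum gamma**i * r_i over a (finite prefix of a) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A constant reward of 1 converges towards 1 / (1 - gamma) = 2 for gamma = 0.5.
v = discounted_value([1] * 50, gamma=0.5)
```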
- Eager learner
Last edited: 2026-02-05Eager learnerA model is an eager learner if the learning process takes a long time but evaluation time is quick.
- Ensemble learning
Last edited: 2026-02-05Ensemble learningIn machine learning ensemble learning is the approach of combining different models on the same problem and combining their results to provide a new model.
- epsilon-exhausted version space
Last edited: 2026-02-05epsilon-exhausted version spaceSuppose we are in the modelling framework with some training data $T \subset A \times B$, a hypothesis space $H \subset Func(A,B)$ and a probability distribution $\mathbb{D}$ on $A$. For some $0 \leq \epsilon \leq 0.5$ the version space $VS_H(T)$ is $\epsilon$-exhausted if and only if for all $h \in VS_H(T)$ the true error
- Epsilon-greedy exploration
Last edited: 2026-02-05Epsilon-greedy exploration$\epsilon$-greedy exploration is a way of choosing actions in Q-learning . For a sequence $\epsilon: \mathbb{N} \rightarrow [0,1]$ at time step $t \in \mathbb{N}$ you choose action
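A sketch of the action choice, assuming the standard rule (explore uniformly with probability $\epsilon$, otherwise act greedily on $\hat{Q}$):

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick the argmax action with probability 1 - eps, a random one otherwise.

    `q_values` maps actions to current Q-hat estimates; `eps` is epsilon(t).
    """
    actions = list(q_values)
    if rng.random() < eps:
        return rng.choice(actions)  # explore
    return max(actions, key=q_values.get)  # exploit

best = epsilon_greedy({"left": 0.1, "right": 0.9}, eps=0.0)  # always greedy
```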
- Error function (modelling)
Last edited: 2026-02-05Error function (modelling)In the modelling framework the error function determines how good the current model $\hat{f}$ is at fitting the training data $T$. This is a function that maps from our parameter space to some evaluation space, usually $\mathbb{R}$ - if we let $o$ be the objective function it normally takes the form
- Error rate (modelling)
Last edited: 2026-02-05Error rate (modelling)Suppose we are in the modelling framework and have probability distribution $\mathbb{D}: A \rightarrow [0,1]$ on our domain . We define the error rate to be
- Expectation Maximisation
Last edited: 2024-03-09Expectation maximisation is an algorithm which can be used on problems such as the clustering problem . It is used in situations where you assume your data follows some underlying distribution $\mathbb{D}$, which in this instance is given by parameters $\theta$. We assume $\theta \in H$, our hypothesis space .
Suppose we draw $m$ samples from $\mathbb{D}_{\theta}$ which has an observed component $X = \{x_1, \ldots, x_m\}$ and an unobserved component $Z = \{z_1, \ldots, z_m\}$. Therefore the samples of $\mathbb{D}_{\theta}$ are given by $Y = X \cup Z$.
- Explore exploit dilemma
Last edited: 2026-02-05Explore exploit dilemmaIn Machine Learning there is a dilemma when training a model. You can explore the state space more, which can waste time in places that are irrelevant or you can exploit what you know more but can get you in a rut not finding new ways to do things. This same dilemma applies far beyond machine learning into literally everyone’s daily lives.
- F1 score
Last edited: 2026-02-05F1 scoreFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define the F1 score to be the harmonic mean of precision and recall . That is
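A sketch computing precision, recall, and their harmonic mean on labelled test pairs:

```python
def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall for labels in {1, -1}."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == -1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == -1 and a == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F1 = 0.5.
score = f1_score([1, 1, -1, -1], [1, -1, 1, -1])
```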
- Filtering (feature selection)
Last edited: 2026-02-05Filtering (feature selection)Suppose we have some learning problem where the input domain has a large number of features. Filtering is the process of selecting features for some learning algorithm where we do not explicitly know what the learning algorithm is.
- Fold (cross validation)
Last edited: 2026-02-05Fold (cross validation)Given some training data $T$ for cross validation we partition $T$ into $T_i$ for $1 \leq i \leq k$. A fold of $T$ is one of these partitions.
- Gaussian kernel (SVM)
Last edited: 2026-02-05Gaussian kernelThe Gaussian kernel is a function that can be used in the Kernel trick for SVMs . It has one parameter $\sigma \in \mathbb{R}$ with $\sigma \not = 0$. It is defined as
- Genetic algorithm (meta)
Last edited: 2024-02-24Suppose we have an optimisation problem which has two properties associated to its domain $A$
- There is a method to make a small change to any $a \in A$, called a mutation $M: A \rightarrow \mathcal{P}(A)$, where any element of $M(a)$ is a small change to $a$.
- It has some crossover function $C: A \times A \rightarrow A^k$.
Then for a collection of points $P \subset A$, called a population, the algorithm assesses the fitness $f(a)$ of each point $a \in P$.
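The generation step above can be sketched as follows; the keep-the-fitter-half selection scheme is an illustrative choice, not part of the definition:

```python
import random

def evolve(population, fitness, mutate, crossover, rng=random):
    """One generation: select fit parents, cross over, then mutate offspring.

    `mutate(a)` returns a set of neighbours of `a`; `crossover(a1, a2)`
    returns a tuple of offspring, matching the definitions above.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]  # illustrative selection
    children = []
    while len(children) < len(population):
        a1, a2 = rng.sample(parents, 2)
        for child in crossover(a1, a2):
            children.append(rng.choice(list(mutate(child))))
    return children[: len(population)]

# Toy usage: maximise -x^2 over the integers.
pop = evolve([10, -8, 3, 7], fitness=lambda x: -x * x,
             mutate=lambda x: {x - 1, x + 1},
             crossover=lambda a, b: ((a + b) // 2,))
```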
- Georgia Tech
Last edited: 2026-01-28# Overview
I am enrolled in the online Master of Science in Computer Science (OMSCS ) at Georgia Tech . I am pursuing this degree to properly understand machine learning and distributed computing.
- Gini index
Last edited: 2026-02-05Gini indexSuppose we have a discrete random variable $X$ that can take values in $\{1, 2, \ldots, k\}$. We define the Gini index of $X$ to be
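The definition is truncated above; assuming the common form $1 - \sum_i p_i^2$ with $p_i$ estimated from sample frequencies, a sketch:

```python
from collections import Counter

def gini_index(samples):
    """Gini index 1 - sum(p_i^2), with p_i estimated from frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# Two equally likely classes: 1 - (0.25 + 0.25) = 0.5.
g = gini_index([1, 1, 2, 2])
```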
- Gradient descent
Last edited: 2026-02-05This algorithm uses differentiation to minimise error terms for models. Suppose we have a model with some parameters $w \in \mathbb{R}^k$. Further suppose the error function $E(w)$ is differentiable with respect to $w$. Then we set some learning rate $\eta$ and we perturb the weights $w$ by $w \leftarrow w - \eta \frac{\partial E}{\partial w}$ until $w$ settles in some local minimum.
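A minimal sketch of the update rule, with the gradient supplied as a function:

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Iterate w <- w - eta * dE/dw using a supplied gradient function."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# E(w) = (w - 3)^2 has dE/dw = 2(w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
```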
- Gridworld
Last edited: 2026-02-05GridworldA gridworld is any set of states that can be modelled as $S \subset \mathbb{Z}^n$, where states can only transition to adjacent states in the grid system.
- Haussler Theorem
Last edited: 2026-01-28# Statement
Haussler TheoremSuppose we are in the modelling framework with finite hypothesis space $H$ and probability distribution $\mathbb{D}$ on $A$. Set $0 \leq \epsilon \leq 1$. Let our training data $T$ be $m$ i.i.d. samples from $\mathbb{D}$ with associated correct answers. For any hypothesis $h \in H$ which is consistent with $T$ we have a bound on the true error
- Hyperbolic tangent (tanh)
Last edited: 2026-02-05Hyperbolic tangentThe hyperbolic tangent function maps $\tanh: \mathbb{R} \rightarrow (-1,1)$ by
- Impossibility Theorem
Last edited: 2026-01-28# Statement
LemmaAny clustering algorithm cannot be all of the following:
- Incremental learning
Last edited: 2026-02-05Incremental learningSuppose we have a sequence of learning rates $\alpha: \mathbb{N} \rightarrow \mathbb{R}$. Then a value $V$ incrementally learns some random variable $X$ if for $t \in \mathbb{N}$ with samples $x_t$ from $X$ we set
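The update rule is cut off above; assuming the usual incremental form $V \leftarrow V + \alpha_t (x_t - V)$, a sketch with $\alpha_t = 1/t$ (which makes $V$ track the running mean):

```python
def incremental_mean(samples):
    """Incremental update V <- V + alpha_t * (x_t - V) with alpha_t = 1/t.

    With this learning-rate sequence V equals the running mean of the samples.
    """
    v = 0.0
    for t, x in enumerate(samples, start=1):
        v += (1 / t) * (x - v)
    return v

v = incremental_mean([2, 4, 6])  # the mean of 2, 4, 6 is 4
```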
- Independent component analysis
Last edited: 2024-03-10Independent component analysis is a form of linear dimension reduction . The goal of independent component analysis is to form a linear map to features which are independent of one another.
Strictly, if your previous features were $X_1, X_2, \ldots, X_n$ and you map to $Y_1, Y_2, \ldots, Y_m$ then we want the following statements about Mutual information :
- Inductive bias
Last edited: 2026-02-05Inductive biasThis is any preference in a machine learning algorithm to pick one model over another. Examples of this are:
- Information entropy
Last edited: 2026-02-05Information EntropySuppose we have a discrete random variable $X$ that can take values in $A$. We define the information entropy of $X$ to be
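Assuming the usual Shannon form $-\sum_a p_a \log_2 p_a$ with probabilities estimated from sample frequencies, a sketch:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy -sum p_a log2 p_a, with p_a estimated from frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = entropy(["heads", "tails", "heads", "tails"])  # a fair coin carries 1 bit
```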
- Instance-based learning
Last edited: 2026-02-05Instance-based learningInstance-based learning is a branch of machine learning where, instead of generalising from the training data immediately and never referring back to it, the model stores the instances of the training data and uses them when asked for a value.
- Introduction to statistical learning with Applications in Python
Last edited: 2024-01-11This is a book by multiple authors about Machine Learning I was reading with a bunch of my friends. You can find it in my repo or for free on their website .
- Irreducible error
Last edited: 2026-02-05Irreducible errorIn any form of modelling, inherent irreducible error arises due to limitations in our knowledge, unobservable hidden variables, and measurement constraints. For a function $F: A \rightarrow B$, irreducible error is typically represented as a random variable $\epsilon$ in the target space $B$, with $\mathbb{E}[\epsilon] = 0$ —expressing that, on average, the irreducible error lacks preference in $B$.
- Irrelevant feature
Last edited: 2026-02-05Irrelevant featureFor a learning problem a feature $x_i$ of the input space is irrelevant if it is not a Strongly relevant feature or Weakly relevant feature .
- Joint Entropy
Last edited: 2026-02-05Joint EntropySuppose we have two random variables $X$ and $Y$ over different domains $A$ and $B$. The joint entropy is defined by
- k-means clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Suppose we are provided $k$ the number of clusters. Further assume $A$ is a space with some method of averaging $Avg:2^A \rightarrow A$ which satisfies
$$ \sum_{x \in X} d(x, Avg(X)) = \min_{a \in A} \sum_{x \in X} d(x,a), \mbox{ for all } X \subset A. $$In the Euclidean example we will have
$$ Avg(X) = \frac{1}{\vert X \vert} \sum_{x \in X} x. $$To start the algorithm we pick $k$ random points in our training data $T$. We set these to be the current centres $center^0_i$ for $1 \leq i \leq k$. For iteration $t \in \mathbb{N}$ assume we have centres $center_i^{t-1} \in A$ (they don’t need to be points in $T$). Now we group the training data
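The iteration above can be sketched in one dimension, where $Avg$ is the arithmetic mean:

```python
import random

def k_means(points, k, steps=20, rng=random):
    """k-means in 1-D Euclidean space: Avg(X) is the arithmetic mean.

    Assign each point to its nearest centre, then recompute each centre
    as the mean of its cluster; repeat for a fixed number of iterations.
    """
    centres = rng.sample(points, k)  # k random points of T as initial centres
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[i].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

rng = random.Random(0)
centres = k_means([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], k=2, rng=rng)
```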
- Kernel trick
Last edited: 2026-02-05Kernel trickSuppose we are in the modelling framework with training data $T \subset A \times B$. When using SVMs we want to find a hyperplane that linearly separates the data - though this might not be possible for the current embedding of $T$ in $A$. Though it might be possible for a map
- Lazy learner
Last edited: 2026-02-05Lazy learnerA model is a lazy learner if it puts off compute until execution time.
- Length of a probability
Last edited: 2026-02-05Length of a probabilityFor an event $A$ if $\mathbb{P}[A] = p$ then we say the length of $A$ is
- Linear dimensionality reduction
Last edited: 2026-02-05Linear dimensionality reductionThis is dimension reduction where we limit the types of maps we do to be linear maps .
- Linear regression
Last edited: 2026-02-05Linear regressionLinear regression is the special case of polynomial regression where the degree of the polynomial is 1. Therefore we are restricted to the modelling paradigm of
- Machine Learning
Last edited: 2026-02-05Machine LearningA computer programme is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
- Machine Learning by Tom Mitchell
Last edited: 2024-01-10This is a book by Tom Mitchell about Machine Learning that was the course text book for CS7641 Machine Learning . You can find it the repo below or for free on his website .
- Margin for a linear separator
Last edited: 2026-02-05Margin for a linear separatorSuppose we have a set of linearly separable points $X = X_1 \cup X_2 \subset \mathbb{R}^n$ such that $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ separate them. We define the margin of the hyperplane $H = (w, b)$ with respect to $X$ to be
- Markov decision process
Last edited: 2026-02-05Markov decision processA Markov decision process is defined by the following data:
- Maximum a posteriori probability estimate (MAP)
Last edited: 2026-02-05Maximum a posteriori probability estimate (MAP)Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. Further more suppose we have prior belief about the likelihood of each hypothesis represented by a probability distribution over $H$. The maximum a posteriori probability estimate is
- Maximum likelihood estimation (MLE)
Last edited: 2026-02-05Maximum likelihood estimation (MLE)Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. The maximum likelihood estimation is
- Mean squared error (MSE)
Last edited: 2026-02-05Mean Squared Error (MSE)The mean squared error is the square of the $l_2$ distance between two points in $B^k$, i.e.
- Mistake bound
Last edited: 2026-02-05Mistake boundDuring an infinite run of a learning task, the mistake bound is the maximum number of mistakes the algorithm could make, in terms of the input size.
- Model-based reinforcement learning
Last edited: 2026-02-05Model-based reinforcement learningModel-based reinforcement learning is a method of performing reinforcement learning where you use the provided transitions to build your model of the problem, then use a planner such as Value iteration (MDP) or Policy Iteration (MDP) to produce a policy.
- Modelling bias
Last edited: 2026-02-05Modelling biasDefinition here
- Modelling framework
Last edited: 2026-02-05Modelling frameworkSuppose we have some random variable we want to predict, $Y$ (over a space $B$), and some set of features or predictors in $A$ used to make predictions of $Y$; these are sampled from a random variable $X$. You assume there is some relationship between $Y$ and $X$ given by
- Modelling paradigm
Last edited: 2026-02-05Modelling paradigmIn the modelling framework , normally the space of all functions $Func(A,B)$ is much too large. Therefore we choose some modelling paradigm $M \subset Func(A,B)$ to pick our candidates from. For example, in linear regression , we choose linear functions in some predefined set of inputs.
- Mutation (genetic algorithms)
Last edited: 2026-02-05Mutation (genetic algorithms)In genetic algorithms a mutation is a small local change.
- Naive Bayes classifier
Last edited: 2024-02-24Suppose we have a classification problem and we are in the modelling framework . We are given features $A = \oplus_{i=1}^n A_i$ where each feature $A_i$ has finite values (also split up the random variable $X$ into $X = \oplus_{i=1}^n X_i$ for our input). Let our classification also have finite domain $B$.
We can use our training data to estimate $\mathbb{P}[Y=b]$ for each $b \in B$ and $\mathbb{P}[X_i = a_i \vert Y = b]$ for each $a_i \in A_i$ and $b \in B$.
- Objective function
Last edited: 2026-02-05Objective functionAn objective function $o: S \rightarrow E$ is simply a method to assess how “well” some function is doing. The evaluation space $E$ should ideally be totally ordered .
- Optimisation problem
Last edited: 2026-02-05Optimisation problemFor an optimisation problem we are given an input space $A$ and a fitness function $f: A \rightarrow B$ where $B$ is ordered (usually $\mathbb{R}$). We want to find
- Optimistic exploration
Last edited: 2026-02-05Optimistic explorationOptimistic exploration is a way of choosing actions in Q-learning . Here we set the initial values of our $\hat{Q}$ to all be very high and we always pick the action that maximises $\hat{Q}$. Then in uncertainty it will explore actions it does not know about.
- Overfitting
Last edited: 2026-02-05OverfittingWhen training a model on training data , if your model too closely imitates that data without being able to generalise to the test data - you have overfitting.
- PAC learnable bound with VC-dimension
Last edited: 2026-01-28# Statement
LemmaFor a learner with a hypothesis space $H$ with VC dimension $VC(H)$, if we draw
- PAC-learnable if and only if finite VC dimension
Last edited: 2026-01-28# Statement
LemmaA hypothesis space $H$ is PAC learnable if and only if its VC dimension is finite.
- Perceptron (neural network)
Last edited: 2026-02-05Perceptron (neural network)A perceptron is a function $p: \mathbb{R}^n \rightarrow \mathbb{R}$ defined by two steps - first a weighted linear sum, the second an activation function . To define this we need weights $w_i$ for $1 \leq i \leq n$ and an activation function $a: \mathbb{R} \rightarrow \mathbb{R}$. Where
- Polynomial kernel (SVMs)
Last edited: 2026-02-05Polynomial kernelThe polynomial kernel is a function that can be used in the Kernel trick for SVMs . It has two variables the degree $d \in \mathbb{N}$ and constant $c \in \mathbb{R}$. It is defined as
- Polynomial regression
Last edited: 2026-02-05Polynomial regressionIn the modelling framework saying we are doing polynomial regression is saying that we are picking the modelling paradigm of functions of the form
- Polysemy
Last edited: 2026-02-05PolysemyOne word having multiple meanings in different contexts.
- Pre-pruning decision trees
Last edited: 2026-02-05Pre-pruning decision treesWhen building a decision tree to avoid overfitting you may want to stop the branching process even if it could increase the accuracy on the training data . One method to do this is to put in requirements on the final form of the decision tree before you start training. The requirements might be:
- Precision
Last edited: 2026-02-05PrecisionFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define
- Preference bias
Last edited: 2026-02-05Preference biasPreference bias occurs in the learning process of an algorithm. This will be from how the objective function chooses to go from one step to the next. This will generate preferences for different models.
- Principal component analysis
Last edited: 2026-02-05Principal component analysis is a linear dimension reduction algorithm. In concept, principal component analysis finds the axes along which the data would have maximal variance if projected. It does this by finding the eigenvectors and eigenvalues of the Covariance matrix . It uses these eigenvectors as a new basis of the data’s feature space.
- Probably approximately correct learnable (PAC)
Last edited: 2026-02-05Probably approximately correct learnable (PAC)A concept class $C$ is probably approximately correct learnable by learner $L$ using hypothesis space $H$ with error $0 \leq \epsilon \leq 0.5$ and probability $0 \leq 1 - \delta \leq 1$ if and only if learner $L$, when run on concept $c \in C$, with probability $1-\delta$ outputs hypothesis $h \in H$ such that the true error
- Q-function (RL)
Last edited: 2026-02-05Q-function (RL)Suppose we are in a Markov decision process and using discounted rewards define the Q-function $Q : S \times A \rightarrow \mathbb{R}$ as the solution to the following equations
- Recall
Last edited: 2026-02-05RecallFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define
- Regression problems
Last edited: 2026-02-05Regression problemsRegression problems are a subset of supervised learning problems where the codomain $B$ is infinite.
- Reinforcement learning
Last edited: 2026-02-05Reinforcement learningThis is a branch of Machine Learning applied in a setting where a learner has a number of actions to take in a given state - each action will lead to a different reward and we want to maximise it. You are provided with previous transitions and want to build a policy in the sense of a Markov decision process .
- Restart hill climbing
Last edited: 2024-02-24# Random restart hill climbing
Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.
Then we want to run Hill climbing but get around the issue of only finding local optima by restarting in a new place at the end of each run.
- Restriction Bias
Last edited: 2026-02-05Restriction biasRestriction bias occurs from choosing a modelling paradigm , $M \subset Func(A,B)$: as your candidate cannot be any function in the space $Func(A,B)$, you are biased towards the ones in $M$.
- Rich clustering
Last edited: 2026-02-05Rich clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered rich if for any assignment $f: T \rightarrow B$ there is a distance $d$ on $T$ such that the clustering algorithm will give you that assignment $C(d) = f$.
- Sample complexity
Last edited: 2026-02-05Sample complexitySimilar to run time complexity, it is the number of samples required for a model to converge to the optimum hypothesis, in terms of the problem size.
- Scale-invariant clustering
Last edited: 2026-02-05Scale-invariant clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered scale-invariant if for a distance $d$ on $T$, scaling the distance by some $\alpha$ (i.e. $d_{\alpha}(a, b) = \alpha \, d(a,b)$) still produces the same clustering $C(d) = C(d_{\alpha})$.
- Sigmoid function
Last edited: 2026-02-05Sigmoid functionThe sigmoid function $\sigma: \mathbb{R} \rightarrow (0,1)$ is an attempt to continuously approximate the binary step activation function
- Sigmoid kernel (SVM)
Last edited: 2026-02-05Sigmoid kernelThe sigmoid kernel is a function that can be used in the Kernel trick for SVMs . It has two variables $a, b \in \mathbb{R}_{\geq 0}$. It is defined as
- Simulated Annealing
Last edited: 2024-02-24Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.
Then we want to emulate Hill climbing but probabilistically allow for random walking. We do this by randomly deciding to take moves from $x$ to $x_t$ based on the temperature of the algorithm at that time, $T$. This is given by
$$P(x,x_t,T) = \begin{cases} 1 & \mbox{if } f(x_t) \geq f(x)\\ \exp{\frac{f(x_t) - f(x)}{T}} & \mbox{otherwise.} \end{cases}$$Then we modify Hill climbing with this acceptance rule and gradually decrease the temperature, and we have simulated annealing.
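A sketch of the full loop for a maximisation problem, using the acceptance probability above (the temperature schedule here is an arbitrary choice):

```python
import math
import random

def simulated_annealing(f, neighbours, x0, temps, rng=random):
    """Simulated annealing maximising f with the acceptance rule P(x, x_t, T).

    `temps` is a decreasing sequence of temperatures; `neighbours(x)` is N(x).
    """
    x = x0
    for T in temps:
        x_t = rng.choice(list(neighbours(x)))
        # Always accept improvements; accept worse moves with prob exp(df/T).
        if f(x_t) >= f(x) or rng.random() < math.exp((f(x_t) - f(x)) / T):
            x = x_t
    return x

# Maximise f(x) = -x^2 over the integers, neighbours are x +/- 1.
rng = random.Random(1)
best = simulated_annealing(lambda x: -x * x, lambda x: (x - 1, x + 1),
                           x0=20, temps=[10 / (1 + i) for i in range(500)],
                           rng=rng)
```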
- Simulated annealing ending probability
Last edited: 2026-01-28# Statement
LemmaIn the Simulated Annealing algorithm with some unspecified assumptions, we have
- Single linkage clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Here we are given the number of desired clusters $k$, and the sense of distance $d$.
The idea will be to make every element in our training data $T$ its own cluster. Then we repeatedly take the union of the two clusters which are considered the closest. For single linkage clustering the distance between two clusters is the minimum distance between points in them, i.e.
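A sketch of the merge loop, with the cluster distance taken as the minimum pairwise distance:

```python
def single_linkage(points, k, d):
    """Single linkage clustering: start with singleton clusters, then
    repeatedly merge the two clusters whose minimum pairwise distance
    is smallest, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups on the line end up as the two clusters.
clusters = single_linkage([0.0, 0.2, 9.0, 9.1], k=2, d=lambda a, b: abs(a - b))
```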
- Soft clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Soft clustering will use Expectation Maximisation with the Gaussian distribution . Here we fix a $\sigma^2$ and then we have
$$ \mathbb{P}[X=x_i \vert \mu = \mu_j] = \exp\left[- \frac{1}{2} \sigma^2 (x_i - \mu_j)^2 \right]. $$We will use a mean to optimise $\mu_j$ at each step.
# Correctness
- It will have all the downsides of Expectation Maximisation but we can use restarts to overcome them.
- Step function methods
Last edited: 2026-02-05 - Activation function