Machine-Learning
Pages in this section
- Accuracy
Last edited: 2026-02-05AccuracySuppose we have some model $\hat{f}$ predicting $f$. For some test data $T$ we define
- Activation function
Last edited: 2026-02-05Activation functionAn activation function $a: \mathbb{R} \rightarrow \mathbb{R}$ gets applied in a perceptron after the weighted sum of the inputs. It is the non-linear term. Classic activation functions are
- Bayes optimal classifier
Last edited: 2026-02-05Bayesian classificationSuppose we have a classification problem for some function $f: A \rightarrow B$. We have a hypothesis space $H$ and training data $T$. We want to work out what the best label is for $a \in A$. We can use maximum a posteriori probability to calculate the Bayes optimal classifier
- Binary step
Last edited: 2026-02-05Binary stepThe binary step is an activation function that determines what the value is in reference to a parameter $\theta$.
- Calculate polynomial regression coefficients for MSE
Last edited: 2024-01-17To demonstrate the calculation of the coefficients for polynomial regression with MSE suppose we have 1-dimensional input $A$ and training data $(x_i, y_i)$. For an order $k$ polynomial we are looking for coefficients $c_j$ for $0 \leq j \leq k$ that roughly do the following
$$ \begin{pmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^k\\ 1 & x_2 & x_2^2 & \ldots & x_2^k\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_n & x_n^2 & \ldots & x_n^k \end{pmatrix} \begin{pmatrix} c_0\\ c_1\\ \vdots\\ c_k \end{pmatrix} \approx \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix}, \ \ \ \ \mbox{we call this} \ \ \ \ X C \approx Y $$In a completely not mathematically sound way we can do the following rearrangement
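A sketch of this fit in Python with NumPy (the helper name `poly_coefficients` is my own; `np.linalg.lstsq` solves $XC \approx Y$ directly, without the explicit rearrangement):

```python
import numpy as np

def poly_coefficients(xs, ys, k):
    """Fit order-k polynomial coefficients minimising MSE via least squares.

    Builds the Vandermonde matrix X described above and solves X C ~= Y.
    """
    X = np.vander(np.asarray(xs, dtype=float), N=k + 1, increasing=True)
    # lstsq minimises ||X C - Y||^2 without explicitly inverting X^T X.
    C, *_ = np.linalg.lstsq(X, np.asarray(ys, dtype=float), rcond=None)
    return C

# An order-1 fit on points from y = 1 + 2x recovers the coefficients (1, 2).
coeffs = poly_coefficients([0, 1, 2, 3], [1, 3, 5, 7], k=1)
```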
- Classification problems
Last edited: 2026-02-05Classification problemsClassification problems are a subset of supervised learning problems where the codomain $B$ is a finite set.
- Clustering Problem
Last edited: 2026-02-05Clustering problemSuppose we are in the unsupervised learning setting and we have a sense of distance on $A$, $d(\cdot, \cdot)$ (we assume $d(x,y) = d(y,x)$ for $x,y \in A$). We want to find $B$ a set of categories and a function $f: A \to B$ such that $f$ maps items that are close to one another to the same category.
- Concept
Last edited: 2026-02-05ConceptSuppose we have a set $A$ then a concept about $A$ is a function $f: A \rightarrow \{0,1\}$. I.e. tall people would map people who were considered tall to true and the rest to false.
- Concept class
Last edited: 2026-02-05Concept classGiven some underlying set $A$ a concept class is a set of concepts
- Consistent clustering
Last edited: 2026-02-05Consistent clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered consistent if for distance $d$ on $T$ we get clustering $C(d) = f$ then any distance $d'$ such that
- Consistent learner
Last edited: 2026-02-05Consistent learnerSuppose we are in the modelling framework with some training data $T$ and a hypothesis space $H$. A consistent learner $h \in H$ is one where $h(a) = b$ for all $(a,b) \in T$.
- Credit assignment problem
Last edited: 2026-02-05Credit assignment problemThe credit assignment problem in Reinforcement learning refers to the challenge of properly attributing credit or responsibility to actions that led to a particular outcome.
- Cross validation
Last edited: 2026-02-05Cross validationSuppose we are training a model on some training data $T$, and we partition $T$ into folds $T_i$ for $1 \leq i \leq k$. Then cross validation is the practice of training the model on all but one fold $T_j$, then assessing it on $T_j$ using our objective function . The practice usually involves doing this for all possible folds and picking the model with least error.
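The fold loop can be sketched as follows; `train` and `error` are placeholder callables standing in for the model-fitting step and the objective function:

```python
def k_fold_errors(data, k, train, error):
    """Cross validation sketch: hold out each fold T_j in turn.

    `train` fits a model on a list of (a, b) pairs; `error` scores a model
    on held-out pairs. Both are assumed to be supplied by the caller.
    """
    folds = [data[i::k] for i in range(k)]  # partition T into k folds
    errors = []
    for j in range(k):
        held_out = folds[j]
        rest = [pair for i, f in enumerate(folds) if i != j for pair in f]
        model = train(rest)
        errors.append(error(model, held_out))
    return errors

# Toy usage: "train" memorises the mean label, "error" is mean absolute error.
data = [(x, 2 * x) for x in range(10)]
train = lambda t: sum(b for _, b in t) / len(t)
error = lambda m, t: sum(abs(b - m) for _, b in t) / len(t)
errs = k_fold_errors(data, k=5, train=train, error=error)
```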
- Crossover (genetic algorithms)
Last edited: 2026-02-05Crossover (genetic algorithms)Suppose we have two points $a_1, a_2 \in A$. A crossover $C: A \times A \rightarrow A^k$ is a function that combines these points to produce offspring.
- Degrees of freedom
Last edited: 2026-02-05Degrees of freedomFor a model the degrees of freedom is the number of parameters you are able to change.
- Dependency Trees (Bayesian Network)
Last edited: 2026-02-05Dependency Trees (Bayesian Network)A Bayesian network $(G, X)$ is a dependency tree if $G$ is a directed tree . That is, every node has at most one parent. Therefore, there is a function $\pi: V \rightarrow V \cup \{\emptyset\}$ which defines every vertex’s parent or lack thereof, which gives
- Dimensionality reduction
Last edited: 2026-02-05Dimensionality reductionSuppose we are in the modelling framework . Dimensionality reduction is the process of taking our input space $A$ and reducing the number of features within it. It consists of a map $p: \mathbb{F}^N \rightarrow \mathbb{F}^M$ where you usually want $M < N$.
- Discounted rewards
Last edited: 2026-02-05Discounted rewardsDiscounted rewards is a way to evaluate a Markov decision process with infinite steps. Let $r_i$ for $i \in \mathbb{N}$ be a sequence of rewards for this decision process. Let $\gamma \in [0,1)$ be some constant. Then we set the value of these choices as
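Assuming the value is the usual discounted sum $\sum_i \gamma^i r_i$ (the formula itself is cut off above), a minimal sketch:

```python
def discounted_value(rewards, gamma):
    """Sum gamma**i * r_i over a (finite prefix of a) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A constant reward of 1 converges towards 1 / (1 - gamma) = 2 for gamma = 0.5.
v = discounted_value([1] * 50, gamma=0.5)
```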
- Eager learner
Last edited: 2026-02-05Eager learnerA model is an eager learner if the learning process takes a long time but evaluation time is quick.
- Ensemble learning
Last edited: 2026-02-05Ensemble learningIn machine learning ensemble learning is the approach of combining different models on the same problem and combining their results to provide a new model.
- epsilon-exhausted version space
Last edited: 2026-02-05epsilon-exhausted version spaceSuppose we are in the modelling framework with some training data $T \subset A \times B$, a hypothesis space $H \subset Func(A,B)$ and a probability distribution $\mathbb{D}$ on $A$. For some $0 \leq \epsilon \leq 0.5$ the version space $VS_H(T)$ is $\epsilon$-exhausted if and only if for all $h \in VS_H(T)$ the true error
- Epsilon-greedy exploration
Last edited: 2026-02-05Epsilon-greedy exploration$\epsilon$-greedy exploration is a way of choosing actions in Q-learning . For a sequence $\epsilon: \mathbb{N} \rightarrow [0,1]$ at time step $t \in \mathbb{N}$ you choose action
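A sketch of the action choice, assuming the standard rule (explore uniformly with probability $\epsilon$, otherwise act greedily on $\hat{Q}$):

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick the argmax action with probability 1 - eps, a random one otherwise.

    `q_values` maps actions to current Q-hat estimates; `eps` is epsilon(t).
    """
    actions = list(q_values)
    if rng.random() < eps:
        return rng.choice(actions)  # explore
    return max(actions, key=q_values.get)  # exploit

best = epsilon_greedy({"left": 0.1, "right": 0.9}, eps=0.0)  # always greedy
```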
- Error function (modelling)
Last edited: 2026-02-05Error function (modelling)In the modelling framework the error function determines how good the current model $\hat{f}$ is at fitting the training data $T$. This is a function that maps from our parameter space to some evaluation space, usually $\mathbb{R}$ - if we let $o$ be the objective function it normally takes the form
- Error rate (modelling)
Last edited: 2026-02-05Error rate (modelling)Suppose we are in the modelling framework and have probability distribution $\mathbb{D}: A \rightarrow [0,1]$ on our domain . We define the error rate to be
- Expectation Maximisation
Last edited: 2024-03-09Expectation maximisation is an algorithm which can be used on problems such as the clustering problem . It is used in situations where you assume your data follows some underlying distribution $\mathbb{D}$, which in this instance is given by parameters $\theta$. We assume $\theta \in H$, our hypothesis space .
Suppose we draw $m$ samples from $\mathbb{D}_{\theta}$ which has an observed component $X = \{x_1, \ldots, x_m\}$ and an unobserved component $Z = \{z_1, \ldots, z_m\}$. Therefore the samples of $\mathbb{D}_{\theta}$ are given by $Y = X \cup Z$.
- Explore exploit dilemma
Last edited: 2026-02-05Explore exploit dilemmaIn Machine Learning there is a dilemma when training a model. You can explore the state space more, which can waste time in places that are irrelevant or you can exploit what you know more but can get you in a rut not finding new ways to do things. This same dilemma applies far beyond machine learning into literally everyone’s daily lives.
- F1 score
Last edited: 2026-02-05F1 scoreFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define the F1 score to be the harmonic mean of precision and recall . That is
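A sketch computing precision, recall, and their harmonic mean on labelled test pairs:

```python
def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall for labels in {1, -1}."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == -1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == -1 and a == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F1 = 0.5.
score = f1_score([1, 1, -1, -1], [1, -1, 1, -1])
```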
- Filtering (feature selection)
Last edited: 2026-02-05Filtering (feature selection)Suppose we have some learning problem where the input domain has a large number of features. Filtering is the process of selecting features for some learning algorithm where we do not explicitly know what the learning algorithm is.
- Fold (cross validation)
Last edited: 2026-02-05Fold (cross validation)Given some training data $T$ for cross validation we partition $T$ into $T_i$ for $1 \leq i \leq k$. A fold of $T$ is one of these partitions.
- Gaussian kernel (SVM)
Last edited: 2026-02-05Gaussian kernelThe Gaussian kernel is a function that can be used in the Kernel trick for SVMs . It has one parameter $\sigma \in \mathbb{R}$ with $\sigma \not = 0$. It is defined as
- Genetic algorithm (meta)
Last edited: 2024-02-24Suppose we have an optimisation problem which has two properties associated to its domain $A$
- There is a method to make a small change to any $a \in A$, called a mutation $M: A \rightarrow \mathcal{P}(A)$, where any element of $M(a)$ is a small change to $a$.
- It has some crossover function $C: A \times A \rightarrow A^k$.
Then for a collection of points $P \subset A$, called a population, the algorithm assesses the fitness $f(a)$ of each point $a \in P$.
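The generation step above can be sketched as follows; the keep-the-fitter-half selection scheme is an illustrative choice, not part of the definition:

```python
import random

def evolve(population, fitness, mutate, crossover, rng=random):
    """One generation: select fit parents, cross over, then mutate offspring.

    `mutate(a)` returns a set of neighbours of `a`; `crossover(a1, a2)`
    returns a tuple of offspring, matching the definitions above.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]  # illustrative selection
    children = []
    while len(children) < len(population):
        a1, a2 = rng.sample(parents, 2)
        for child in crossover(a1, a2):
            children.append(rng.choice(list(mutate(child))))
    return children[: len(population)]

# Toy usage: maximise -x^2 over the integers.
pop = evolve([10, -8, 3, 7], fitness=lambda x: -x * x,
             mutate=lambda x: {x - 1, x + 1},
             crossover=lambda a, b: ((a + b) // 2,))
```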
- Georgia Tech
Last edited: 2026-01-28# Overview
I am enrolled in the online Master of Science in Computer Science (OMSCS ) at Georgia Tech . I am pursuing this degree to properly understand machine learning and distributed computing.
- Gini index
Last edited: 2026-02-05Gini indexSuppose we have a discrete random variable $X$ that can take values in $\{1, 2, \ldots, k\}$. We define the Gini index of $X$ to be
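The definition is truncated above; assuming the common form $1 - \sum_i p_i^2$ with $p_i$ estimated from sample frequencies, a sketch:

```python
from collections import Counter

def gini_index(samples):
    """Gini index 1 - sum(p_i^2), with p_i estimated from frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# Two equally likely classes: 1 - (0.25 + 0.25) = 0.5.
g = gini_index([1, 1, 2, 2])
```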
- Gradient descent
Last edited: 2026-02-05This algorithm uses differentiation to minimise error terms for models. Suppose we have a model with some parameters $w \in \mathbb{R}^k$. Further suppose the error function $E(w)$ is differentiable with respect to $w$. Then we set some learning rate $\eta$ and we perturb the weights $w$ by $w \leftarrow w - \eta \frac{\partial E}{\partial w}$ until $w$ settles in some local minimum.
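A minimal sketch of the update rule, with the gradient supplied as a function:

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Iterate w <- w - eta * dE/dw using a supplied gradient function."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# E(w) = (w - 3)^2 has dE/dw = 2(w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
```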
- Gridworld
Last edited: 2026-02-05GridworldA gridworld is any set of states that can be modelled as $S \subset \mathbb{Z}^n$, where states can only transition to adjacent states in the grid system.
- Haussler Theorem
Last edited: 2026-01-28# Statement
Haussler TheoremSuppose we are in the modelling framework with finite hypothesis space $H$ and probability distribution $\mathbb{D}$ on $A$. Set $0 \leq \epsilon \leq 1$. Let our training data $T$ be $m$ i.i.d. samples from $\mathbb{D}$ with associated correct answers. For any hypothesis $h \in H$ which is consistent with $T$ we have a bound on the true error
- Hyperbolic tangent (tanh)
Last edited: 2026-02-05Hyperbolic tangentThe hyperbolic tangent function maps $\tanh: \mathbb{R} \rightarrow (-1,1)$ by
- Impossibility Theorem
Last edited: 2026-01-28# Statement
LemmaAny clustering algorithm cannot be all of the following:
- Incremental learning
Last edited: 2026-02-05Incremental learningSuppose we have a sequence of learning rates $\alpha: \mathbb{N} \rightarrow \mathbb{R}$. Then a value $V$ incrementally learns some random variable $X$ if for $t \in \mathbb{N}$ with samples $x_t$ from $X$ we set
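The update rule is cut off above; assuming the usual incremental form $V \leftarrow V + \alpha_t (x_t - V)$, a sketch with $\alpha_t = 1/t$ (which makes $V$ track the running mean):

```python
def incremental_mean(samples):
    """Incremental update V <- V + alpha_t * (x_t - V) with alpha_t = 1/t.

    With this learning-rate sequence V equals the running mean of the samples.
    """
    v = 0.0
    for t, x in enumerate(samples, start=1):
        v += (1 / t) * (x - v)
    return v

v = incremental_mean([2, 4, 6])  # the mean of 2, 4, 6 is 4
```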
- Independent component analysis
Last edited: 2024-03-10Independent component analysis is a form of linear dimension reduction . The goal of independent component analysis is to form a linear map to features which are independent of one another.
Strictly, if your previous features were $X_1, X_2, \ldots, X_n$ and you map to $Y_1, Y_2, \ldots, Y_m$ then we want the following statements about Mutual information :
- Inductive bias
Last edited: 2026-02-05Inductive biasThis is any preference in a machine learning algorithm to pick one model over another. Examples of this are:
- Information entropy
Last edited: 2026-02-05Information EntropySuppose we have a discrete random variable $X$ that can take values in $A$. We define the information entropy of $X$ to be
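Assuming the usual Shannon form $-\sum_a p_a \log_2 p_a$ with probabilities estimated from sample frequencies, a sketch:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy -sum p_a log2 p_a, with p_a estimated from frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = entropy(["heads", "tails", "heads", "tails"])  # a fair coin carries 1 bit
```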
- Instance-based learning
Last edited: 2026-02-05Instance-based learningInstance-based learning is a branch of machine learning where, instead of generalising from the training data immediately and never referring back to it, the model stores the instances of the training data and uses them when asked for a value.
- Introduction to statistical learning with Applications in Python
Last edited: 2024-01-11This is a book by multiple authors about Machine Learning I was reading with a bunch of my friends. You can find it in my repo or for free on their website .
- Irreducible error
Last edited: 2026-02-05Irreducible errorIn any form of modelling, inherent irreducible error arises due to limitations in our knowledge, unobservable hidden variables, and measurement constraints. For a function $F: A \rightarrow B$, irreducible error is typically represented as a random variable $\epsilon$ in the target space $B$, with $\mathbb{E}[\epsilon] = 0$ —expressing that, on average, the irreducible error lacks preference in $B$.
- Irrelevant feature
Last edited: 2026-02-05Irrelevant featureFor a learning problem a feature $x_i$ of the input space is irrelevant if it is not a Strongly relevant feature or Weakly relevant feature .
- Joint Entropy
Last edited: 2026-02-05Joint EntropySuppose we have two random variables $X$ and $Y$ over different domains $A$ and $B$. The joint entropy is defined by
- k-means clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Suppose we are provided $k$ the number of clusters. Further assume $A$ is a space with some method of averaging $Avg:2^A \rightarrow A$ which satisfies
$$ \sum_{x \in X} d(x, Avg(X)) = \min_{a \in A} \sum_{x \in X} d(x,a), \mbox{ for all } X \subset A. $$In the Euclidean example we will have
$$ Avg(X) = \frac{1}{\vert X \vert} \sum_{x \in X} x. $$To start the algorithm we pick $k$ random points in our training data $T$. We set these to be the current centres $center^0_i$ for $1 \leq i \leq k$. For iteration $t \in \mathbb{N}$ assume we have centres $center_i^{t-1} \in A$ (they don’t need to be points in $T$). Now we group the training data
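The iteration above can be sketched in one dimension, where $Avg$ is the arithmetic mean:

```python
import random

def k_means(points, k, steps=20, rng=random):
    """k-means in 1-D Euclidean space: Avg(X) is the arithmetic mean.

    Assign each point to its nearest centre, then recompute each centre
    as the mean of its cluster; repeat for a fixed number of iterations.
    """
    centres = rng.sample(points, k)  # k random points of T as initial centres
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[i].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

rng = random.Random(0)
centres = k_means([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], k=2, rng=rng)
```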
- Kernel trick
Last edited: 2026-02-05Kernel trickSuppose we are in the modelling framework with training data $T \subset A \times B$. When using SVMs we want to find a hyperplane that linearly separates the data - though this might not be possible for the current embedding of $T$ in $A$. Though it might be possible for a map
- Lazy learner
Last edited: 2026-02-05Lazy learnerA model is a lazy learner if it puts off compute until execution time.
- Length of a probability
Last edited: 2026-02-05Length of a probabilityFor an event $A$ if $\mathbb{P}[A] = p$ then we say the length of $A$ is
- Linear dimensionality reduction
Last edited: 2026-02-05Linear dimensionality reductionThis is dimension reduction where we limit the types of maps we do to be linear maps .
- Linear regression
Last edited: 2026-02-05Linear regressionLinear regression is the special case of polynomial regression where the degree of the polynomial is 1. Therefore we are restricted to the modelling paradigm of
- Machine Learning
Last edited: 2026-02-05Machine LearningA computer programme is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
- Machine Learning by Tom Mitchell
Last edited: 2024-01-10This is a book by Tom Mitchell about Machine Learning that was the course text book for CS7641 Machine Learning . You can find it the repo below or for free on his website .
- Margin for a linear separator
Last edited: 2026-02-05Margin for a linear separatorSuppose we have a set of linearly separable points $X = X_1 \cup X_2 \subset \mathbb{R}^n$ such that $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ separate them. We define the margin of the hyperplane $H = (w, b)$ with respect to $X$ to be
- Markov decision process
Last edited: 2026-02-05Markov decision processA Markov decision process is defined by the following data:
- Maximum a posteriori probability estimate (MAP)
Last edited: 2026-02-05Maximum a posteriori probability estimate (MAP)Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. Further more suppose we have prior belief about the likelihood of each hypothesis represented by a probability distribution over $H$. The maximum a posteriori probability estimate is
- Maximum likelihood estimation (MLE)
Last edited: 2026-02-05Maximum likelihood estimation (MLE)Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. The maximum likelihood estimation is
- Mean squared error (MSE)
Last edited: 2026-02-05Mean Squared Error (MSE)The mean squared error is the square of the $l_2$ distance between two points in $B^k$, i.e.
- Mistake bound
Last edited: 2026-02-05Mistake boundDuring an infinite run of a learning task, the mistake bound is the maximum number of mistakes the algorithm could make, in terms of the input size.
- Model-based reinforcement learning
Last edited: 2026-02-05Model-based reinforcement learningModel-based reinforcement learning is a method of performing reinforcement learning where you use the provided transitions to build your model of the problem, then use a planner such as Value iteration (MDP) or Policy Iteration (MDP) to produce a policy.
- Modelling bias
Last edited: 2026-02-05Modelling biasDefinition here
- Modelling framework
Last edited: 2026-02-05Modelling frameworkSuppose we have some random variable we want to predict, $Y$ (over a space $B$), and some set of features or predictors in $A$ used to make predictions of $Y$; these are sampled from a random variable $X$. You assume there is some relationship between $Y$ and $X$ given by
- Modelling paradigm
Last edited: 2026-02-05Modelling paradigmIn the modelling framework , normally the space of all functions $Func(A,B)$ is much too large. Therefore we choose some modelling paradigm $M \subset Func(A,B)$ to pick our candidates from. For example, in linear regression , we choose linear functions in some predefined set of inputs.
- Mutation (genetic algorithms)
Last edited: 2026-02-05Mutation (genetic algorithms)In genetic algorithms a mutation is a small local change.
- Naive Bayes classifier
Last edited: 2024-02-24Suppose we have a classification problem and we are in the modelling framework . We are given features $A = \oplus_{i=1}^n A_i$ where each feature $A_i$ has finite values (also split up the random variable $X$ into $X = \oplus_{i=1}^n X_i$ for our input). Let our classification also have finite domain $B$.
We can use our training data to estimate $\mathbb{P}[Y=b]$ for each $b \in B$ and $\mathbb{P}[X_i = a_i \vert Y = b]$ for each $a_i \in A_i$ and $b \in B$.
- Objective function
Last edited: 2026-02-05Objective functionAn objective function $o: S \rightarrow E$ is simply a method to assess how “well” some function is doing. The evaluation space $E$ should ideally be totally ordered .
- Optimisation problem
Last edited: 2026-02-05Optimisation problemFor an optimisation problem we are given an input space $A$ and a fitness function $f: A \rightarrow B$ where $B$ is ordered (usually $\mathbb{R}$). We want to find
- Optimistic exploration
Last edited: 2026-02-05Optimistic explorationOptimistic exploration is a way of choosing actions in Q-learning . Here we set the initial values of our $\hat{Q}$ to all be very high and we always pick the action that maximises $\hat{Q}$. Then in uncertainty it will explore actions it does not know about.
- Overfitting
Last edited: 2026-02-05OverfittingWhen training a model on training data , if your model too closely imitates that data without being able to generalise to the test data - you have overfitting.
- PAC learnable bound with VC-dimension
Last edited: 2026-01-28# Statement
LemmaFor a learner with a hypothesis space $H$ with VC dimension $VC(H)$, if we draw
- PAC-learnable if and only if finite VC dimension
Last edited: 2026-01-28# Statement
LemmaA hypothesis space $H$ is PAC learnable if and only if its VC dimension is finite.
- Perceptron (neural network)
Last edited: 2026-02-05Perceptron (neural network)A perceptron is a function $p: \mathbb{R}^n \rightarrow \mathbb{R}$ defined by two steps - first a weighted linear sum, the second an activation function . To define this we need weights $w_i$ for $1 \leq i \leq n$ and an activation function $a: \mathbb{R} \rightarrow \mathbb{R}$. Where
- Polynomial kernel (SVMs)
Last edited: 2026-02-05Polynomial kernelThe polynomial kernel is a function that can be used in the Kernel trick for SVMs . It has two variables the degree $d \in \mathbb{N}$ and constant $c \in \mathbb{R}$. It is defined as
- Polynomial regression
Last edited: 2026-02-05Polynomial regressionIn the modelling framework saying we are doing polynomial regression is saying that we are picking the modelling paradigm of functions of the form
- Polysemy
Last edited: 2026-02-05PolysemyOne word having multiple meanings in different contexts.
- Pre-pruning decision trees
Last edited: 2026-02-05Pre-pruning decision treesWhen building a decision tree to avoid overfitting you may want to stop the branching process even if it could increase the accuracy on the training data . One method to do this is to put in requirements on the final form of the decision tree before you start training. The requirements might be:
- Precision
Last edited: 2026-02-05PrecisionFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define
- Preference bias
Last edited: 2026-02-05Preference biasPreference bias occurs in the learning process of an algorithm. This will be from how the objective function chooses to go from one step to the next. This will generate preferences for different models.
- Principal component analysis
Last edited: 2026-02-05Principal component analysis is a linear dimension reduction algorithm. In concept, principal component analysis finds the axes along which the data would have maximal variance if projected. It does this by finding the eigenvectors and eigenvalues of the Covariance matrix . It uses these eigenvectors as a new basis of the data’s feature space.
- Probably approximately correct learnable (PAC)
Last edited: 2026-02-05Probably approximately correct learnable (PAC)A concept class $C$ is probably approximately correct learnable by learner $L$ using hypothesis space $H$ with error $0 \leq \epsilon \leq 0.5$ and probability $0 \leq 1 - \delta \leq 1$ if and only if learner $L$, when run on concept $c \in C$, with probability $1-\delta$ outputs hypothesis $h \in H$ such that the true error
- Q-function (RL)
Last edited: 2026-02-05Q-function (RL)Suppose we are in a Markov decision process and using discounted rewards define the Q-function $Q : S \times A \rightarrow \mathbb{R}$ as the solution to the following equations
- Recall
Last edited: 2026-02-05RecallFor some binary classification problem where we are using $\hat{f}: A \rightarrow \{1, -1\}$ to predict $f: A \rightarrow \{1, -1\}$ for some testing data $T$ we define
- Regression problems
Last edited: 2026-02-05Regression problemsRegression problems are a subset of supervised learning problems where the codomain $B$ is infinite.
- Reinforcement learning
Last edited: 2026-02-05Reinforcement learningThis is a branch of Machine Learning applied in a setting where a learner has a number of actions to take in a given state - each action will lead to a different reward and we want to maximise it. You are provided with previous transitions and want to build a policy in the sense of a Markov decision process .
- Restart hill climbing
Last edited: 2024-02-24# Random restart hill climbing
Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.
Then we want to run Hill climbing but get around the issue of only finding local optima by restarting in a new place at the end of each run.
- Restriction Bias
Last edited: 2026-02-05Restriction biasRestriction bias occurs from choosing a modelling paradigm , $M \subset Func(A,B)$: as your candidate cannot be any function in the space $Func(A,B)$, you are biased towards the ones in $M$.
- Rich clustering
Last edited: 2026-02-05Rich clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered rich if for any assignment $f: T \rightarrow B$ there is a distance $d$ on $T$ such that the clustering algorithm will give you that assignment $C(d) = f$.
- Sample complexity
Last edited: 2026-02-05Sample complexitySimilar to run time complexity, it is the number of samples required for a model to converge to the optimum hypothesis, in terms of the problem size.
- Scale-invariant clustering
Last edited: 2026-02-05Scale-invariant clusteringSuppose we are in the clustering problem set up. A clustering algorithm $C$ is considered scale-invariant if for a distance $d$ on $T$, scaling the distance by some $\alpha$ (i.e. $d_{\alpha}(a, b) = \alpha \, d(a,b)$) still produces the same clustering $C(d) = C(d_{\alpha})$.
- Sigmoid function
Last edited: 2026-02-05Sigmoid functionThe sigmoid function $\sigma: \mathbb{R} \rightarrow (0,1)$ is an attempt to continuously approximate the binary step activation function
- Sigmoid kernel (SVM)
Last edited: 2026-02-05Sigmoid kernelThe sigmoid kernel is a function that can be used in the Kernel trick for SVMs . It has two variables $a, b \in \mathbb{R}_{\geq 0}$. It is defined as
- Simulated Annealing
Last edited: 2024-02-24Suppose we have an optimisation problem where $A$ has some sense of neighbourhood $N(a)$ for each point $a \in A$.
Then we want to emulate Hill climbing but probabilistically allow for random walking. We do this by randomly deciding to take moves from $x$ to $x_t$ based on the temperature of the algorithm at that time, $T$. This is given by
$$P(x,x_t,T) = \begin{cases} 1 & \mbox{if } f(x_t) \geq f(x)\\ \exp{\frac{f(x_t) - f(x)}{T}} & \mbox{otherwise.} \end{cases}$$Then we modify Hill climbing with this acceptance rule and gradually decrease the temperature, and we have simulated annealing.
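A sketch of the full loop for a maximisation problem, using the acceptance probability above (the temperature schedule here is an arbitrary choice):

```python
import math
import random

def simulated_annealing(f, neighbours, x0, temps, rng=random):
    """Simulated annealing maximising f with the acceptance rule P(x, x_t, T).

    `temps` is a decreasing sequence of temperatures; `neighbours(x)` is N(x).
    """
    x = x0
    for T in temps:
        x_t = rng.choice(list(neighbours(x)))
        # Always accept improvements; accept worse moves with prob exp(df/T).
        if f(x_t) >= f(x) or rng.random() < math.exp((f(x_t) - f(x)) / T):
            x = x_t
    return x

# Maximise f(x) = -x^2 over the integers, neighbours are x +/- 1.
rng = random.Random(1)
best = simulated_annealing(lambda x: -x * x, lambda x: (x - 1, x + 1),
                           x0=20, temps=[10 / (1 + i) for i in range(500)],
                           rng=rng)
```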
- Simulated annealing ending probability
Last edited: 2026-01-28# Statement
LemmaIn the Simulated Annealing algorithm with some unspecified assumptions, we have
- Single linkage clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Here we are given the number of desired clusters $k$, and the sense of distance $d$.
The idea will be to make every element in our training data $T$ its own cluster. Then we repeatedly take the union of the two clusters which are considered the closest. For single linkage clustering the distance between two clusters is the minimum distance between points in them, i.e.
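A sketch of the merge loop, with the cluster distance taken as the minimum pairwise distance:

```python
def single_linkage(points, k, d):
    """Single linkage clustering: start with singleton clusters, then
    repeatedly merge the two clusters whose minimum pairwise distance
    is smallest, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups on the line end up as the two clusters.
clusters = single_linkage([0.0, 0.2, 9.0, 9.1], k=2, d=lambda a, b: abs(a - b))
```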
- Soft clustering
Last edited: 2024-03-09Suppose we are in the clustering problem set up. Soft clustering will use Expectation Maximisation with the Gaussian distribution . Here we fix a $\sigma^2$ and then we have
$$ \mathbb{P}[X=x_i \vert \mu = \mu_j] = \exp\left[- \frac{1}{2} \sigma^2 (x_i - \mu_j)^2 \right]. $$We will use a mean to optimise $\mu_j$ at each step.
# Correctness
- It will have all the downsides of Expectation Maximisation but we can use restarts to overcome them.
- Step function methods
Last edited: 2026-02-05 - Activation function