Week 15 - Markov Chains
Before talking about Markov chains it is good to recap what a probability matrix is:
Now we can define a Markov chain.
Suppose we have 4 states with given probabilities of moving between them. This is commonly represented as an edge-weighted directed graph.
In terms of a probability matrix it would be represented as
where $p_{i,j}$ represents the edge weight going from $i$ to $j$ in the graph above.
Steps through the graph
In a Markov chain the entry $p_{i,j}$ of $P$ represents the probability of going from $i$ to $j$ in one step. The entries of $P^k$ then represent the probabilities of going from $i$ to $j$ in $k$ steps, so $P^k$ is still a probability matrix. In the example above
$$ P^2 = \left ( \begin{array}{cccc} 0.35 & 0.25 & 0.25 & 0.15\\ 0.31 & 0.25 & 0.35 & 0.09\\ 0.06 & 0.21 & 0.64 & 0.09\\ 0.56 & 0.35 & 0 & 0.09\\ \end{array}\right ) $$As you can see, there is no path from state $4$ to state $3$ in 2 steps.
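As a quick sketch of this fact, the snippet below raises a transition matrix to a power and checks that the result is still a probability matrix. The example's own matrix $P$ is not reproduced in these notes, so the matrix here is a made-up 4-state one.

```python
import numpy as np

# Hypothetical 4-state transition matrix (each row sums to 1);
# this is NOT the matrix from the example above.
P = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.3, 0.3, 0.4],
    [0.7, 0.3, 0.0, 0.0],
])

# Entry (i, j) of P^k is the probability of going from i to j in k steps.
P2 = np.linalg.matrix_power(P, 2)

# P^2 is still a probability matrix: every row sums to 1.
print(P2.sum(axis=1))
```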
Stationary distribution
In the example above, as $k$ gets large, $P^k$ converges to a matrix whose rows are all equal. For very large $k$ we may have
$$ \lim_{k \rightarrow \infty} P^k = \left ( \begin{array}{ccc} \cdots & \pi & \cdots\\ \cdots & \pi & \cdots\\ \vdots & \vdots & \vdots\\ \cdots & \pi & \cdots\\ \end{array}\right ) $$for some stationary distribution $\pi$. For the example above we have
$$ \pi = \left ( \begin{array}{cccc} 0.244186 & 0.244186 & 0.406977 & 0.104651 \end{array} \right ). $$Formalising what we mean here, we have the following.
The connection follows because, when computing $P \cdot P^k$, each row is obtained by a calculation of the same form as $\pi P$.
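This limiting behaviour can be sketched numerically: for a well-behaved chain, a large power of $P$ has (approximately) identical rows, and that common row is stationary. The matrix below is a hypothetical irreducible, aperiodic chain, not the example from these notes.

```python
import numpy as np

# Hypothetical irreducible, aperiodic 4-state chain (state 1 has a
# self-loop, and every state can reach every other).
P = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.3, 0.3, 0.4],
    [0.7, 0.3, 0.0, 0.0],
])

# For large k every row of P^k is (approximately) the same vector pi.
Pk = np.linalg.matrix_power(P, 100)
pi = Pk[0]

# pi is stationary: multiplying by P leaves it unchanged.
print(np.allclose(pi @ P, pi))  # True
```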
Algebraic view
For some Markov chain given by $P \in M_{N, N}(\mathbb{R})$, if you start at state $i$ with $1 \leq i \leq N$ then the probability distribution of where you are after one step is given by the $i$‘th row of $P$.
Equivalently, if $\mu_0 \in M_{1,N}(\mathbb{R})$ is the vector with a $1$ in the $i$‘th column and $0$’s in all other entries, then the probability distribution after one step is given by
$$ \mu_0 P =: \mu_1 $$with the probability distribution of it after $k$ steps being
$$ \mu_0P^k = \mu_{k-1}P =: \mu_k. $$However, we can pick any probability distribution over the $N$ states for $\mu_0$. A stationary distribution is then simply one that does not change over time.
From the definition, any stationary distribution is simply a left eigenvector of $P$ with eigenvalue $1$, that is, a row vector $\pi$ satisfying $\pi P = \pi$.
When does a Markov chain not converge to a stationary distribution?
If the directed graph of states is bipartite then notice that, starting on one side, at every even step we are on that side and at every odd step we are on the other side. This is the classic example where the distribution never settles down. This is summarised by the following definitions.
We can show that periodic Markov chains do not converge to a stationary distribution.
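The smallest example is the two-state chain that always swaps states. Note that the uniform vector $(1/2, 1/2)$ does satisfy $\pi P = \pi$, but starting from a single state the distribution oscillates forever, so $P^k$ has no limit:

```python
import numpy as np

# The simplest periodic chain: two states that always swap.
# Its state graph is bipartite, so P^k oscillates and never converges.
P = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
])

P_even = np.linalg.matrix_power(P, 2)  # the identity matrix
P_odd = np.linalg.matrix_power(P, 3)   # back to P itself

print(np.array_equal(P_even, np.eye(2)), np.array_equal(P_odd, P))
```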
When does a Markov chain not have a unique stationary distribution?
Suppose we have a Markov chain whose strongly-connected-component graph has multiple sink components. Each sink component gives a stationary distribution supported on its vertices. Any weighted sum of these that still sums to a probability distribution is again a stationary distribution of the whole chain, so the stationary distribution is not unique.
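A minimal sketch of this non-uniqueness, using a hypothetical block-diagonal chain whose two blocks are separate sink components:

```python
import numpy as np

# Hypothetical block-diagonal chain: states {0,1} and {2,3} never
# communicate, so each block is its own strongly-connected sink.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.9],
])

pi_a = np.array([0.5, 0.5, 0.0, 0.0])  # stationary on the first sink
pi_b = np.array([0.0, 0.0, 0.5, 0.5])  # stationary on the second sink

# Any convex combination is also stationary, so there is no unique one.
pi_mix = 0.3 * pi_a + 0.7 * pi_b
print(np.allclose(pi_mix @ P, pi_mix))  # True
```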
Markov chain with a unique stationary distribution
Now that we know what to avoid, we define the following.
These Markov chains have the desired property.
They also have the observed limiting form.
What is the stationary distribution?
There is one class of Markov chain that has an easy stationary distribution to compute.
There is another class of Markov chain whose stationary distribution has an explicit form.
In general, though, there is no explicit formula.
Page Rank algorithm
There is an algorithm, invented in 1998, that was used to rate the importance of webpages.
The way it determines importance has a nice connection to Markov chains. It was used in search engines at the beginning of the popularisation of the internet.
For this, let’s start by defining the object we are working with.
We think of this graph in an extended Adjacency list form,
- $Out(x) = \{y \in V \vert (x,y) \in E\}$, and
- $In(x) = \{y \in V \vert (y,x) \in E\}$.
The problem is to define $\pi(x)$, the “rank” of a page, which will be a measure of its importance.
How to construct the page rank
We want a page rank that obeys some simple ideas:
- the value of a page should be determined by the pages linking to it,
- the value of a link should be inversely proportional to the number of outbound links on the page making it, and
- the value of a link should be proportional to the value of the page making it.
The simplest formula obeying these rules would be
$$\pi(x) = \sum_{y \in In(x)} \frac{\pi(y)}{\vert Out(y) \vert}.$$However, this definition is recursive, so it is not obvious that a solution for $\pi$ exists. We define this as the page rank.
The suggestive notation $\pi$ and the idea of a limit should hint that we will construct this via the stationary distribution of a Markov chain, and in that world we already know when stationary distributions exist.
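As a sketch, the recursive formula can simply be iterated on a tiny hypothetical 3-page graph (chosen so every page has an outbound link and the underlying chain is irreducible and aperiodic):

```python
# Hypothetical 3-page web: page 1 links to 2 and 3, page 2 to 3,
# and page 3 back to 1. in_links is the reverse of out_links.
out_links = {1: [2, 3], 2: [3], 3: [1]}
in_links = {1: [3], 2: [1], 3: [1, 2]}

# Start from the uniform distribution and iterate the recursive formula.
pi = {page: 1 / 3 for page in out_links}
for _ in range(100):
    pi = {
        x: sum(pi[y] / len(out_links[y]) for y in in_links[x])
        for x in out_links
    }

print(pi)  # converges to roughly {1: 0.4, 2: 0.2, 3: 0.4}
```

Solving the three equations by hand gives $\pi = (0.4, 0.2, 0.4)$ exactly, matching the iteration.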
Random walk
Suppose we play a game on the Webgraph where we start at one page, then follow one of the outbound hyperlinks on that page chosen uniformly at random. This defines a Markov chain on the Webgraph where
$$p_{y,x} = \begin{cases} \frac{1}{\vert Out(y) \vert} & \mbox{if } (y,x) \in E\\ 0 & \mbox{otherwise} \end{cases}.$$Then consider its stationary distribution from the perspective of
$$\pi = \pi P$$then for a particular coordinate $x$ on the right-hand side we have
$$\pi(x) = \sum_{y \in In(x)} \pi(y)p_{y,x} = \sum_{y \in In(x)} \frac{\pi(y)}{\vert Out(y) \vert},$$giving us exactly the page rank we defined above.
Technical tweaks
The real web graph may have horrible irregularities, giving periodic states or multiple strongly connected components. Therefore we add the possibility of randomly jumping from the current page to one picked uniformly at random from all the pages on the web. We do this with probability $1 - \alpha$ for some $\alpha \in (0,1] \subset \mathbb{R}$.
Therefore we get the following Markov chain
$$p_{y,x} = \begin{cases} \frac{\alpha}{\vert Out(y) \vert} + \frac{1 - \alpha}{\vert V \vert} & \mbox{if } (y,x) \in E\\ \frac{1 - \alpha}{\vert V \vert} & \mbox{otherwise} \end{cases}.$$There is one last problem with this definition: what do we do when a page has no outgoing links? There are a couple of approaches taken:
- Give the node a self-loop with all its weight,
- Remove such nodes (recursively), or
- Set $\alpha = 0$ for these nodes (i.e. always teleport).

The first solution gives these pages a much higher page rank. The second solution leaves us with pages that have no page rank. The third solution is practically what happens within the Page Rank algorithm.
This Markov chain is strongly connected, as every vertex connects to every other, so it is irreducible. Every vertex has a self-loop, so every state is aperiodic, making the Markov chain aperiodic.
This tweaked Markov chain is an Ergodic Markov chain, so it has a unique stationary distribution. We use this to find the page rank.
How to compute $\pi$
This is the Page rank algorithm.
Computing $\pi$ is hard, as $N$, and therefore $P$, is huge. You will need to compute
$$\mu_0 P^t$$for some large $t$. There are some tricks that help:
- if $\alpha$ is sufficiently small, then $t$ doesn’t have to be too large,
- if we use an old approximation of $\pi$ for $\mu_0$ it will mix faster, and
- each multiplication by $P$ doesn’t need to take $O(N^2)$ time; by exploiting the sparsity of the graph we can do it in $O(\vert E \vert)$.
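The last trick follows because the dense matrix never needs to be built: the teleport part contributes $(1-\alpha)/N$ to every coordinate, and the link part touches each edge once. A sketch of one $O(\vert E \vert)$ power-iteration step on a hypothetical small web with no dangling pages:

```python
# Hypothetical small web; every page has at least one outbound link.
alpha = 0.85
out_links = {0: [1, 2], 1: [2], 2: [0, 1]}
N = len(out_links)


def step(mu):
    """One step mu -> mu P, touching each edge exactly once."""
    # Teleport part: every page receives (1 - alpha) / N of the mass.
    total = sum(mu.values())
    new = {x: (1 - alpha) / N * total for x in out_links}
    # Link part: each page splits alpha of its mass over its out-links.
    for y, targets in out_links.items():
        share = alpha * mu[y] / len(targets)
        for x in targets:
            new[x] += share
    return new


mu = {x: 1 / N for x in out_links}
for _ in range(100):
    mu = step(mu)

print(mu)  # approximate page rank; the values sum to 1
```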
What value to set $\alpha$
If $\alpha = 0$ then we have a Symmetric Markov chain whose only stationary distribution is uniform. So $\alpha$ in some sense reflects how closely the tweaked Markov chain follows the original idea of the page rank.
The trade-off for increasing $\alpha$ is slower convergence to the stationary distribution. Google is believed to use $\alpha = 0.85$.