In the last article, we talked about how with the correct mathematical definition, we can define a circle, something that usually lives in two dimensions, in any dimension we want. Hopefully, that exercise has helped make it a bit clearer what we mean when we talk about high-dimensional data.
Dimensionality reduction is a technique in which one can put in high-dimensional data (like the example in the last article about films) and run it through an algorithm and try to figure out which dimensions are the most important for understanding similarities and differences between these data points. And these important dimensions need not be the ones in the original data. For example, it might be that separately, the rating and the genre aren’t that important for understanding the similarities between films, but the interaction between rating and genre is very helpful.
There are some clever algorithms out there to do just this. We will compare the results of three dimensionality reduction techniques: PCA (Principle Component Analysis), t-SNE (t-distributed Stochastic Neighbor Embedding), and UMAP ([Uniform Manifold Approximation and Projectio).
We can start with a simple example. Before we try to look at extremely high dimensional data in lower dimensions, let’s see what these algorithms can do when they try to project some three dimensional data into two dimensions.
We will start with a picture that has some obvious three-dimensional characteristics, the so-called ‘’swiss roll’’-dataset:
This is what happens when we ask PCA, t-SNE and UMAP to show us this dataset in only two dimensions, preserving as much of the properties of the 3-dimensional set as possible:
As we can see, PCA does alright, but isn’t able to ‘’unroll’’ the swiss roll, t-SNE does a bit of unrolling but doesn’t manage to fully flatten out the roll, and UMAP seems to do a pretty decent job!
Now, let’s consider a more complex example — a set of pictures. Pictures are an example of high-dimensional data because for an image in black and white, if we represent white as 0 and black as 1 (and shades of gray in between), then we could represent a 28 x 28-pixel image as a vector of length 28 x 28=784, where each entry in the vector is a value between 0 and 1. Then, hypothetically, if we wanted to see which images were similar to each other, we could plot them in a 784-dimensional space and see which ones are close to each other. But, as you might have imagined, that’s a bit tricky. Instead, let’s use dimensionality reduction!
For the pictures in question, we will use a famous image set: the MNIST dataset, which is a collection of 28 x 28-pixel images of hand-drawn numbers from 0–9. We will use dimensionality reduction as a tool for clustering and see if we can get the different digits to be grouped together. We feed the algorithms 15000 images and let them try to reduce the data to three dimensions, this is what we get: (Note — the algorithms did not see the labels of the data. Those were added afterwards in order to plot with a different color for each digit).
Here, we can see that t-SNE and UMAP both do a decent job of separating the different digits from one another, though there is a little bit of confusion between ones and sevens (as those can be written in different ways). We also see that the groupings are (especially in UMAP) somewhat related to features such as ‘’a straight line at the bottom’’, i.e. 4,9 and 7 and ‘’curvy parts’’ i.e. 2, 3, 5, 6, 8. None of these are perfect, but the amount of information that is retained even after getting rid of 781 dimensions is quite astounding! These tools and other similar ones can be incredible helpful when looking for similarities in high-dimensional data.
This is the second part of our Dimensionality Reduction series. You can find part 1 here.
Author: Emily Searle-White, ZDF Digital
 We need to set an order for our vector, so say we list the pixel values going from top to bottom and left to right. That way, we know which value corresponds to which spot in the picture.
 We see some 8s are grouped with the 1s, but when you imagine writing a very skinny 8, it is easy to see how this clustering could occur!