word2vec, doc2vec and nlp

So the idea of representing a document as a vector in a high dimensional space has apparently been around for a long time.

word2vec

There is a famous word2vec function created at Google using a shallow neural network that they trained on large amounts of Google News data. They wrote an article about it:

“Efficient Estimation of Word Representations in Vector Space”, Mikolov et al. (2013)

This apparently made quite a splash in the nlp community, and that 2013 paper has accumulated thousands of citations. In fact, they presented two different algorithms for generating their word2vec function: the continuous bag-of-words (CBOW) model and the skip-gram model. The vectors the word2vec function produces depend on the corpus used to train it.
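
To make the two variants a bit more concrete, here is a minimal training sketch using the gensim library (assuming gensim 4.x); the toy corpus and parameter values below are placeholders only, and real training needs far more text:

```python
# A minimal sketch of training word2vec with gensim (assumes gensim 4.x).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# sg=0 selects the continuous bag-of-words (CBOW) algorithm, sg=1 selects skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vec = model.wv["king"]          # the learned vector for "king"
print(vec.shape)                # (50,)
print(model.wv.most_similar("king", topn=3))
```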

Semantic textual similarity

Semantic textual similarity (STS) is the nlp task of measuring how close two sentences or documents are in meaning. One simple approach to STS is to average the word2vec vectors of the words in each document and then compare the two averaged vectors, typically by cosine similarity. Obviously, this won’t be anywhere near perfect; for one thing, it is completely independent of the order of the words.
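
Here is a sketch of that averaging approach; the `word_vectors` dictionary below is a made-up stand-in for a real pretrained word2vec lookup (for instance a gensim KeyedVectors object):

```python
# Bag-of-word-vectors STS sketch: average the word vectors, compare by cosine.
import numpy as np

word_vectors = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "barks": np.array([0.7, 0.3, 0.1]),
    "cat":   np.array([0.8, 0.2, 0.1]),
    "meows": np.array([0.6, 0.4, 0.2]),
}

def doc_vector(words):
    # Average the vectors of the words we actually have embeddings for.
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = doc_vector(["the", "dog", "barks"])
b = doc_vector(["the", "cat", "meows"])
print(cosine(a, b))  # similarity score in [-1, 1]
```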

The oddity of large dimensional spaces

It is quite possible that the meanings of many simple texts could each be represented by a vector in a large dimensional space for the mere reason that there is just so much ROOM in a large dimensional space. Take the space of 200 dimensional vectors whose components are between zero and one. It is a 200 dimensional cube. This cube has 2^200 different corner points. More tellingly, in the Euclidean metric, practically any two points you choose at random are very close to being $\sqrt{200\alpha}$ away from each other, where $\alpha$ is the expected value of $(x-y)^2$ for $x,y$ independent random variables uniformly distributed on $[0,1]$. Put differently, almost all points in this strange space are almost the same Euclidean distance from each other.
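
For what it’s worth, $\alpha$ can be computed explicitly for the uniform distribution above, so the typical distance has a concrete value:

$$
\alpha = E\big[(x-y)^2\big] = E[x^2] - 2\,E[x]E[y] + E[y^2] = \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{3} = \tfrac{1}{6},
\qquad
\sqrt{200\alpha} = \sqrt{200/6} \approx 5.77 .
$$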

Also, if you draw vectors from the origin to any two random points chosen from this space, those two vectors are essentially always at the same angle to each other. Put differently, almost all pairs of points in this space are at nearly the same cosine distance from each other.
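
Both concentration claims are easy to check numerically; the following is just a simulation sketch with numpy, with the dimension and sample count chosen arbitrarily:

```python
# Simulation: random points in the 200-dimensional unit cube sit at nearly
# the same Euclidean distance and nearly the same angle from each other.
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 200, 10_000

x = rng.uniform(0.0, 1.0, size=(n_pairs, dim))
y = rng.uniform(0.0, 1.0, size=(n_pairs, dim))

dists = np.linalg.norm(x - y, axis=1)
cosines = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))

print("Euclidean distance: mean %.3f, std %.3f" % (dists.mean(), dists.std()))
print("Cosine similarity:  mean %.3f, std %.3f" % (cosines.mean(), cosines.std()))
# The standard deviations are small compared to the means: almost every pair
# of random points sits at about the same distance and the same angle.
```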

The fact that most points are nearly the same distance (either Euclidean or cosine) from each other means that when we find exceptions, those exceptions can be expected to be meaningful.

Linearity of word2vec

One of the most notable features of word2vec is linearity. A classical example is that if one takes words and replaces them with their vectorizations, then the nearest word vector to ‘king’ – ‘man’ + ‘woman’ is the word vector for ‘queen’. Thus not only does word2vec give a vector signature to a word, but simple vector arithmetic behaves somewhat like combining or removing notions.
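
One can check this analogy with gensim’s pretrained Google News vectors; the model name below is the one used by gensim’s downloader, and the first call fetches a large file:

```python
# Sketch: the king - man + woman ≈ queen analogy with pretrained vectors.
# Assumes the gensim library and its downloadable Google News word2vec model.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # large one-off download on first use

# most_similar adds the `positive` vectors, subtracts the `negative` ones,
# and returns the nearest word vectors to the result.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expected: 'queen' at or near the top of the list.
```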

doc2vec

Google later announced doc2vec, a vectorization method for whole documents rather than words. Apparently researchers had a difficult time replicating Google’s claims about the algorithm; there was a paper written specifically to address these problems:

“An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation”, Lau and Baldwin (2016)

Taking a quick look at this paper, the main problem seems to be that a very large training corpus is required for the doc2vec function to give good results.
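
For reference, here is a minimal gensim sketch of training a doc2vec model and inferring a vector for a new document; the toy corpus and parameters are placeholders only:

```python
# A minimal doc2vec sketch with gensim (assumes gensim 4.x).
# Real use needs a large corpus, as the Lau and Baldwin paper points out.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

# Infer a vector for an unseen document.
new_vec = model.infer_vector(["the", "king", "and", "the", "queen"])
print(new_vec.shape)                      # (50,)
print(model.dv.most_similar([new_vec]))   # nearest training documents
```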

Latent semantic analysis

Latent semantic analysis is another standard nlp method. It is based on singular value decomposition. A matrix is formed from a large corpus where each column represents a paragraph of the corpus and each row is a word, with the entries of the matrix being the word counts in the paragraph for that column. Document sizes other than paragraphs can be used as well.

The entries are actually not usually straight word counts. They are tf-idf weights, an acronym for term frequency – inverse document frequency: the count of the word in the document (the term frequency), scaled by a factor that shrinks as the word appears in more of the corpus’s documents (the inverse document frequency, commonly $\log(N/n_w)$, where $N$ is the number of documents in the corpus and $n_w$ is the number of documents containing the word). The effect is to down-weight very common words that say little about any particular document.
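
As a tiny worked example of the weighting (using the common $\log(N/n_w)$ form of idf; other variants exist):

```python
# Toy tf-idf computation by hand, using the log(N / n_w) form of idf.
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)                     # term frequency in this document
    n_w = sum(1 for d in docs if word in d)  # number of documents containing the word
    idf = math.log(N / n_w)                  # inverse document frequency
    return tf * idf

print(tf_idf("the", docs[0]))   # 0.0   -- appears in every document, weight vanishes
print(tf_idf("cat", docs[0]))   # ~1.10 -- rarer word, higher weight
```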

The singular value decomposition of the matrix can be interpreted in much the same way as a principal component analysis of the matrix. One obtains a reduction of dimension made up of weighted combinations of words which are supposed to capture underlying ideas.
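
Here is how that looks in practice with scikit-learn, as a sketch only (TruncatedSVD is a common way to do the SVD step on a sparse tf-idf matrix; the corpus and number of components are arbitrary):

```python
# LSA sketch: tf-idf term-document matrix followed by a truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
    "the market rallied after the news",
]

# Rows of X are documents, columns are words (the transpose of the
# word-by-paragraph layout described above; the convention doesn't matter).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Keep only the top singular directions; each one is a weighted
# combination of words, interpreted as a latent "topic".
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)

print(doc_topics)            # each document as a 2-dimensional "topic" vector
print(svd.singular_values_)  # the scales of the latent factors
```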

My experience with PCA is that it is magic, but it is imperfect magic. Essentially it works by assuming that different factors will have effects at different scales. But the scale of a factor’s effect is just a number, and so the sizes of the effects of the different factors are really just a bunch of numbers, and a lot of them are apt to be somewhat clumped together. That means that rather than having cleanly distinguished latent factors, one is more likely to get a sliding scale of blurred-together factors, where each given factor is blurred together with its neighbors, without a good way to untangle them.
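
One way to see the clumping is to look at the singular values of a random count-like matrix; in this illustration (synthetic data only, not a real corpus) they decay smoothly with no obvious gap separating “real” factors from the rest:

```python
# Illustration: singular values of a random count-like matrix decay smoothly,
# with no clean gap that would separate distinct latent factors.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(500, 200))   # synthetic word-count matrix

singular_values = np.linalg.svd(counts, compute_uv=False)
print(np.round(singular_values[:10], 1))
# Apart from the first value (driven by overall word frequency), the rest
# form a gradual slope rather than a few well-separated scales.
```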

I expect one sees the same thing with latent semantic analysis because it is based on singular value decomposition.

Latent Dirichlet Allocation

Latent Dirichlet allocation is an improvement on LSA (latent semantic analysis) in which each document is modeled as a mixture of topics, with the topic proportions drawn from a Dirichlet distribution. I haven’t looked into how this algorithm plays out in detail.
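
For completeness, here is a minimal scikit-learn sketch of fitting LDA to a word-count matrix; the corpus and parameters are arbitrary, and this is not something explored further here:

```python
# Minimal LDA sketch with scikit-learn: fit topics to a word-count matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
    "the market rallied after the news",
]

counts = CountVectorizer().fit_transform(corpus)   # LDA works on raw counts, not tf-idf

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)              # per-document topic proportions

print(doc_topic.round(2))   # each row sums to 1: a mixture of the two topics
```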
