Exploring Word2Vec in Clojure with DL4J
We’re going to explore Word2Vec by walking through the Deep Learning for Java (DL4J) Word2Vec documentation using Clojure.
I’m going to assume you’re comfortable working in Clojure and that you’ve perhaps heard of Word2Vec, but haven’t delved into how it works.
Introduction to Word2Vec
Word2Vec is a method of turning text into a mathematical representation that neural networks can easily understand. As our source text puts it:
Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.
This numerical representation of words is achieved by looking at the context in which a word appears in the source corpus. With enough examples, the algorithm can develop a highly accurate idea about which words tend to be used near each other. It does all this through a “dumb” mechanical process, without human intervention.
This proximity of words to each other is represented numerically in words’ feature vectors. By treating each word vector as a point in high-dimensional space, the co-occurrence of words can be encoded as distance along those many dimensions.
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vectorspace. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words.
This approach turns out to be unreasonably effective at establishing the sense of a word and its relations to other words. These numerical representations can be surprisingly insightful when based on sufficiently large source corpora. Word2Vec’s vectors of numbers used to represent words are called neural word embeddings.
A well trained set of word vectors will place similar words close to each other in that space. The words oak, elm and birch might cluster in one corner, while war, conflict and strife huddle together in another.
Similar things and ideas are shown to be “close” in this high-dimensional space, translating their relative meanings into measurable distances. Qualities become quantities so algorithms can do their work. But similarity is just the basis of many associations that Word2vec can learn.
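To make “similarity as distance” concrete before DL4J enters the picture, here is cosine similarity computed by hand over two made-up three-dimensional “word” vectors. This is only a sketch of the underlying idea; real models use hundreds of dimensions, and DL4J does this arithmetic for us.
;; Cosine similarity: how closely do two vectors point in the same direction?
(defn cosine-similarity [a b]
  (let [dot  (reduce + (map * a b))
        norm (fn [v] (Math/sqrt (reduce + (map * v v))))]
    (/ dot (* (norm a) (norm b)))))
(cosine-similarity [1.0 2.0 0.5] [0.9 2.1 0.4])
;; => roughly 0.997 -- these two toy "words" point in nearly the same direction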
For example, a trained Word2Vec model can gauge relations between words of one language, and map them to another. This diagram shows (in two dimensions, reduced from hundreds) the relative positions of various words in vector spaces representing English and Spanish:
Let’s interactively explore how linguistic wonders like these are achieved. We’ll walk through turning some corpora into Word2Vec models and then see what we can do with them. Our process will be:
- set up Word2Vec
- build a toy Word2Vec model
- evaluate our model
- visualize the model
- import some non-toy models
- explore relationships using word vectors
Word2Vec Setup
We will work directly with DL4J’s Java classes from Clojure. It’s true that we could avoid Java interop by using a wrapper library like dl4clj-nlp, dl4clj, or jutsu.ai, but the concepts and API of Word2Vec are a fair amount to cover already. Putting a layer of abstraction between us and the material introduces extra concerns that aren’t helpful in this context. Simple interop with its host is one of Clojure’s central design features, so let’s default to that unless there’s a significant gain to be made with a wrapper library.
You’ll need a slew of Java classes and a couple of Clojure libs. They’re listed below, but if you plan to follow along with the code, I recommend working from the sample repo.
(ns word2vec-exploration.core ;; use whatever namespace name your project defines
  (:require [clojure.java.io :as io]
            [clojure.string :as string])
  (:import [org.deeplearning4j.text.sentenceiterator BasicLineIterator]
           [org.deeplearning4j.models.word2vec Word2Vec$Builder]
           [org.deeplearning4j.text.tokenization.tokenizer.preprocessor CommonPreprocessor]
           [org.deeplearning4j.text.tokenization.tokenizerfactory DefaultTokenizerFactory]
           [org.deeplearning4j.models.embeddings.loader WordVectorSerializer]
           [org.nd4j.linalg.factory Nd4j]
           [org.nd4j.linalg.api.buffer DataBuffer$Type]
           [org.datavec.api.util ClassPathResource]
           [org.deeplearning4j.plot BarnesHutTsne$Builder]))
You may notice a significant difference between this document and the Java source it is based on: the lack of log statements. That is quite intentional. This exploration is intended to be read in one of two ways:
- on the web, trusting my report of what evaluates to what, or
- in an editor-integrated REPL where you evaluate statements yourself
One benefit of option 2 over running the Java example from afar is that results appear in your REPL/editor interactively. This hands-on approach favored by lisps means there is no need to pepper the code with log statements. For more on this approach to workflow, read Arne Brasseur’s The Bare Minimum, or Making Mayonnaise with Clojure and this community discussion on Clojure editor integration.
Building the Model
With dependencies taken care of, we can start creating a Word2Vec model to play with. We’ll build our first model over the raw_sentences.txt corpus, provided by DL4J as a toy dataset. Make sure to peek around that file to get a sense for what our model will be based on. Here’s a snippet:
All three times were the best in the state this season . She ’s here , too . Well , you are out . He did not say what they were . Is now a good time ? That was only two years ago , he said . He has nt . Where did the years go ? And that ’s just what he said this time . How did you do that ? You can go on and on . I did nt want to take it . Most of the music ’s just not very good .
Notice that each sentence is on its own line and that some “words” were created by an earlier preprocessing step that split contractions. Be aware as well that the small size of this dataset restricts the relationships that word2vec can detect between words, thus limiting the richness of the resulting word vectors. That’s okay. This is just an exercise to see how the process works, so we don’t need to spend hours of compute-time creating a highly accurate model.
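If you’re following along in the REPL, it’s worth getting a feel for just how small this corpus is. Here’s a quick count of its sentences and tokens (I’ll leave the exact numbers for you to evaluate yourself):
(let [sentences (string/split-lines (slurp "resources/raw_sentences.txt"))]
  {:sentences (count sentences)
   :tokens    (count (mapcat #(string/split % #"\s+") sentences))})
;; => a small map of counts; both numbers are tiny by word2vec standards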
Set aside our dataset’s limitations and consider how we will build the model. We want an instance of DL4J’s Word2Vec class, which uses the Builder pattern instead of direct constructors. This particular design pattern helps avoid unnecessary complexity in scenarios (like ours) where the class can be created with a wide variety of configuration options. Here’s how it looks:
;; Build word2vec model
(def model
(-> (Word2Vec$Builder.)
(.minWordFrequency 5)
(.iterations 1)
(.layerSize 100)
(.seed 42)
(.windowSize 5)
(.iterate (BasicLineIterator. "resources/raw_sentences.txt"))
(.tokenizerFactory (doto (DefaultTokenizerFactory.)
(.setTokenPreProcessor (CommonPreprocessor.))))
(.build)))
In the code above, we define our model
(called “vec” in the original Java) using Word2Vec’s Builder subclass. The Builder process:
- encapsulates the model’s hyperparameters
- declares its input corpus and how to interpret it
- returns a Word2Vec object
Let’s examine each of those steps in turn.
1. Encapsulating hyperparameters
After creating the Builder instance itself, the first five lines that modify it encapsulate the model’s hyperparameters:

minWordFrequency is how many times a word must appear in the corpus to be learned by the model. This is because learning useful associations between words requires them to appear in multiple contexts. It’s reasonable to use larger minimum values when working with very large corpora.

layerSize is the number of features we give our word vector, which determines the number of dimensions of the vector space our words will be placed in. Our words will be represented with a layer size of 100, meaning 100 features, giving us points placed in 100-dimensional space. Why 100 and not 5 or 5000? The field doesn’t have a good explanatory theory yet, because like most hyperparameters this value was arrived at empirically. Practitioners have found through trial and error that we need more than a few dozen dimensions to get the benefits of high-dimensionality, but we have to balance that against the computational cost of larger vectors, and there are diminishing returns in performance after a few hundred.

windowSize defines the width, in words, of our view as we search through the corpus looking for words used near each other. A window size of 5 means we look at one center word, the two words to its left, and the two words to its right. So for the sentence “That was only two years ago he said”, the first window would be ["that" "was" "only" "two" "years"], with “only” being the center word. After the model uses that context to improve the vectors for “only”, it moves to the second window, now centered on “two”: ["was" "only" "two" "years" "ago"].
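If it helps to see those window mechanics spelled out, here is a rough sketch of the sliding window in plain Clojure. It is illustrative only; DL4J’s tokenizer and training loop handle the real windowing, including sentence boundaries and preprocessing.
(->> (string/split "that was only two years ago he said" #" ")
     (partition 5 1))
;; => (("that" "was" "only" "two" "years")
;;     ("was" "only" "two" "years" "ago")
;;     ("only" "two" "years" "ago" "he")
;;     ("two" "years" "ago" "he" "said"))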
For the rest, see the Word2Vec$Builder javadocs.
2. Defining input and interpretation
The last part of configuring our model is to give it input and define how that input should be interpreted. This means defining what corpus we will .iterate over, which we do in the line nominating the file raw_sentences.txt. Then we tell it how to read that file by defining a tokenizerFactory.
If you look at the original Java you will notice that the BasicLineIterator is named iter and the tokenizerFactory is called t. Each of these names is used precisely once. The names, too, are not terribly descriptive. Functional programming wisdom suggests avoiding the mental effort of remembering one-time-use variables. Therefore instead of robotically creating Clojure vars to match the Java, we inline the line iterator and tokenizer factory. (This is a good example of the point-free style, in which unnecessary names are elided.)
3. Returning a Word2Vec object
The final .build of our thread-first macro returns the configured Word2Vec model. Now all we have to do with the returned Word2Vec object is run the fitting process: (.fit model).
This gives us a word2vec model built over our toy dataset. Next, let’s see what we can do with this fitted model.
Evaluating the model
OK, we built a Word2Vec model. But...how do we know that it is fitted correctly? After all, each word is represented by a 100-dimensional vector of real numbers:
(.getWordVectorMatrix model "day")
;; #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x59073fb5 "[0.41, 0.21, 0.15, -0.21, -0.04, -0.40, -0.12, -0.10, -0.32, 0.35, 0.21, 0.28, 0.12, -0.07, 0.05, -0.07, -0.20, 0.21, 0.14, -0.15, 0.07, 0.20, 0.42, -0.23, 0.10, -0.40, 0.11, -0.42, -0.19, -0.11, 0.29, -0.00, 0.46, -0.51, 0.14, -0.23, 0.08, -0.21, -0.07, 0.10, -0.31, -0.19, 0.11, 0.21, -0.07, -0.12, -0.47, -0.16, 0.16, -0.14, 0.28, 0.04, 0.24, -0.14, -0.35, 0.09, -0.24, -0.07, 0.16, -0.46, -0.28, -0.01, 0.15, 0.43, 0.16, 0.04, 0.04, 0.19, -0.25, -0.35, 0.24, -0.06, 0.18, -0.01, -0.03, 0.10, 0.06, 0.11, 0.13, 0.04, -0.03, -0.19, -0.45, 0.12, -0.00, 0.04, 0.17, -0.34, -0.03, -0.18, -0.11, 0.01, 0.15, -0.06, -0.19, 0.25, 0.01, 0.28, -0.32, -0.11]"]
It’s not feasible for us humans to manually check those hundred dimensions. For one, it would be a massive amount of hand computation. But more importantly, the vectors have no inherent meaning. They only represent meaning in their relation to other words’ 100-dimensional vectors. This is a central challenge of working with Word2Vec or neural nets in general. Compared to human-sized data structures containing words or obvious data about words, Word2Vec models and their operations are opaque.
But we can interact with our model. It’s just that we must inspect our model by asking it questions about the words it contains. For instance, we can ask our Word2Vec model, “which 10 words are closest to day?”
(.wordsNearest model "day" 10)
;; => ["night" "week" "year" "game" "season" "group" "time" "office" "-" "director"]
...or, what is the cosine similarity of day and night? (As the DL4J docs put it, “the closer it is to 1, the more similar the net perceives those words to be.”)
(.similarity model "day" "night")
;; => 0.7328975796699524
By spot-checking individual words, we are able to “eyeball whether the net has clustered semantically similar words”. This is inexact verification, but that inexactness is an inherent trade-off of the power Word2Vec provides.
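Since we’ll be doing a fair amount of this kind of eyeballing, a tiny convenience function (my own addition, not part of the DL4J API) makes it easy to spot-check several probe words at once:
(defn spot-check
  "Map each probe word to its `n` nearest neighbors in `model`."
  [model probe-words n]
  (into {} (map (juxt identity #(vec (.wordsNearest model % n))) probe-words)))
(spot-check model ["day" "house" "war"] 3)
;; => a map from each probe word to its three nearest neighbors
;;    (exact neighbors will vary from one fitting run to the next)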
We can also inspect our model by doing some basic “word arithmetic”. The classic example is “king - queen = man - woman”. Unfortunately our toy dataset doesn’t include the words “king” or “queen” and so knows nothing about this relationship:
(.wordsNearest model "king" 1)
;; => []
(.wordsNearest model "queen" 1)
;; => []
We’ll return to this example later with a dataset that has more to say on the matter.
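We can confirm that vocabulary gap directly, since DL4J’s word vector models expose a hasWord method:
(.hasWord model "king")
;; => false
(.hasWord model "day")
;; => true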
For now we must make do with some of our own equations on this baby data set. What words are near “house”?
(.wordsNearest model "house" 5)
;; => ["office" "company" "country" "family" "center"]
Sounds reasonable. What if we remove the business side of the word, by subtracting “office”? We can find out, because the wordsNearest
method can also take a list of positive words, a list of negative words, and a number of results to return:
(.wordsNearest model ["house"] ["office"] 3)
;; => ["family" "part" "life"]
House minus office equals family. Not bad! But don’t read too deeply into these results---they’re based on quite a limited sample corpus. We’ll do some more interesting word math in a future section.
Saving the model
To make it easier for us to come back to this work at another time, we’ll save our fitted model to a file. Then we’ll be able to use this model without re-running our build and fit process. That wouldn’t just waste CPU cycles---since the fitting process isn’t fully deterministic, it would also ruin reproducibility.
To save the model, we invoke the writeWord2VecModel
method to serialize it as a ZIP file to the path we nominate. (Ignore the original Java’s use of the deprecated writeWordVectors method.)
(WordVectorSerializer/writeWord2VecModel model "serialized-word2vec-model.zip")
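If you’re curious (or paranoid), you can immediately read the file back and spot-check a word to confirm the round trip worked. We’ll reload the model properly in a later section; this is just a sanity check.
(-> (WordVectorSerializer/readWord2VecModel "serialized-word2vec-model.zip")
    (.wordsNearest "day" 3))
;; => neighbors of "day" from the reloaded model, which should match the
;;    in-memory model we just saved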
Visualizing the model
The next step in evaluating a model is to visualize it. But our word vectors live in 100-dimensional space. Personally, I live and work in 3-dimensional space, and even with tricks like using color or time to represent another handful of dimensions, a hundred is far too many to plot. Whenever you visualize high-dimensional information, you need some way to reduce the number of dimensions in the data to the handful that we humans can comprehend. One dimensionality reduction technique you may have heard of is principal component analysis (PCA). We’ll use a similar dimensionality reduction technique called TSNE. Let’s let one of its inventors, Laurens van der Maaten, explain:
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets.
DL4J comes with a Barnes-Hut-based t-distributed stochastic neighbor embedding implementation. To use it, we first tell ND4J to back its n-dimensional arrays with doubles. (The default, which we relied on above, is floats.)
(Nd4j/setDataType DataBuffer$Type/DOUBLE)
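You can confirm the switch took effect by asking ND4J for its current data type (the exact printed form depends on your ND4J version):
(Nd4j/dataType)
;; => DOUBLE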
Visualizing saved word vectors (words.txt)
The DL4J/ND4J example is written to visualize a dataset DL4J provides in "words.txt" instead of the model we saved above, so we’ll work with that first. (This dataset is the result of saving a Word2Vec model with the deprecated writeWordVectors method.)
We need to load in the predefined word vectors.
(def vectors
(WordVectorSerializer/loadTxt (.getFile (ClassPathResource. "words.txt"))))
Out of the word vectors we define weights and words in separate lists:
(def weights
(.getSyn0 (.getFirst vectors)))
(def words ;; aka `cache` or `cacheList` in the original Java
(map #(.getWord %) (.vocabWords (.getSecond vectors))))
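As a quick sanity check (my own addition, not in the original walkthrough), we can confirm that we have exactly one label per row of the weight matrix:
(= (count words) (.rows weights))
;; => true, if words.txt loaded as expected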
Then we build and fit a dual-tree TSNE model with the relevant DL4J class:
(def words-tsne
(-> (BarnesHutTsne$Builder.)
(.setMaxIter 100)
(.theta 0.5)
(.normalize false)
(.learningRate 500)
(.useAdaGrad false)
(.build)))
(.fit words-tsne weights)
All that’s left is to save them to a file:
(io/make-parents "target/tsne-standard-coords.csv")
(.saveAsFile words-tsne words "target/tsne-standard-coords.csv")
To visualize the saved TSNE, we reach out to gnuplot. Make sure you have it installed (brew install gnuplot may be applicable) and then call tsne.gnuplot from the command line in the project directory: gnuplot tsne.gnuplot.
This should produce a file called tsne-plot.svg in the root of the project directory. It should look something like this:
We can improve our visualization by extending it to three dimensions. This requires only adding a one-line configuration option, (.numDimension 3), right before we .build. Since this is our second time writing nearly the same code, and I can see into the future and know this isn’t our last TSNE tango, let’s write a helper function.
(defn build-tsne
([] (build-tsne 2))
([dims]
(-> (BarnesHutTsne$Builder.)
(.setMaxIter 100)
(.theta 0.5)
(.normalize false)
(.learningRate 500)
(.useAdaGrad false)
(.numDimension dims)
(.build))))
(def words-tsne-3d (build-tsne 3))
(.fit words-tsne-3d weights)
(.saveAsFile words-tsne-3d words "target/tsne-standard-coords-3d.csv")
Using that output with the appropriately named script should produce something like:
These visualizations are not tremendously robust, but they could be a starting point for more rigorous examination of the fitted model.
Visualizing our saved Word2Vec model
Let’s re-run those visualization steps over the Word2Vec model we saved earlier.
Since we saved our model using writeWord2VecModel rather than the deprecated writeWordVectors, we can’t use the same loadTxt method we did when reading "words.txt". But the rest of the steps are essentially the same:
(def w2v
  (WordVectorSerializer/readWord2VecModel "serialized-word2vec-model.zip" true))
(def w2v-weights
(.getSyn0 (.lookupTable w2v)))
(def w2v-words
(map str (.words (.vocab w2v))))
(def w2v-tsne (build-tsne))
(.fit w2v-tsne w2v-weights)
(io/make-parents "target/tsne-w2v.csv")
(.saveAsFile w2v-tsne w2v-words "target/tsne-w2v.csv")
;;;; now we do the same for a 3d plot
(def w2v-tsne-3d (build-tsne 3))
(.fit w2v-tsne-3d w2v-weights)
(.saveAsFile w2v-tsne-3d w2v-words "target/tsne-w2v-3d.csv")
That’s really it. Use gnuplot from the command line to turn those CSV files into plots, and you’re done.
Importing more comprehensive models
As mentioned above, no model built over the toy raw_sentences.txt corpus will be of much practical use. If you don’t believe me, look at how few words the model even contains:
w2v-words ;; we computed this earlier during visualization
=> ("been" "year" "about" "your" "without" "these" "companies" "music" "would" "because" "state" "they" "you" "$" "going" "old" "want" "night" "them" "then" "court" "an" "-" "each" "former" "as" "at" "left" "much" "be" "another" "two" "long" "how" "into" "see" "found" "same" "are" "does" "by" "national" "where" "after" "so" "a" "think" "set" "business" "though" "one" "i" "right" "team" "many" "people" "the" "such" "s" "police" "days" "to" "under" "did" "but" "through" "country" "had" "do" "good" "down" "white" "school" "has" "up" "five" "us" "those" "which" "last" "might" "this" "its" "she" "never" "take" "know" "little" "next" "some" "united" "for" "show" "back" "house" "we" "life" "states" "yesterday" "not" "street" "now" "end" "company" "just" "every" "over" "center" "was" "go" "war" "way" "home" "with" "what" "money" "there" "well" "time" "family" "he" "president" "play" "very" "big" "called" "ago" "american" "program" "during" "when" "three" "years" "put" "her" "children" "four" "officials" "season" "if" "case" "between" "still" "in" "made" "work" "director" "is" "come" "it" "being" "million" "even" "among" "john" "women" "other" "city" "against" "our" "out" "world" "government" "too" "get" "have" "federal" "man" "place" "may" "could" "more" "off" "first" "before" "use" "own" "several" "political" "used" "office" "while" "him" "second" "that" "high" "his" "than" "members" "me" "should" "only" "west" "few" "from" "day" "group" "all" "--" "new" "law" "like" "mr" "ms" "less" "my" "both" "most" "market" "were" "who" "since" "here" "no" "game" "week" "nt" "university" "part" "their" "best" "around" "percent" "can" "general" "times" "public" "and" "of" "today" "said" "department" "says" "make" "on" "or" "will" "say" "also" "any" "york" "until")
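If you’d rather have a number than a wall of words, just count them:
(count w2v-words)
;; => a couple hundred (the exact count depends on minWordFrequency and the corpus)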
Our toy model understands only a couple hundred words, and a good portion of those are useless filler like “at” or “the”. We need something more robust. Unfortunately, fitting a model over a sufficiently large corpus takes a good deal of compute time. Thankfully, the NLP community has a number of pre-trained models for us to use, open-source and robustly built.
Google News model
First up is a model released by Google which contains “300-dimensional vectors for 3 million words and phrases...trained on part of Google News dataset (about 100 billion words)” (source). What a resource! DL4J recommends it:
The Google News Corpus model we use to test the accuracy of our trained nets is hosted on S3. Users whose current hardware takes a long time to train on large corpora can simply download it to explore a Word2vec model without the prelude.
Let’s download that. Importing it is a one-liner, but beware: it may take a few minutes.
(def gnews-vec
(WordVectorSerializer/readWord2VecModel "/path/to/GoogleNews-vectors-negative300.bin.gz"))
Now that we have a more well-read model, let’s try to do some of that famous word2vec “word math”. To kick off, let’s try again with that classic “king - queen = man - woman”:
Throughout this document, when I introduce an analogy of the form a:b::c:d
on its own line, I’ll share the outcome that the DL4J folks put in their docs. Ours won’t match exactly, even using the same corpus, because the training process involves some randomness.
;; king:queen::man:[woman, Attempted abduction, teenager, girl]
(.wordsNearest gnews-vec
["queen" "man"] ; "positive" words
["king"] ; "negative" words
5)
;; => ["woman" "girl" "teenage_girl" "teenager" "lady"]
Notice the placement of “king”, “queen”, and “man”. This might take a moment to wrap your head around. For me, “subtracting” the word “king” and “adding” the words “queen” and “man” to reach the word “woman” was not immediately intuitive. It helps to stop thinking of individual words as inherently “positive” or “negative”. We’re not making the word “man” “positive” or “king” “negative”. Instead, focus on the relationships between the words, and do some algebra. Starting from our analogy, we want to find an arrangement of these words that fits the DL4J API:
king : queen :: man : woman
Since an analogy still holds when we invert both of its sides, we can swap each pair around:
queen : king :: woman : man
We can represent the analogy-relationships as difference, which under DL4J’s hood translates to cosine distance:
"queen - king = woman - man"
Finally, we move "man" to the left side of the equation to isolate the target word "woman":
"queen - king + man = woman"
...and we have our “positive” and “negative” words that define the analogy for DL4J.
Our results (["woman" "girl" "teenage_girl" "teenager" "lady"]) don’t include the somewhat wacky “Attempted abduction” that the DL4J folks got, but they make broad sense.
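Under the hood this really is vector arithmetic. As a sketch (not necessarily how DL4J implements wordsNearest internally), we can build the “queen - king + man” vector ourselves and ask for its neighbors:
(let [v      #(.getWordVectorMatrix gnews-vec %)
      target (-> (v "queen") (.sub (v "king")) (.add (v "man")))]
  (.wordsNearest gnews-vec target 5))
;; => nearby words; expect results resembling, but not exactly matching, the
;;    .wordsNearest call above. Note that this variant does not exclude the
;;    input words from the candidates.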
Now is a good time for a caveat about word vector analogies. When we ask for the wordsNearest to some word or words, the API will not return the words we gave it. So when we ask our Word2Vec model, “man is to doctor as woman is to ?”, we must be careful not to over-interpret results like nurse, since the word doctor was programmatically excluded even if it was closest in vector space (Nissim, van Noord, van der Goot 2019). In some scenarios this is what we want – the original word is often not an interesting result. Other times it absolutely is.
The broader point to remember is that fun as they are, word vector analogies are lossy and inexact: more a party trick than a rigorous analytic tool. Analogies are good for building an intuition about word vector models, but that intuition is inherently fuzzy.
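You can see the exclusion for yourself by running the query from that example. I won’t reproduce results here, both because they vary by model and because, per the caveat above, they’re easy to over-read:
(.wordsNearest gnews-vec
               ["doctor" "woman"] ; "positive" words
               ["man"]            ; "negative" words
               5)
;; whatever comes back, "doctor", "woman", and "man" themselves have been
;; excluded from the candidate set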
With that caveat in mind, let’s take a look at some more examples. Here’s one more with the Google News Corpus model.
;; China:Taiwan::Russia:[Ukraine, Moscow, Moldova, Armenia]
(.wordsNearest gnews-vec
["Taiwan" "Russia"]
["China"]
5)
;; => ["Ukraine" "Russian" "Moscow" "Moldova" "Armenia"]
I wonder whether the DL4J folks remove “Russian” automatically or manually.
In a related aside, the Google News vectors appear to be cluttered with synonyms and misspellings:
(.wordsNearest gnews-vec "United_States" 5)
;; => ["Unites_States" "Untied_States" "United_Sates" "U.S." "theUnited_States"]
Dealing with these through the existing DL4J API requires some data massage.
GloVe
We can also work with Global Vectors models. The GloVe authors kindly provide pre-trained models for our convenience:
(def glove-vectors
(WordVectorSerializer/loadTxtVectors (io/file "/path/to/glove.6B.50d.txt")))
What does GloVe have to say about kings and queens?
(.wordsNearest glove-vectors
["queen" "man"] ; "positive" words
["king"] ; "negative" words
5)
;; => ["woman" "girl" "her" "boy" "she"]
Interesting: similar, but not identical, to results from the Google News Corpus model. Let’s try some more comparisons.
Deeper explorations
Now that we have armed ourselves with both knowledge of the DL4J Word2Vec API and two well-trained models, we are ready to dive deep into the seas of word arithmetic.
Refactoring
Since we’re going to check a lot more analogies using wordsNearest, it will be helpful to make a more concise helper function. Consider again the analogy “king is to queen as man is to woman”, or king:queen::man:woman:
(.wordsNearest gnews-vec
["queen" "man"] ; "positive" words
["king"] ; "negative" words
5)
Looking at this helps us see the function we want. That we’re using wordsNearest is an implementation detail; what we’re really doing is asking for an analogy. That will be our function name. We can also see how it is parameterized: we’re giving that future function gnews-vec, "queen", "man", "king", and 5. That “queen” and “man” are grouped together, or that they and “king” are in vectors, is immaterial to our desired analogy function. This makes it clear how we would want to call this function:
(analogy gnews-vec "king" "queen" "man")
First comes the model or word vectors, and then we define the elements of our analogy in the order that we would say them. Except that we forgot to include a parameter for the number of results. We can expect to want to call it both with and without specifying the maximum number of results:
(analogy gnews-vec 5 "king" "queen" "man")
Factoring out those arguments into parameters, we have our function:
(defn analogy
"Word2Vec word analogies.
According to the given `model`, `a` is to `b` as `c` is to the `n` value(s) returned.
That is, a : b :: c : [return value]. Defaults to n<=5 results.
For example: king:queen::man:[woman, Attempted abduction, teenager, girl]
(NB: Results may vary across fitting runs, even using the same source corpus.)"
([model a b c] (analogy model 5 a b c))
([model n a b c]
(.wordsNearest model
[b c]
[a]
n)))
Let’s try our new function on that classic example:
;; king:queen::man:[woman, Attempted abduction, teenager, girl]
(analogy gnews-vec "king" "queen" "man")
;; => ["woman" "girl" "teenage_girl" "teenager" "lady"]
Success!
We’re going to be running a lot of analogy checks based on the Google News vectors, so let’s make another helper function.
(def gnews-analogy (partial analogy gnews-vec))
Take it for a spin with variations on the king/queen/man/woman theme.
(gnews-analogy 1 "man" "king" "woman")
;; => ["queen"]
(gnews-analogy "king" "prince" "queen")
;; => ["princess" "duchess" "Camilla" "Mette_Marit" "Princess"]
;; I wonder what GloVe says?
(analogy glove-vectors "king" "queen" "man")
;; => ["woman" "girl" "her" "boy" "she"]
Fantastic. What about the rest of the example analogies from the DL4J docs? For example, they say:
Not only will Rome, Paris, Berlin and Beijing cluster near each other, but they will each have similar distances in vectorspace to the countries whose capitals they are; i.e. Rome - Italy = Beijing - China. And if you only knew that Rome was the capital of Italy, and were wondering about the capital of China, then the equation Rome - Italy + China would return Beijing. No kidding.
Which suggests a set of vectors like:
This is easy-peasy for us to reproduce:
(gnews-analogy "Italy" "Rome" "China")
;; => ["Beijing" "Shanghai" "Bejing" "Hu" "Chinese"]
We inferred the capital of China, with the power of word2vec, DL4J, Clojure, and Google News! Take some time to realize just how big a deal this is:
...the Word2vec algorithm has never been taught a single rule of English syntax. It knows nothing about the world, and is unassociated with any rules-based symbolic logic or knowledge graph. And yet it learns more, in a flexible and automated fashion, than most knowledge graphs will learn after years of human labor. It comes to the Google News documents as a blank slate, and by the end of training, it can compute complex analogies that mean something to humans.
Of course, GloVe can do the same trick (if you’re careful to downcase the place names):
(analogy glove-vectors "germany" "berlin" "china")
;; => ["beijing" "taipei" "shanghai" "taiwan" "chinese"]
What about "Two large countries and their small, estranged neighbors"?
;; China:Taiwan::Russia:[Ukraine, Moscow, Moldova, Armenia]
(gnews-analogy "China" "Taiwan" "Russia")
;; => ["Ukraine" "Russian" "Moscow" "Moldova" "Armenia"]
I wasn’t aware that Moscow was an estranged neighbor of Russia, but I see the point.
This next analogy is one of my favorites. It’s both fun and a good test case.
;; house:roof::castle:[dome, bell_tower, spire, crenellations, turrets]
(gnews-analogy "house" "roof" "castle")
;; => ["dome" "bell_tower" "spire" "crenellations" "turrets"]
GloVe picks words that are just as accurate but from a different perspective:
(analogy glove-vectors "house" "roof" "castle")
;; => ["moat" "fortress" "battlements" "stonework" "ramparts"]
This one turned out a bit strange:
;; knee:leg::elbow:[forearm, arm, ulna_bone]
(gnews-analogy "knee" "leg" "elbow")
;; => ["forearm" "arm" "legs" "puncturing_lung" "ulna_bone"]
Yet again we need to manually remove the synonym (“leg”/“legs”). :/ And “puncturing lung” (!) seems to be an aberration.
The next analogy took some finesse.
;; New York Times:Sulzberger::Fox:[Murdoch, Chernin, Bancroft, Ailes]
The DL4J docs explain the relationship:
The Sulzberger-Ochs family owns and runs the NYT. The Murdoch family owns News Corp., which owns Fox News. Peter Chernin was News Corp.’s COO for 13 yrs. Roger Ailes is president of Fox News. The Bancroft family sold the Wall St. Journal to News Corp.
Finding the correct way to represent the token “The New York Times” required exploration. If you’re evaluating this at home, try the following expression with a few different variations on the name. (Use underscores for spaces.)
(.wordsNearest gnews-vec "NYT" 10)
My results don’t perfectly match the docs, but it’s close enough for casual NLP work:
(gnews-analogy "NYTimes" "Sulzberger" "Fox")
;; => ["Bancroft" "Murdoch" "Riggio" "FitzSimons" "ABC"]
Four more:
;; love:indifference::fear:[apathy, callousness, timidity, helplessness, inaction]
(gnews-analogy "love" "indifference" "fear")
;; => ["apathy" "callousness" "timidity" "helplessness" "inaction"]
;; Donald Trump:Republican::Barack Obama:[Democratic, GOP, Democrats, McCain]
;; "It’s interesting to note that, just as Obama and McCain were rivals, so too, Word2vec thinks Trump has a rivalry with the idea Republican."
(gnews-analogy "Donald_Trump" "Republican" "Barack_Obama")
;; => ["Democratic" "Democrat" "GOP" "Democrats" "McCain"]
;; monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]
;; "Humans are fossilized monkeys? Humans are what’s left over from monkeys? Humans are the species that beat monkeys just as Ice Age mammals beat dinosaurs? Plausible."
(gnews-analogy "monkey" "human" "dinosaur")
;; => ["dinosaurs" "fossil" "fossilized" "Ice_Age_mammals" "fossilization"]
;; building:architect::software:[programmer, SecurityCenter, WinPcap]
(gnews-analogy "building" "architect" "software")
;; => ["Software" "programmer" "sofware" "SecurityCenter" "WinPcap"]
Domain-specific similarity
Lastly, it’s fun to use Word2Vec models to look for similarity within a set of constraints. For instance, we can examine the cosine similarity of Nordic countries:
(.similarity gnews-vec "Sweden" "Sweden")
;; 1.0
(.similarity gnews-vec "Sweden" "Norway")
;; 0.7706172105613238
We can go a step further, and look for all words closely associated with “Sweden”, sorted by similarity. (Careful---the whole batch took ~400 seconds for me. You might want to un-comment the take 10000
if you want to sample the results before diving into a long computation.)
(->> (.words (.vocab gnews-vec))
;; (take 10000)
(map (juxt identity #(.similarity gnews-vec "Sweden" %)))
(sort-by second >)
(take 10))
;; => (["Sweden" 1.0]
;; ["Finland" 0.8084677457809448]
;; ["Norway" 0.7706173658370972]
;; ["Denmark" 0.7673707604408264]
;; ["Swedish" 0.7404001951217651]
;; ["Swedes" 0.7133287191390991]
;; ["Scandinavian" 0.6518087983131409]
;; ["Stena_Match_Cup" 0.6437666416168213]
;; ["Netherlands" 0.6401048302650452]
;; ["official_Lars_Emilsson" 0.6374118328094482])
We get roughly similar results to the DL4J folks, who report “The nations of Scandinavia and several wealthy, northern European, Germanic countries are among the top nine.” But...our results notably lack Slovenia, Estonia, Switzerland, and Belgium. Stranger still, ours include several non-countries. One might suspect that the DL4J folks manually removed synonyms like “Swedish” and “Swedes”, but notice the absence of “Scandinavian”. They must have programmatically ignored non-country words. I don’t blame them for not sharing the code; if it was Java it was probably terribly long, and if it was written in another language it would require a whole separate explanation. But we have Clojure, and massaging data like this is Clojure’s bread and butter.
We can similarly restrict our list to words that match country names by using the MARC code list for countries:
(def countries
(set (map (fn [s] (subs s (inc (.indexOf s " "))))
(string/split-lines (slurp "marc-country-codes.txt")))))
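Before using it, let’s spot-check that the parsing did what we expect. Each line of the MARC list pairs a short code with a country name, and we keep everything after the first space:
(count countries)
;; => the number of country names parsed
(countries "Sweden")
;; => "Sweden" (a Clojure set is a function of its members, so this doubles
;;    as a membership test)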
Now we can add a filter (to keep only country names) and a rest (to drop “Sweden” itself, whose self-similarity is 1.0) to what we used above, making the calculation fast enough to get real results immediately:
(->> (.words (.vocab gnews-vec))
(filter countries)
(map (juxt identity #(.similarity gnews-vec "Sweden" %)))
(sort-by second >)
rest
(take 10))
;; => (["Finland" 0.8084677457809448]
;; ["Norway" 0.7706173658370972]
;; ["Denmark" 0.7673707604408264]
;; ["Netherlands" 0.6401048302650452]
;; ["Latvia" 0.6308621168136597]
;; ["Switzerland" 0.6276936531066895]
;; ["Austria" 0.6137555837631226]
;; ["Germany" 0.610486626625061]
;; ["Slovakia" 0.6045371294021606]
;; ["Estonia" 0.5964043736457825])
Our results are now virtually identical to DL4J’s. The top 3 results are the same, with different ordering. The rest are broadly similar, swapping for instance Germany for Belgium, Slovenia for Slovakia, and Latvia for Iceland. These are the kinds of differences you would expect if you were able to ask a similar query of two of your friends, or of one friend in two different months.
Furthermore, despite operating on a 3-million-word vocabulary, adding those two lines (plus another three to create a set of country names) brought this computation down to less than a second. Our speed gains come because filtering is relatively cheap and gets the sequence down to a manageable size before we execute our expensive similarity
and sort
operations.
Conclusion
That’s all, folks. We’ve looked at how Word2Vec encapsulates meaning, how to create word vectors from a corpus, and what we can do with pre-built word vectors computed over large corpora. I hope you’ve had fun. I certainly did. For me, Word2Vec touches on fundamental truths about our human world, just like physics and biology. That’s because in a way, word vectors are a more direct representation of words as they are actually used than any dictionary. Working with word vectors makes me feel like I can reach out and touch the Platonic ideal of language itself. (At least written language.)
What next? Working further with Word2Vec generally means feeding trained word vectors to a neural net of some kind. Cutting-edge techniques for natural language tasks rely on Word2Vec or similar strategies as a first input to their deep learning pipelines.
If you enjoyed this as a blog post, I encourage you to clone the repo and run it with a REPL. It’s truly a richer experience to be able to go “off-trail” and ask your own questions of the model with the article side-by-side.
— Dave Liepmann, 01 June 2019