Gensim Word2Vec – A Complete Guide

Gensim Word2Vec

Word2Vec is an algorithm that converts a word into vectors such that it groups similar words together into vector space. It is widely used in many applications like document retrieval, machine translation systems, autocompletion and prediction etc. In this tutorial, we will learn how to train a Word2Vec model using the Gensim library as well as loading pre-trained that converts words to vectors.

Word2Vec

Word2Vec is an algorithm designed by Google that uses neural networks to create word embeddings such that embeddings with similar word meanings tend to point in a similar direction. For example, embeddings of words like love, care, etc will point in a similar direction as compared to embeddings of words like fight, battle, etc in a vector space. Such a model can also detect synonyms of the given word and suggest some additional words for partial sentences.

Gensim Word2Vec

Gensim is an open-source Python library, which can be used for topic modelling, document indexing as well as retiring similarity with large corpora. Gensim’s algorithms are memory-independent with respect to the corpus size. It has also been designed to extend with other vector space algorithms.

Gensim provides the implementation of Word2Vec algorithm along with some other functionalities of Natural Language Processing in Word2Vec class. Let’s see how to create a Word2Vec model using Gensim.

Develop a Word2Vec model using Gensim

Some useful parameters that Gensim Word2Vec class takes:

  • sentences: It is the data on which the model is trained to create word embeddings. It can be a list of lists of tokens/words, or a data stream coming from network/disk in the case of large corpora. In our example, we will be using Brown Corpus present in NLTK.
  • size: It represents how long you want the dimensionality of your vector to be for each word in the vocabulary. Its default value is 100.
  • window: The maximum distance between the current word and its neighboring words. If your neighboring word is greater than the width, then, some neighboring words would not be considered as being related to the current word. Its default value is 5.
  • min_count: It represents the minimum frequency value of words to be present in the vocabulary. Its default value is 5.
  • iter: It represents the number of iterations/epochs over the dataset. Its default value is 5.

Example of using Word2Vec in Python

import string
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

nltk.download("brown")

# Preprocessing data to lowercase all words and remove single punctuation words
document = brown.sents()
data = []
for sent in document:
  new_sent = []
  for word in sent:
    new_word = word.lower()
    if new_word[0] not in string.punctuation:
      new_sent.append(new_word)
  if len(new_sent) > 0:
    data.append(new_sent)

# Creating Word2Vec
model = Word2Vec(
    sentences = data,
    size = 50,
    window = 10,
    iter = 20,
)

# Vector for word love
print("Vector for love:")
print(model.wv["love"])
print()

# Finding most similar words
print("3 words similar to car")
words = model.most_similar("car", topn=3)
for word in words:
  print(word)
print()

#Visualizing data
words = ["france", "germany", "india", "truck", "boat", "road", "teacher", "student"]

X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
	pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

Output:

Some Output[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Vector for love:
[ 2.576164   -0.2537464  -2.5507743   3.1892483  -1.8316503   2.6448352
 -0.06407754  0.5304831   0.04439827  0.45178193 -0.4788834  -1.2661372
  1.0238386   0.3144989  -2.3910248   2.303471   -2.861455   -1.988338
 -0.36665946 -0.32186085  0.17170368 -2.0292065  -0.9724318  -0.5792801
 -2.809848    2.4033384  -1.0886359   1.1814215  -0.9120702  -1.1175308
  1.1127514  -2.287549   -1.6190344   0.28058434 -3.0212548   1.9233572
  0.13773602  1.5269752  -1.8643662  -1.5568101  -0.33570558  1.4902842
  0.24851061 -1.6321756   0.02789219 -2.1180007  -1.5782264  -0.9047415
  1.7374605   2.1492126 ]

3 words similar to car
('boat', 0.7544293403625488)
('truck', 0.7183066606521606)
('block', 0.6936473250389099)
Word2Vec Visualization
Word2Vec Visualization

In the above visualization, we can see that the words student and teacher point towards one direction, countries like India, Germany, and France point in another direction, and words like road, boat, and truck in another. This shows that our Word2Vec model has learned the embeddings that can differentiate words based on their meaning.

Loading Pre-trained Models using Gensimd

Gensim also comes with several already pre-trained models as we can see below.

import gensim
import gensim.downloader

for model_name in list(gensim.downloader.info()['models'].keys()):
  print(model_name)
fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis

Let’s load the word2vec-google-news-300 model and perform different tasks such as finding relations between Capital and Country, getting similar words, and calculating cosine similarity.

import gensim
import gensim.downloader

google_news_vectors = gensim.downloader.load('word2vec-google-news-300')

# Finding Capital of Britain given Capital of France: (Paris - France) + Britain = 
print("Finding Capital of Britain: (Paris - France) + Britain")
capital = google_news_vectors.most_similar(["Paris", "Britain"], ["France"], topn=1)
print(capital)
print()

# Finding Capital of India given Capital of Germany: (Berlin - Germany) + India = 
print("Finding Capital of India: (Berlin - Germany) + India")
capital = google_news_vectors.most_similar(["Berlin", "India"], ["Germany"], topn=1)
print(capital)
print()

# Finding words similar to BMW
print("5 similar words to BMW:")
words = google_news_vectors.most_similar("BMW", topn=5)
for word in words:
  print(word)
print()

# Finding words similar to Beautiful
print("3 similar words to beautiful:")
words = google_news_vectors.most_similar("beautiful", topn=3)
for word in words:
  print(word)
print()

# Finding cosine similarity between fight and battle
cosine = google_news_vectors.similarity("fight", "battle")
print("Cosine similarity between fight and battle:", cosine)
print()

# Finding cosine similarity between fight and love
cosine = google_news_vectors.similarity("fight", "love")
print("Cosine similarity between fight and love:", cosine)

Output:

[==================================================] 100.0% 1662.8/1662.8MB downloaded
Finding Capital of Britain: (Paris - France) + Britain
[('London', 0.7541897892951965)]

Finding Capital of India: (Berlin - Germany) + India
[('Delhi', 0.72683185338974)]

5 similar words to BMW:
('Audi', 0.7932199239730835)
('Mercedes_Benz', 0.7683467864990234)
('Porsche', 0.727219820022583)
('Mercedes', 0.7078384757041931)
('Volkswagen', 0.695941150188446)

3 similar words to beautiful:
('gorgeous', 0.8353004455566406)
('lovely', 0.810693621635437)
('stunningly_beautiful', 0.7329413890838623)

Cosine similarity between fight and battle: 0.7021284

Cosine similarity between fight and love: 0.13506128

Conclusion

Congratulations! Now you know Word2Vec and how to create your own model that converts words to vectors. Word2Vec is widely used in many applications like document similarity and retrieval, machine translations, etc. Now you can use it in your projects as well.

Thanks for reading!