Word2Vec is an algorithm that converts words into vectors such that similar words are grouped together in the vector space. It is widely used in applications such as document retrieval, machine translation, autocompletion, and text prediction. In this tutorial, we will learn how to train a Word2Vec model using the Gensim library, as well as how to load pre-trained models that convert words to vectors.
Word2Vec is an algorithm designed by Google that uses neural networks to create word embeddings such that the embeddings of words with similar meanings tend to point in a similar direction. For example, in the vector space, the embeddings of words like love and care will point in a similar direction to each other, and in a different direction from the embeddings of words like fight and battle. Such a model can also be used to find words close in meaning to a given word and to suggest likely words for partial sentences.
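To make "pointing in a similar direction" concrete, here is a minimal sketch of cosine similarity, the usual measure of how closely two embeddings are aligned. The 3-dimensional vectors below are hypothetical and invented purely for illustration; real Word2Vec embeddings are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not produced by any real model).
toy_vectors = {
    "love": [0.9, 0.8, 0.1],
    "care": [0.8, 0.9, 0.2],
    "battle": [-0.7, 0.1, 0.9],
}

love_care = cosine_similarity(toy_vectors["love"], toy_vectors["care"])
love_battle = cosine_similarity(toy_vectors["love"], toy_vectors["battle"])

print(f"love vs care:   {love_care:.3f}")
print(f"love vs battle: {love_battle:.3f}")
```

A value close to 1 means the two vectors point in nearly the same direction, while a value near 0 or below means they are unrelated or opposed. In this toy data, love and care score much higher than love and battle.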
Gensim is an open-source Python library that can be used for topic modelling, document indexing, and similarity retrieval with large corpora. Gensim’s algorithms are memory-independent with respect to the corpus size. It has also been designed to be easy to extend with other vector space algorithms.
Gensim provides an implementation of the Word2Vec algorithm, along with some other Natural Language Processing functionality, in the Word2Vec class. Let’s see how to create a Word2Vec model using Gensim.
Develop a Word2Vec model using Gensim
Some useful parameters that the Gensim Word2Vec class takes:
- sentences: The data on which the model is trained to create word embeddings. It can be a list of lists of tokens/words, or a data stream coming from the network or disk in the case of large corpora. In our example, we will be using the Brown Corpus available in NLTK.
- size: The dimensionality of the vector created for each word in the vocabulary. Its default value is 100. (In Gensim 4.0 and later, this parameter is named vector_size.)
- window: The maximum distance between the current word and its neighboring words. Words farther than this distance from the current word are not treated as part of its context. Its default value is 5.
- min_count: The minimum number of times a word must occur in the corpus to be included in the vocabulary. Its default value is 5.
- iter: The number of iterations/epochs over the dataset. Its default value is 5. (In Gensim 4.0 and later, this parameter is named epochs.)
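As noted for the sentences parameter, a large corpus does not have to be loaded into memory as a list. The following is a minimal sketch of a restartable iterable that streams tokenized sentences from a text file, one sentence per line. The class name LineSentences, the whitespace tokenization, and the demo file are assumptions made for illustration and are not part of the example that follows.

```python
import os
import tempfile


class LineSentences:
    """Yield one lowercased, whitespace-tokenized sentence per line of a file.

    Hypothetical helper: the file is read lazily, so only one line is held
    in memory at a time. Because __iter__ reopens the file on every call,
    the corpus can be iterated once per training pass.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration file, created only for this example.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The cat sat on the mat\n\nDogs chase cats\n")
    corpus_path = tmp.name

sentences = LineSentences(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass sees the same data again
print(first_pass)

os.remove(corpus_path)  # clean up the demonstration file
```

An object like this could be passed as the sentences argument in place of an in-memory list. A plain generator would not be suitable, because it is exhausted after one pass while training iterates over the corpus several times. Gensim also ships a similar ready-made class, gensim.models.word2vec.LineSentence.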
Example of using Word2Vec in Python
    import string

    import nltk
    from nltk.corpus import brown
    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    from matplotlib import pyplot

    nltk.download("brown")

    # Preprocess the data: lowercase all words and drop punctuation tokens
    document = brown.sents()
    data = []
    for sent in document:
        new_sent = []
        for word in sent:
            new_word = word.lower()
            if new_word not in string.punctuation:
                new_sent.append(new_word)
        if len(new_sent) > 0:
            data.append(new_sent)

    # Create and train the Word2Vec model.
    # Note: size and iter are the Gensim 3.x parameter names;
    # Gensim 4.0+ uses vector_size and epochs instead.
    model = Word2Vec(
        sentences=data,
        size=50,
        window=10,
        iter=20,
    )

    # Vector for the word "love"
    print("Vector for love:")
    print(model.wv["love"])
    print()

    # Find the most similar words (queried through model.wv, the trained word vectors)
    print("3 words similar to car")
    words = model.wv.most_similar("car", topn=3)
    for word in words:
        print(word)
    print()

    # Visualize a few word vectors in 2D using PCA
    words = ["france", "germany", "india", "truck", "boat", "road", "teacher", "student"]
    X = model.wv[words]
    pca = PCA(n_components=2)
    result = pca.fit_transform(X)

    pyplot.scatter(result[:, 0], result[:, 1])
    for i, word in enumerate(words):
        pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
    pyplot.show()
Output:

    [nltk_data] Downloading package brown to /root/nltk_data...
    [nltk_data] Unzipping corpora/brown.zip.
    Vector for love:
    [ 2.576164 -0.2537464 -2.5507743 3.1892483 -1.8316503
      2.6448352 -0.06407754 0.5304831 0.04439827 0.45178193
      -0.4788834 -1.2661372 1.0238386 0.3144989 -2.3910248
      2.303471 -2.861455 -1.988338 -0.36665946 -0.32186085
      0.17170368 -2.0292065 -0.9724318 -0.5792801 -2.809848
      2.4033384 -1.0886359 1.1814215 -0.9120702 -1.1175308
      1.1127514 -2.287549 -1.6190344 0.28058434 -3.0212548
      1.9233572 0.13773602 1.5269752 -1.8643662 -1.5568101
      -0.33570558 1.4902842 0.24851061 -1.6321756 0.02789219
      -2.1180007 -1.5782264 -0.9047415 1.7374605 2.1492126 ]

    3 words similar to car
    ('boat', 0.7544293403625488)
    ('truck', 0.7183066606521606)
    ('block', 0.6936473250389099)
In the above visualization, we can see that the words student and teacher lie in one direction, countries like India, Germany, and France lie in another, and words like road, boat, and truck in yet another. This suggests that our Word2Vec model has learned embeddings that can differentiate words based on their meaning.
Loading Pre-trained Models using Gensim
Gensim also provides access to several pre-trained models, which we can list as shown below.
    import gensim
    import gensim.downloader

    # Print the names of all pre-trained models available for download
    for model_name in list(gensim.downloader.info()['models'].keys()):
        print(model_name)
    fasttext-wiki-news-subwords-300
    conceptnet-numberbatch-17-06-300
    word2vec-ruscorpora-300
    word2vec-google-news-300
    glove-wiki-gigaword-50
    glove-wiki-gigaword-100
    glove-wiki-gigaword-200
    glove-wiki-gigaword-300
    glove-twitter-25
    glove-twitter-50
    glove-twitter-100
    glove-twitter-200
    __testing_word2vec-matrix-synopsis
Let’s load the word2vec-google-news-300 model and perform different tasks such as finding relations between a capital and its country, getting similar words, and calculating cosine similarity.
    import gensim
    import gensim.downloader

    google_news_vectors = gensim.downloader.load('word2vec-google-news-300')

    # Finding Capital of Britain given Capital of France: (Paris - France) + Britain =
    print("Finding Capital of Britain: (Paris - France) + Britain")
    capital = google_news_vectors.most_similar(["Paris", "Britain"], ["France"], topn=1)
    print(capital)
    print()

    # Finding Capital of India given Capital of Germany: (Berlin - Germany) + India =
    print("Finding Capital of India: (Berlin - Germany) + India")
    capital = google_news_vectors.most_similar(["Berlin", "India"], ["Germany"], topn=1)
    print(capital)
    print()

    # Finding words similar to BMW
    print("5 similar words to BMW:")
    words = google_news_vectors.most_similar("BMW", topn=5)
    for word in words:
        print(word)
    print()

    # Finding words similar to beautiful
    print("3 similar words to beautiful:")
    words = google_news_vectors.most_similar("beautiful", topn=3)
    for word in words:
        print(word)
    print()

    # Finding cosine similarity between fight and battle
    cosine = google_news_vectors.similarity("fight", "battle")
    print("Cosine similarity between fight and battle:", cosine)
    print()

    # Finding cosine similarity between fight and love
    cosine = google_news_vectors.similarity("fight", "love")
    print("Cosine similarity between fight and love:", cosine)
    [==================================================] 100.0% 1662.8/1662.8MB downloaded
    Finding Capital of Britain: (Paris - France) + Britain
    [('London', 0.7541897892951965)]

    Finding Capital of India: (Berlin - Germany) + India
    [('Delhi', 0.72683185338974)]

    5 similar words to BMW:
    ('Audi', 0.7932199239730835)
    ('Mercedes_Benz', 0.7683467864990234)
    ('Porsche', 0.727219820022583)
    ('Mercedes', 0.7078384757041931)
    ('Volkswagen', 0.695941150188446)

    3 similar words to beautiful:
    ('gorgeous', 0.8353004455566406)
    ('lovely', 0.810693621635437)
    ('stunningly_beautiful', 0.7329413890838623)

    Cosine similarity between fight and battle: 0.7021284
    Cosine similarity between fight and love: 0.13506128
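The capital-and-country queries above rely on vector arithmetic of the form (Paris - France) + Britain. The following is a simplified sketch of that idea, not Gensim's actual implementation: it normalizes each vector, adds the positive words, subtracts the negative ones, and ranks the remaining words by cosine similarity to the result. The analogy function and the 3-dimensional vectors are hypothetical, hand-picked so that the relationship holds.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative).

    Words used in the query itself are excluded from the results.
    """
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)

    excluded = set(positive) | set(negative)
    scores = [
        (word, sum(t * x for t, x in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy embeddings: the axes loosely stand for
# "is a capital", "relates to France", and "relates to Britain".
toy_vectors = {
    "France": [0.0, 1.0, 0.0],
    "Paris": [1.0, 1.0, 0.0],
    "Britain": [0.0, 0.0, 1.0],
    "London": [1.0, 0.0, 1.0],
    "Madrid": [1.0, 0.1, 0.1],
    "croissant": [0.2, 0.9, 0.0],
}

result = analogy(toy_vectors, positive=["Paris", "Britain"], negative=["France"])
print(result[0][0])
```

With these hand-picked vectors, London ranks first. Subtracting France removes the "relates to France" component from Paris, and adding Britain moves the result towards words that are both capitals and related to Britain.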
Congratulations! Now you know what Word2Vec is and how to create your own model that converts words to vectors. Word2Vec is widely used in applications such as document similarity and retrieval, machine translation, etc. Now you can use it in your projects as well.
Thanks for reading!