Synonym Detection with Word2Vec

edited May 2020 in Knowledge Base

 Introducing the Word2Vec Extension to the RapidMiner Marketplace! 

We recently published a new extension on our marketplace: an advanced text mining algorithm called Word2Vec. The core operator is also called Word2Vec and can be thought of as a learner. Below I will briefly explain the basics of what Word2Vec does and then show how you can use it in your RapidMiner text mining processes.


What is Word2Vec

One of the key problems of text mining is that distances between words are hard to define. One could also say: "It’s hard to do math with words by themselves." For example, there are words like beautiful and gorgeous, which have similar meanings but are spelled very differently. How should an algorithm know that "beautiful" and "gorgeous" have the same meaning? Or do they merely share similar connotations while meaning different things?


Word2Vec is a word vector algorithm which attempts to tackle this problem. As the name implies, this operator takes a word and turns it into a vector. So what is so special about Word2Vec? The cool part is that the resulting vector can be associated with the “meaning” of a word. For example:


1. Let's take a sentence from raw text:             RapidMiner has a new extension called Word2Vec


2. Now let's 'window' our sentence and always leave out the word in the middle:


                RapidMiner has ___ new extension

                has a ___ extension called

                new extension ___ Word2Vec


3. Word2Vec defines a probability P for the missing word, depending on the surrounding words. To do this, it assigns a vector to every word. The whole trick of Word2Vec is that it optimizes all vector entries to maximize the probability of the correct gap word and minimize it for all other words. This way every word ends up with a vector that reflects the contexts it appears in.
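
The windowing in step 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the extension's implementation: the function name `context_windows` and the `half_width` parameter are made up for this example, and only full windows (two words on each side) are emitted.

```python
def context_windows(tokens, half_width=2):
    """Yield (context, target) pairs, leaving out the middle word of each window."""
    for i in range(half_width, len(tokens) - half_width):
        left = tokens[i - half_width:i]
        right = tokens[i + 1:i + 1 + half_width]
        yield left + right, tokens[i]


sentence = "RapidMiner has a new extension called Word2Vec".split()
pairs = list(context_windows(sentence))

for context, target in pairs:
    # The context words are what the model sees; the target is the gap word.
    print(" ".join(context), "->", target)
```

Each pair is one training example: the model is asked to give the left-out word a high probability given its surrounding words.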


Sample Process with Word2Vec

There are various ways to use Word2Vec as a useful addition to your data science processes. In this sample process we will create a custom stemming dictionary from TripAdvisor review data (available here). All depicted processes are attached to this post.


Our analysis is split into three parts. The first process reads in the data and transforms it into a collection of documents, each of which is already tokenized. The second process then trains a Word2Vec model on this collection, and the third process generates a stemming dictionary.


Step 1: Read and Tokenize

The data is provided in one flat file for each hotel with the following structure:


<Overall Rating>4
<Avg. Price>$302

<Content>Wonderful time- even with the snow! What a great experience! From the goldfish in the room (which my daughter loved) to the fact that the valet parking staff who put on my chains on for me it was fabulous. The staff was attentive and went above and beyond to make our stay enjoyable. Oh, and about the parking: the charge is about what you would pay at any garage or lot- and I bet they wouldn't help you out in the snow!
<Date>Dec 23, 2008
<No. Reader>-1
<No. Helpful>-1
<Check in / front desk>5
<Business service>-1


We read all files in with a Loop Files + Read Document combination, and then loop over all documents to extract only the content with a Cut Document operator. Inside Cut Document we transform all tokens to lower case and tokenize the document. After flattening the nested collections into a single collection of documents, we store it in our repository for later use.
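
For readers who want to see the same preparation outside of RapidMiner, here is a minimal Python sketch. It assumes each review sits on a single line starting with `<Content>`, as in the excerpt above, and it tokenizes on non-letters, which is an assumption about the tokenizer settings. The function name `extract_reviews` and the shortened sample text are only for illustration.

```python
import re


def extract_reviews(raw_text):
    """Return one lower-cased token list per <Content> line."""
    documents = []
    for line in raw_text.splitlines():
        if line.startswith("<Content>"):
            text = line[len("<Content>"):].lower()
            # Keep runs of letters only; everything else acts as a separator.
            documents.append(re.findall(r"[a-z]+", text))
    return documents


sample = """<Overall Rating>4
<Avg. Price>$302

<Content>Wonderful time- even with the snow! What a great experience!
<Date>Dec 23, 2008
<Content>The staff was attentive.
"""

docs = extract_reviews(sample)
print(docs)
```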


[Figure: Read In Process]


Step 2: Train the Model

Training a Word2Vec model is straightforward: get the data, apply Word2Vec, and store the result. The layer size, which defines the length of each word vector, is set to a moderate 100 and the window size is set to 7. The iterations parameter is set to a high 50, which should ensure convergence.

[Figure: Training Process]

Step 3: Building the Stemming Dictionary


Building the final dictionary needs a tiny bit of postprocessing. The new operator Extract Vocabulary can extract the vectors for all or part of the corpus used for training. Using Cross Distances, we can then get the distance between two word vectors, measured as cosine similarity.

In the postprocessing we first need to remove the duplicate word entries created by the cross distance.


Afterwards there is a different type of duplicate to handle: pairs where the first word of one example equals the second word of another example and vice versa.

Word1                                  Word2

Gorgeous                            Beautiful

Beautiful                              Gorgeous

[Figure: The final process, including the postprocessing that creates a stemming dictionary]

Finally we apply a threshold on the similarity to produce a well-pruned list. The threshold is controlled with a macro and can thus also be set from outside the process. The only thing left to make sure is that no word appears as a synonym more than once. We can do this by removing some additional duplicates.
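
The postprocessing logic can be illustrated with a small Python sketch. It is not the RapidMiner process itself: the toy vectors are invented, and the greedy "each word is used at most once" rule is just one possible way to make sure a word is not a synonym more than once.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def synonym_pairs(vectors, threshold):
    # combinations() visits every unordered pair exactly once, so neither
    # (word, word) nor the mirrored (b, a) duplicate is ever produced.
    scored = [
        (cosine_similarity(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    used, result = set(), []
    for sim, a, b in sorted(scored, reverse=True):
        # Keep only pairs above the threshold whose words are still unused.
        if sim >= threshold and a not in used and b not in used:
            result.append((a, b, round(sim, 3)))
            used.update((a, b))
    return result


# Invented 3-dimensional vectors, for illustration only.
toy_vectors = {
    "beautiful": [0.9, 0.1, 0.0],
    "gorgeous": [0.8, 0.2, 0.0],
    "wall": [0.0, 0.1, 0.9],
    "walls": [0.0, 0.2, 0.8],
}

pairs = synonym_pairs(toy_vectors, threshold=0.9)
print(pairs)
```

Raising the threshold prunes the list further; lowering it lets in weaker matches.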


Let's have a look at the results!

[Figure: Examples for found synonyms]

If you examine the results you can see some obvious similarities like wall and walls, and some more clever synonyms like people and guests, or anywhere and somewhere.


It gets interesting when words with opposite meanings are considered synonyms (best-worst, warm-cool, etc.). This is due to the way Word2Vec works: these words fit into the same gaps and are hence considered similar to each other. Depending on your task this can be useful (e.g. topic recognition) or detrimental (e.g. sentiment analysis). For the latter you need to walk through the result list manually and prune it further.


As a last step we can use an Aggregate operator in combination with a Generate Attributes operator to generate regular expressions.  For example:













The resulting format can be applied to any document you have. The operator for this is called “Stem Tokens using Example Set” and is part of the Operator Toolbox extension.
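
As a rough illustration of the aggregation step, the following Python sketch groups synonyms by their target word and joins them into one regular expression per target. The exact format the operator expects is not reproduced here; the pipe-separated alternation, the function names and the extra synonym "stunning" are assumptions made for this example.

```python
import re
from collections import defaultdict


def build_stem_rules(pairs):
    """Group synonyms by target word and join them into one alternation."""
    grouped = defaultdict(list)
    for target, synonym in pairs:
        grouped[target].append(re.escape(synonym))
    return {target: "|".join(sorted(syns)) for target, syns in grouped.items()}


def apply_rules(tokens, rules):
    """Replace every token that fully matches a rule by the rule's target word."""
    compiled = [(re.compile(pattern), target) for target, pattern in rules.items()]
    return [
        next((target for rx, target in compiled if rx.fullmatch(tok)), tok)
        for tok in tokens
    ]


rules = build_stem_rules([
    ("beautiful", "gorgeous"),
    ("wall", "walls"),
    ("beautiful", "stunning"),  # hypothetical extra synonym
])
stemmed = apply_rules(["gorgeous", "walls", "room"], rules)
print(rules)
print(stemmed)
```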


Where can I learn more?





    Thanks very much @mschmitz for this fantastic process. Just experimenting now.


    If I want to analyse a group of documents and find not only the single words that have a vector relationship, but also bigram, trigram phrases is that possible? Or does it melt your computer...


    Can this be combined with any other text processing or modified to produce topical buckets of terms?


    I was wondering if it is possible to split the input documents by punctuation.


    I am inputting webpages that have headings etc., at present I am stripping out stop words, short strings 4 letters.


    Therefore, I end up with just long strings.


    However, was thinking if I split each document by sentence or paragraph/ list content? I could then create many separate documents (from one html page) that could be classified or grouped by similarity.


    Using document to similarity to process those buckets of sentences.


    I would get words output in the dictionary of Word2vec that are not just related to each other, but related to the concept (as defined by the documents to similarity groupings of sentences or lists extracted from the html document.)


    I am probably not thinking correctly about it.


    My goal is to end up with buckets of words that could then be used in the construction of paragraphs within a newly written document, and that are known to be related in vector space. Not only to each other, but also to other words within the topical buckets.


     (The buckets being defined by the pre-processing using documents to similarity) rather than just individual words related to each other.


    I used the ITF/TO before and that works ok to find bigrams and trigram strings to get them on the page.


    However, the problem is the same you end up with the phrases on the page, but not necessarily near to each other.


    It works for creating a statistically similar page (google), but it's very time consuming, with lots of manual pruning.


    Then you have to post process your document for synonyms to ensure you have not overegged it.


    I would like to create some sort of process that stitches several processes together ITF/TO Word2Vec, Document clustering, LSI to produce some sort of master grouping of words.


    That way it would just be a matter of taking that grouping of n words and forming a meaningful paragraph out of it.


    Knowing in advance that it has ticked all the boxes.


    I purchased the book, not picked it up yet :)


    I was also looking at this: lda2vec


    Is this possible in RapidMiner?


    regards lee

  Hi @websiteguy,


    First of all: thanks for the kind words and for using the operator. It is always cool to see people use the tools you write.

    Let's go through your questions a bit.


    If I want to analyse a group of documents and find not only the single words that have a vector relationship, but also bigram, trigram phrases is that possible? Or does it melt your computer...

    Word2Vec by itself does not support bigrams. But you could find frequent bigrams using Process Documents and then use Replace Tokens to replace, for example, "not good" with "not_good", which Word2Vec then treats as a single word.
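
The bigram trick can be sketched in plain Python as follows. This is an illustration of the idea rather than of the RapidMiner operators; the function name `merge_frequent_bigrams`, the `min_count` cut-off and the two toy documents are assumptions made for this example.

```python
from collections import Counter


def merge_frequent_bigrams(documents, min_count=2):
    """Join every bigram seen at least min_count times into one token."""
    counts = Counter(pair for doc in documents for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    merged_docs = []
    for doc in documents:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second word of the merged bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["the", "room", "was", "not", "good"],
    ["breakfast", "is", "not", "good", "either"],
]
merged = merge_frequent_bigrams(docs)
print(merged)
```

The merged token lists can then be fed to Word2Vec like any other tokenized documents.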


    I was wondering if it is possible to split the input documents by punctuation.

    Sure, Cut Document should do the trick.
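
Outside of RapidMiner, the same kind of split can be sketched with a regular expression. This is a simplified assumption about what counts as a sentence boundary (whitespace following ".", "!" or "?"), and the function name `split_sentences` is made up for this example.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Wonderful time- even with the snow! What a great experience! The staff was attentive."
sentences = split_sentences(text)
print(sentences)
```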


    I am inputting webpages that have headings etc., at present I am stripping out stop words, short strings 4 letters. Therefore, I end up with just long strings. However, was thinking if I split each document by sentence or paragraph/ list content? I could then create many separate documents (from one html page) that could be classified or grouped by similarity.
    Using document to similarity to process those buckets of sentences.

    You can treat whole sentences as words in the operator. This also includes things like tags or parts of code. The only thing I would be worried about is that you need a large enough sample.

    I would get words output in the dictionary of Word2vec that are not just related to each other, but related to the concept (as defined by the documents to similarity groupings of sentences or lists extracted from the html document.) I am probably not thinking correctly about it.

    Not sure what you mean here.

    I would like to create some sort of process that stitches several processes together ITF/TO Word2Vec, Document clustering, LSI to produce some sort of master grouping of words.

    I would consider clustering the vectors with some cosine similarity measure.
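
To make the idea more concrete, here is a very small Python sketch that groups word vectors by cosine similarity. It uses a simple greedy scheme as a stand-in for a real clustering operator, and the two-dimensional toy vectors, the function names and the 0.8 threshold are all invented for this example.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cluster_by_cosine(vectors, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_words)
    for word in sorted(vectors):
        for seed, members in clusters:
            if cosine_similarity(seed, vectors[word]) >= threshold:
                members.append(word)
                break
        else:
            # No existing cluster was close enough, so this word seeds a new one.
            clusters.append((vectors[word], [word]))
    return [members for _, members in clusters]


# Invented 2-dimensional vectors, for illustration only.
toy_vectors = {
    "guests": [0.9, 0.1],
    "people": [0.8, 0.3],
    "wall": [0.1, 0.9],
    "walls": [0.2, 0.8],
}

buckets = cluster_by_cosine(toy_vectors)
print(buckets)
```

Each resulting bucket is a group of words whose vectors point in a similar direction, which is one way to get "topical buckets" from a trained model.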



    Never saw this before, but thanks for the link! This is not yet supported, but we may investigate it. The LDAvis package for Python seems to be a good resource for the recent LDA operator I published in Toolbox.


    
    
  Hi @mschmitz



    Thanks for the quick reply,



    By stripping out stop words and turning the text into bigrams or trigrams, then creating a document, collecting and saving?


    Then process these strings of two or three words joined by _, and each of them would be a string used in the vector, is that right?



    I am trying to create a new document


    That has a statistical similarity to the original set of documents, by including these word2vec results in its creation.


    (I have found that ITF/TO works, but it does not allow for distance, so you have to slavishly ensure the inclusion of bigrams/trigrams that occur in the original documents to ensure similarity. Even then, you have to return to your document at a later date and shift the usage of the strings about to get the nearness to other bigram strings.)





    If clustering were done on our newly created document, the original set of documents and a random set of other docs, the new doc would fall in to the same cluster as the original set as it "is like" the original set.




    At present, the vector interpretation of documents produces words from a set of documents that co-occur (by a distance of K words/synonyms from each other) therefore these words have a relationship. Is that correct?


    So when processing documents, we get a list of words and co-occurring examples of words that acts as a representation of commonalities in word usage, as defined by the 'K' distance (the stemrule).


    So Word2Vec helps us to know that we should include "acne|naturally|grab" in a sentence in our new document:


    "For sufferers of acne, I would always treat it naturally, that’s why I suggest you grab a copy of my new book"


    However, it does not tell us how near this sentence should be to another sentence that includes another stemrule?


    So if I used another stemrule in a sentence:


    "It’s absolutely vital that when keeping an injury or trauma protected we act quickly to ensure the bone does not shift"


    These two new sentences could be in the same paragraph or distant from each other in the new document.


    Is there any way to know this "nearness of stemrules"? So that the vector stemrules are used in a way that takes their distance from other stemrules into account?



    That way we would get "grouped stem rules", and the new document we produce would be more "like the originals".



    Or, is this essentially what lda2vec is doing?





    "I would consider clustering the vectors with some cosine similarity measure"


    Any chance you could show me how to do this, or explain a little further?


    Thanks for your help,


    regards lee






    There seem to be a lot of questions floating around the Community on how to use Word2Vec with Twitter data. I made a quick and dirty process showing how to do it here

  • Campello Member Posts: 3 Learner I
    Okay, so I'm completely new to this. I keep getting an error when looping over my own dataset, which is an Excel file. It says the number of iterations can't be smaller than 1. I've tried to run "read excel" instead of "read document" but with no results. Also, when I cut the document I should fill in "string matching queries" and I can't figure out what that means. Can you maybe give me a hand?
  • kayman Member Posts: 662 Unicorn
    Could you share your process? To read Excel you definitely need to use the Read Excel operator; this loads the data as an example set (like a spreadsheet, using columns and rows).

    Now, if you want to do some proper text mining, this data needs to be converted to documents (so text format). There are quite a few operators with different options for the job, so it all depends on what you actually want/need to do, and how your Excel file is constructed.
    The error you describe indicates that the Loop Files operator does not find any files meeting your conditions. Can you make sure the directory is set up correctly, and maybe try it without any filter on the files?
    
    
  • Campello Member Posts: 3 Learner I
    Hey guys, thanks for the quick reply. So, it seems like after I've put "data to docs" inside the loop thing it finally works. Although I'm not sure if it was the perfect operator, since as @kayman suggests that may depend on my data and what I need to do (what I need to do is, well, find the meanings of certain words, such as "people" and "nation" in these speeches, looking for an 'lsa' kind of thing here). I keep getting errors when cutting the document tho. I'm attaching a few images I think may help you to understand my issues. One shows my dataset (a series of parliamentary speeches, over 900 rows). The others, my processes. Btw, when cutting the document I've set the queries to "," and ",", coz I didn't know what to do and I figured "," was as good a guess as any other lol, just to see if it worked (it didn't, but not for that reason haha). Thank you so much for helping a newbie, I sure appreciate it :)

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,531 RM Data Scientist
    I'll post something on the tech issue, but it looks like you want to build something like this: https://www.zeit.de/politik/deutschland/2019-09/bundestag-jubilaeum-70-jahre-parlament-reden-woerter-sprache-wandel#s=pay gap ? It's basically a data journalism piece on all speeches held in the German Bundestag. I know it's German and your text is Portuguese, but maybe this is still a nice reference for you.

    
    
  • Campello Member Posts: 3 Learner I
    Hi @mschmitz ! This looks very beautiful! Not exactly what I'm going for, since I'm analyzing only President Bolsonaro's speeches (as a former deputy), not all of the speeches at the Assembly, but you got the point and I'll save that website, it gave me important insights. The goal is to compare the results I get for Bolsonaro's conceptions of nation and people with Marine Le Pen's, processing her speeches in the same manner. I should be able to detect if they have similar or divergent ideas about those themes. I'm a political science researcher and do research on the topic of contemporary right-wing populism :)

  edited January 2021
    That's why I thought this would be interesting to you. One of the examples they explore is the change from the word Ausländer (foreigner) to the word Migrant (migrant) and how often each was used over time. You can see how the frequency is high in the early 90s, when there were racist riots in Germany, but also in 2015, when a stream of refugees came to Germany. So if you speak German this is a good source of inspiration for you.
    The source (Zeit) is one of the best-known and most trusted newspapers in Germany, comparable to the NYT or the Washington Post.

    On your process: Are you able to share the data with me? That would allow me to quickly set it up for you. You can send it to my email, mschmitz at rapidminer.com


    
    
