"Retaining selected word pairs when tokenizing"

carl · December 2016

When tokenizing into single word tokens, is there a way to keep selected pairs of words together as a single token?

For example, in soccer the term "centre forward" makes more sense as a single token. I looked at n-grams, but this pairs words that I do not want to pair. I tried using the stem dictionary, but this seems not to work across multiple tokens, and if I put the stem before tokenize, e.g. to change centre forward to centre-forward, this doesn't appear to work.

IngoRM · December 2016

Hi Carl,

All observations are correct. Since there is no replace operator across multiple tokens, I think you have to apply the Replace operator on the data set in your case. The other options do not seem to be really feasible here.

But don't worry, you can actually do this by first transforming your document into an example set, performing the replacement, and transforming it back into a document. The process below shows how. Please note that you either need to change your tokenization to something other than "non letters" or use letters as the delimiter in your replacement (or no delimiter at all), because a non-letter delimiter would be split again by the tokenizer.

This will probably not win first prize for elegance, but it does the job.

Hope that helps,

Ingo

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.3.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
        <parameter key="filename" value="C:\Users\IngoMierswa\Desktop\Latest Materials\Data\mini_newsgroups\mini_newsgroups\alt.atheism\51121"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="179" y="34"/>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="replace_what" value="political correctness"/>
        <parameter key="replace_by" value="politicalDELIMcorrectness"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.3.000" expanded="true" height="68" name="Data to Documents" width="90" x="581" y="34">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="select" compatibility="7.3.001" expanded="true" height="68" name="Select" width="90" x="715" y="34"/>
      <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="849" y="34">
        <parameter key="characters" value=".: "/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Select" to_port="collection"/>
      <connect from_op="Select" from_port="selected" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
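
For readers outside RapidMiner, the same idea can be sketched in a few lines of Python. This is only an illustration of the replace-then-tokenize logic, not what the operators do internally; the function names and the sample sentence are made up for the example. It also shows why the delimiter matters: an underscore is split by a non-letter tokenizer, while a letters-only marker such as DELIM survives.

```python
import re

def protect_phrases(text, phrases, marker="DELIM"):
    """Join each listed phrase into one word by swapping its spaces for `marker`."""
    for phrase in phrases:
        text = text.replace(phrase, marker.join(phrase.split()))
    return text

def tokenize_on_chars(text, split_chars=".: "):
    """Split on any of the given characters (mirrors the ".: " setting above)."""
    pattern = "[" + re.escape(split_chars) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]

def tokenize_non_letters(text):
    """Split on every non-letter character, as a "non letters" tokenizer would."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

# Hypothetical sample text
text = "The debate: political correctness again. Nothing new."
protected = protect_phrases(text, ["political correctness"])
tokens = tokenize_on_chars(protected)

# An underscore is a non-letter, so the pair is split again;
# a letters-only marker keeps it together.
split_again = tokenize_non_letters("centre_forward")
kept_whole = tokenize_non_letters("centreDELIMforward")
```

Here `tokens` contains `politicalDELIMcorrectness` as a single entry, `split_again` is two tokens, and `kept_whole` is one.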

IngoRM · December 2016

Hi,

you could apply a "Replace" operator before you tokenize. Let's assume your text documents are initially stored as values in a nominal / text column. Then you can use "Replace" to, well, replace "centre forward" by "centre_forward" which will be kept as is in a later tokenization.

Hope that helps,

Ingo

Telcontar120 · December 2016

You could do this a couple of different ways. First, you could use n-grams and then a custom stopword dictionary after that to remove the n-grams that you are not interested in (it just requires a text file as input so if you output the word list after the n-gram using "Wordlist to Data" then you should be able to copy/paste the relevant items into a text file fairly easily). This is probably the way I would do it if I had a large number of substitutions to make.

Another approach would be to use a stem dictionary. It sounds like you tried a variation of this, but you would want to place it after the n-gram operator and after tokenize. I don't see why that approach wouldn't work, although I haven't tried it.

A third option if you have only a few of these substitutions to make is simply to use the "replace token" operator, which allows you to use regular expressions for your substitution search.

I hope this helps!
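
A rough Python sketch of the first option (generate n-grams, then treat the unwanted ones as stopwords). It assumes the n-gram step keeps the single words and appends pairs joined with an underscore; the helper name and sample tokens are made up for illustration.

```python
def add_bigrams(tokens, sep="_"):
    """Return the original tokens followed by every adjacent pair joined with `sep`."""
    return tokens + [a + sep + b for a, b in zip(tokens, tokens[1:])]

# Hypothetical token list
tokens = ["the", "centre", "forward", "scored", "twice"]
with_bigrams = add_bigrams(tokens)

# Build the custom stopword list: every n-gram seen, minus the ones to keep.
# In practice this list would be exported, edited, and saved as a text file.
ngrams_seen = {t for t in with_bigrams if "_" in t}
stopwords = ngrams_seen - {"centre_forward"}

filtered = [t for t in with_bigrams if t not in stopwords]
```

After filtering, only `centre_forward` remains among the pairs, while the single words `centre` and `forward` are still present alongside it.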

IngoRM · December 2016

Hi Brian,

Just to add to the last one: the problem is that "Replace Token" only works on single tokens, so if you have already tokenized the text, the two words are separated into two tokens and can no longer be replaced as a pair...

Cheers,

Ingo

Telcontar120 · December 2016

Yep, sorry, I should have clarified that in this instance you can use "replace token" after you have generated n-grams, so you could turn "centre_.*" into "centre_forward" or similar.
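
The difference between the two situations can be sketched in Python. This is a simplified stand-in for a per-token regex replacement, not the operator's actual implementation; matching the whole token is an assumption made for the example.

```python
import re

def replace_tokens(tokens, pattern, replacement):
    """Rewrite each token that fully matches `pattern`; other tokens pass through."""
    regex = re.compile(pattern)
    return [replacement if regex.fullmatch(tok) else tok for tok in tokens]

# After plain tokenization the pair is two tokens, so a phrase pattern never matches.
single_tokens = ["centre", "forward"]
after_single = replace_tokens(single_tokens, r"centre forward", "centre_forward")

# After n-gram generation the pair exists as one token and can be matched.
ngram_tokens = ["centre_forward", "centre_back", "left_wing"]
after_ngrams = replace_tokens(ngram_tokens, r"centre_.*", "centre_forward")
```

Note that a broad pattern like `centre_.*` also rewrites `centre_back` to `centre_forward`, so the pattern needs to be as specific as the terms you want to keep apart.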

carl · December 2016

Thanks Brian / Ingo. Sorry, I should have attached my process with the question. Looking at the different options:

1 - Replace before tokenization acts on an example set, and I'm initially processing a document.

2 - n-grams would create too many pairings that I would not be interested in, and if I deleted these, my frequency count would understate certain words if they'd been part of the n-grams.

3 - Stemming after tokenization would require a lot of patterns, and I'd need to re-aggregate the word frequency after breaking up the n-grams that I'm not interested in.

4 - Replace after tokenization and n-gram generation would have a similar effect to option 3.

For the most part, I'm interested in single words, with just a few exceptions where compound nouns (or concepts) make more sense than the individual words. I also wanted to see if I can distill a PDF in as few steps as possible. So a Replace acting early on the document, hyphenating the concepts I want to retain as tokens, would be ideal if that were possible.
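
To make the desired outcome concrete, here is a small Python sketch of "hyphenate early, then count". It is only an illustration under my own assumptions: the text is lowercased first, tokens are runs of letters that may contain internal hyphens, and the sample sentence is invented.

```python
import re
from collections import Counter

def count_terms(text, compounds):
    """Hyphenate the listed compounds, tokenize, and count term occurrences."""
    text = text.lower()
    for phrase in compounds:
        text = text.replace(phrase, phrase.replace(" ", "-"))
    # A token is a run of letters, optionally joined by single hyphens
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text)
    return Counter(tokens)

# Hypothetical sample text
text = "The centre forward scored. The forward line pressed from the centre."
counts = count_terms(text, ["centre forward"])
```

The compound is counted once as `centre-forward`, and the stand-alone uses of `centre` and `forward` are each counted once, so neither the compound nor the single words are over- or understated.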

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.3.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
        <parameter key="filename" value="/Users/carl/Documents/sample.pdf"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="179" y="34">
        <parameter key="content_type" value="pdf"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
      <operator activated="true" class="text:stem_dictionary" compatibility="7.3.000" expanded="true" height="82" name="Stem (Dictionary)" width="90" x="447" y="34">
        <parameter key="file" value="/Users/carl/Documents/Stemming.txt"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="581" y="34"/>
      <operator activated="true" class="text:transform_cases" compatibility="7.3.000" expanded="true" height="68" name="Transform Cases" width="90" x="715" y="34"/>
      <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="849" y="34"/>
      <operator activated="true" class="text:process_documents" compatibility="7.3.000" expanded="true" height="103" name="Process Documents" width="90" x="983" y="34">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="prune_below_absolute" value="0"/>
        <parameter key="prune_above_absolute" value="10"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.3.000" expanded="true" height="82" name="WordList to Data" width="90" x="1117" y="34"/>
      <operator activated="true" class="sort" compatibility="7.3.001" expanded="true" height="82" name="Sort" width="90" x="1251" y="34">
        <parameter key="attribute_name" value="total"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="1385" y="34"/>
      <operator activated="true" class="write_csv" compatibility="7.3.001" expanded="true" height="82" name="Write CSV" width="90" x="1519" y="136">
        <parameter key="csv_file" value="/Users/carl/data.csv"/>
      </operator>
      <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Execute R" width="90" x="1519" y="34">
        <parameter key="script" value="rm_main = function(data)&#10;{&#10;    library(base)&#10;    library(grDevices)&#10;    library(wordcloud)&#10;    library(RColorBrewer)&#10;    setwd(&quot;/Users/carl&quot;)&#10;    png(filename=&quot;mypng4.png&quot;, bg=&quot;transparent&quot;)&#10;    cloud_df &lt;- data.frame(word = data$word, freq = data$total)&#10;    wordcloud::wordcloud(cloud_df$word, cloud_df$freq, scale=c(5,0.5), max.words=50, random.order=FALSE,&#10;    rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(9,&quot;Blues&quot;))&#10;    dev.off()&#10;}"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Stem (Dictionary)" to_port="document"/>
      <connect from_op="Stem (Dictionary)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
      <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
      <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Execute R" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Write CSV" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

carl · December 2016

Thank you. That worked well. I used Replace (Dictionary) so I could maintain a small number of replacements in an Excel sheet.
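
For comparison, a dictionary-driven replacement looks roughly like this in Python. The two-column table, its column names, and its entries are hypothetical stand-ins for the Excel sheet; this is a sketch of the idea, not of how Replace (Dictionary) works internally.

```python
import csv
import io

# Hypothetical replacement table standing in for the Excel sheet
REPLACEMENTS_CSV = """from,to
centre forward,centre_forward
penalty area,penalty_area
"""

def load_replacements(handle):
    """Read a two-column from/to table into a dict."""
    return {row["from"]: row["to"] for row in csv.DictReader(handle)}

def apply_replacements(text, replacements):
    """Apply every replacement, longest phrase first so shorter entries cannot pre-empt longer ones."""
    for phrase in sorted(replacements, key=len, reverse=True):
        text = text.replace(phrase, replacements[phrase])
    return text

replacements = load_replacements(io.StringIO(REPLACEMENTS_CSV))
result = apply_replacements("the centre forward entered the penalty area", replacements)
```

As noted earlier in the thread, the joined forms only survive tokenization if the tokenizer does not split on the joining character.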
