Delete hyphen (special chars) before tokenization

In777 Member Posts: 29 Contributor II
edited November 2018 in Help

I would like to delete all hyphens from the text documents that I analyze in RapidMiner. I use the "Process Documents from Files" operator to analyze large PDF files. Each file contains a lot of hyphens, which I would like to delete before I tokenize the text (mode: non letters). I have tried the "Replace Tokens" operator. With it I can replace hyphens with other symbols, but I cannot replace them with nothing, i.e. an empty string (""). I have also tried using my own stopword dictionary (non-letters, -), saved as a text file that lists the characters and words I want to delete, one per line. That operator does not work at all. Can anybody help with this issue?

Best Answer

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Solution Accepted

    Hi,

     

    The problem with this operator is that it does not actually permit you to enter an empty value in the replacement list. It will automatically discard such an entry.

    To circumvent this, you need to enter something that is actually text but evaluates to an empty string. The easiest way is to make use of regular expressions and their capturing groups. The idea is simply to create an empty capturing group and replace the match with this empty group. If you don't know about regular expressions, I would recommend reading a tutorial; they are really powerful and can be useful in any number of situations.

    So in your case, instead of replacing "-" with "", you need to replace "()-" with "$1". The parentheses define a capturing group. As nothing is inside them, the group is empty. You refer to a capturing group in the "replace by" term with $1.

     

    Here's an example that makes it work. Simply copy the XML and paste it into RapidMiner.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.001">
      <!-- Creates a small test document that contains hyphens -->
      <operator activated="true" class="text:create_document" compatibility="7.1.001" expanded="true" height="68" name="Create Document" width="90" x="447" y="136">
        <parameter key="text" value="This is a - hyphen-word"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <!-- Replaces "()-" with "$1", i.e. with the empty capturing group -->
      <operator activated="true" class="text:replace_tokens" compatibility="7.1.001" expanded="true" height="68" name="Replace Tokens" width="90" x="581" y="136">
        <list key="replace_dictionary">
          <parameter key="()-" value="$1"/>
        </list>
      </operator>
    </process>
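
The empty-group trick can also be checked outside RapidMiner. The following is a minimal Python sketch and not part of the original answer: the function name `delete_hyphens` is made up for illustration, and Python writes the group reference as `\1` where the Replace Tokens entry above uses `$1`.

```python
import re


def delete_hyphens(text):
    # "()-" is an empty capturing group followed by a literal hyphen.
    # The replacement refers back to group 1, which is always empty,
    # so every hyphen is effectively replaced by nothing.
    return re.sub(r"()-", r"\1", text)


if __name__ == "__main__":
    # Same test sentence as in the Create Document operator above.
    print(repr(delete_hyphens("This is a - hyphen-word")))
```

The spaces on either side of the free-standing hyphen are left untouched, so the result contains a double space at that position.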

     

    Greetings,

      Sebastian

Answers

  • bhupendra_patil Employee, Member Posts: 168 RM Data Scientist

    In Replace Tokens, try a backslash followed by an actual space as the replacement, i.e. "\ " without the double quotes. I have not tried that myself, but I remember doing something like that in the past.
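
Replacing the hyphen with a space is not the same as deleting it once the text is tokenized on non-letters, as the question intends. The sketch below illustrates the difference in plain Python. It assumes a simplified tokenizer that splits on anything outside A-Z and a-z, and the helper name `tokenize_non_letters` is hypothetical, so RapidMiner's own tokenizer may behave differently in detail.

```python
import re


def tokenize_non_letters(text):
    # Split on runs of non-letter characters and drop empty pieces.
    # Simplification: only ASCII letters count as letters here.
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


if __name__ == "__main__":
    text = "This is a - hyphen-word"
    deleted = text.replace("-", "")   # hyphen removed entirely
    spaced = text.replace("-", " ")   # hyphen replaced by a space
    print(tokenize_non_letters(deleted))
    print(tokenize_non_letters(spaced))
```

Under these assumptions, deleting the hyphen keeps "hyphenword" together as one token, whereas replacing it with a space yields the same tokens as leaving the hyphen in place, because the hyphen is itself a non-letter.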
