whitespace regular expression filter tokens by content

student24 · April 2014

Hello everybody,

I want to search for words in documents. I use the operator Filter Tokens (by Content) with a regular expression. To search for more than one word, I use word1|word2|...|wordn. My question is: how can I search for an expression that contains whitespace, for example "Research and Development|Word2|Word3"? Is there a wildcard for whitespace?

Thanks for your help

RalfKlinkenberg · April 2014

You can use

[tt]\s[/tt] as a placeholder for a whitespace character,
[tt]\s+[/tt] for one or more whitespace characters, and
[tt]\s*[/tt] for zero, one, or more whitespace characters.
[tt]\t[/tt] is a placeholder for tabulator symbols.

RapidMiner regular expressions use the Java syntax for regular expressions. If you search for "[tt]Java regular expressions[/tt]" with Google or another search engine, you will find a lot of documentation.

Best wishes,
Ralf
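The placeholders listed above can be tried outside RapidMiner with plain Java, since the thread notes that RapidMiner uses Java regex syntax. The following is only a minimal sketch: the class name WhitespaceDemo, the helper wholeMatch, and the sample strings are made up for illustration and are not part of RapidMiner.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not RapidMiner code.
class WhitespaceDemo {
    // True if the WHOLE input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \s matches exactly one whitespace character (space, tab, newline, ...)
        System.out.println(wholeMatch("a\\sb", "a b"));    // true
        System.out.println(wholeMatch("a\\sb", "a  b"));   // false (two spaces)
        // \s+ matches one or more whitespace characters
        System.out.println(wholeMatch("a\\s+b", "a  b"));  // true
        // \s* matches zero or more whitespace characters
        System.out.println(wholeMatch("a\\s*b", "ab"));    // true
        // \t matches only a tab
        System.out.println(wholeMatch("a\\tb", "a\tb"));   // true
        System.out.println(wholeMatch("a\\tb", "a b"));    // false

        // A phrase with whitespace as one alternative among single words
        String regex = "research\\s+and\\s+development|word2|word3";
        System.out.println(wholeMatch(regex, "research   and development")); // true
        System.out.println(wholeMatch(regex, "word2"));                      // true
    }
}
```

Note that the doubled backslashes are only needed inside Java string literals; in a RapidMiner parameter field the pattern is written with single backslashes, as in the posts above.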

student24 · April 2014

Thank you very much for your reply.

I have tried these before, but it doesn't work. The word list contains no results although the expression is in the document. I don't know what I'm doing wrong. Do you know whether it works when I'm examining PDF files?

Thanks

RalfKlinkenberg · April 2014

If you post the XML code of your RapidMiner process here, there is a chance that someone in the forum may be able to help.

Without being able to see the RapidMiner process, we can only guess where the problem in your RapidMiner process might be.

student24 · April 2014

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
        <list key="text_directories">
          <parameter key="test" value="C:"/>
        </list>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="380" y="30"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="514" y="30">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="research\sand\sdevelopment"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

fras · April 2014

For some reason your XML is not valid, but the important line is this:
<parameter key="regular_expression" value="research\sand\sdevelopment"/>
If you search for "Research...", this regex will fail because upper/lower case
matters unless you ignore it by applying the regex flag "i" (written (?i) in Java syntax).
It will also fail if there is more than one whitespace character between the words.
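Both failure cases described above can be reproduced in plain Java. This is a sketch under assumptions: the class name CaseAndSpacingDemo, the helper wholeMatch, and the sample inputs are hypothetical, and whole-string matching is used here to mirror the "matches" condition set in the process.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not RapidMiner code.
class CaseAndSpacingDemo {
    // True if the WHOLE input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String original = "research\\sand\\sdevelopment";

        // Case matters by default
        System.out.println(wholeMatch(original, "Research and Development")); // false

        // (?i) at the start switches on case-insensitive matching
        System.out.println(wholeMatch("(?i)" + original, "Research and Development")); // true

        // A single \s does not cover two consecutive spaces
        System.out.println(wholeMatch("(?i)" + original, "Research  and Development")); // false

        // \s+ tolerates any run of whitespace
        String relaxed = "(?i)research\\s+and\\s+development";
        System.out.println(wholeMatch(relaxed, "Research  and Development")); // true
    }
}
```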

student24 · April 2014

I thought I was ignoring upper/lower case by using the operator "Transform Cases" with the option "lower case". For more than one whitespace character I could use \s+, but that doesn't work either.
Why is my XML not valid?

MariusHelf · April 2014

Corrected XML above.

student24 · April 2014

OK, thank you. I copied it, but the problem with the whitespace isn't solved. I don't know what I'm doing wrong.
Is there maybe another operator or another way to search for expressions in documents?
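One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the posted process the filter runs after Tokenize, which is used without any parameters. If Tokenize splits the text at non-letter characters, every token is a single word without whitespace, so a pattern containing \s can never match a whole token. The sketch below simulates this in plain Java; the class TokenFilterSketch, its methods, and the sample sentence are hypothetical and do not reproduce RapidMiner's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical simulation of "Tokenize -> Transform Cases -> Filter Tokens".
class TokenFilterSketch {
    // ASSUMPTION: tokens are produced by splitting at non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keeps only the tokens that match the regex as a whole.
    static List<String> keepMatching(List<String> tokens, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (p.matcher(t).matches()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "Our Research and Development team";
        List<String> tokens = tokenize(text);
        System.out.println(tokens); // [our, research, and, development, team]

        // No single token contains whitespace, so the phrase matches nothing
        System.out.println(keepMatching(tokens, "research\\s+and\\s+development")); // []

        // Single-word alternatives still work
        System.out.println(keepMatching(tokens, "research|development")); // [research, development]

        // The same phrase is found in the untokenized, lower-cased text
        System.out.println(Pattern.compile("research\\s+and\\s+development")
                .matcher(text.toLowerCase()).find()); // true
    }
}
```

If this assumption holds, the phrase would have to be matched before tokenization, or the tokens would have to be recombined into multi-word terms before filtering.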
