"Problems with filtering attributes with regex"

TobiasNehrig Member Posts: 41 Guru
edited June 2019 in Help

Hi experts,

I have to create a co-occurrence graph, so I built a corpus and an occurrence matrix. I have a problem with the occurrence matrix: I can't get it to keep only words with 3 or more letters for my analysis. When I use, for example, [(0-9)+][-!"#$%&'()*+,./:;<=>?@\[\\\]_`{|}~][(0-9)+] [(a-z){3,}] all columns are deleted.

 

Does anyone have an idea how to fix this problem?

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logfile" value="/home/knecht/Master2017/Rapp/Logfile.log"/>
<parameter key="resultfile" value="/home/knecht/Master2017/Rapp/resultfile.res"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
<parameter key="url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml/main/main.html"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml/.*"/>
<parameter key="follow_link_with_matching_url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml.*"/>
</list>
<parameter key="max_crawl_depth" value="10"/>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="/home/knecht/Crawler"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_page_size" value="500"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"/>
<parameter key="ignore_robot_exclusion" value="true"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="45" y="136">
<parameter key="link_attribute" value="Link"/>
<parameter key="page_attribute" value="link"/>
<parameter key="random_user_agent" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="45" y="238">
<parameter key="keep_text" value="true"/>
<list key="specify_weights">
<parameter key="link" value="1.0"/>
</list>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
<parameter key="minimum_text_block_length" value="2"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize Token" width="90" x="45" y="136">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="45" y="238"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Tokenize Token" to_port="document"/>
<connect from_op="Tokenize Token" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="text" value="1.0"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Data to Document" width="90" x="313" y="34"/>
<operator activated="true" class="write_as_text" compatibility="7.6.001" expanded="true" height="82" name="Write Korpus" width="90" x="447" y="34">
<parameter key="result_file" value="/home/knecht/Master2017/Korpus/17-12-01-Rapp-Korpus.res"/>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="179" y="289"/>
<operator activated="true" class="write_excel" compatibility="7.6.001" expanded="true" height="82" name="Write Excel Wordlist" width="90" x="447" y="391">
<parameter key="excel_file" value="/home/knecht/17-12-01-Rapp-Wordlist.xlsx"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="187">
<parameter key="text_attribute" value="text"/>
<parameter key="label_attribute" value="text"/>
<parameter key="data_management" value="memory-optimized"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="187"/>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="289">
<parameter key="attribute_filter_type" value="regular_expression"/>
<parameter key="regular_expression" value="[(0-9)+][-!&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~][(0-9)+] [(a-z){3,}] "/>
<parameter key="value_type" value="text"/>
<parameter key="use_value_type_exception" value="true"/>
<parameter key="except_value_type" value="text"/>
<parameter key="block_type" value="value_matrix"/>
</operator>
<operator activated="true" class="write_excel" compatibility="7.6.001" expanded="true" height="82" name="Write Excel Korpus" width="90" x="447" y="187">
<parameter key="excel_file" value="/home/knecht/17-12-01-Rapp-RohMatrix.xlsx"/>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Data to Document" to_port="input"/>
<connect from_op="Data to Document" from_port="output 1" to_op="Write Korpus" to_port="input 1"/>
<connect from_op="Data to Document" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Write Korpus" from_port="input 1" to_port="result 1"/>
<connect from_op="WordList to Data" from_port="word list" to_port="result 4"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Write Excel Wordlist" to_port="input"/>
<connect from_op="Write Excel Wordlist" from_port="through" to_port="result 5"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Write Excel Korpus" to_port="input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 3"/>
<connect from_op="Write Excel Korpus" from_port="through" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

 

17-12-02-Crawler Process.png


Best Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Inside your "Process Documents" operator, after you have tokenized your words, simply use the "Filter Tokens (by Length)" operator and set it to the minimum length desired. I think that's a much easier way to get what you are trying to accomplish.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
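As a rough illustration of the suggestion above, this Python sketch drops short tokens before any matrix is built, which is the effect a length filter has inside the document-processing step. It is not RapidMiner code: the function name, the parameter names `min_chars` and `max_chars`, and the sample tokens are assumptions made for this example only.

```python
def filter_tokens_by_length(tokens, min_chars=3, max_chars=None):
    """Keep tokens whose length is at least min_chars and, if given, at most max_chars."""
    kept = []
    for token in tokens:
        too_short = len(token) < min_chars
        too_long = max_chars is not None and len(token) > max_chars
        if not (too_short or too_long):
            kept.append(token)
    return kept


# Hypothetical tokens, as they might look after tokenizing and lower-casing.
sample_tokens = ["die", "ko", "okkurrenz", "im", "korpus", "a"]
filtered = filter_tokens_by_length(sample_tokens, min_chars=3)
print(filtered)
```

Because the short tokens never become columns of the matrix, no attribute-name regex is needed afterwards.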
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted

    Hi!

     

    You have a highly complex and very specific regex. I wasn't even able to find a text that it matches.

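    The way you're using character classes [] and parentheses () is not very common. Inside [], characters such as ( ) + { } lose their special meaning and are matched literally, so [(a-z){3,}] matches a single character, not a word of three or more letters. This would be more standard usage: [a-z()] (if you're really matching lower case characters and the opening and closing parentheses).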

     

    The regexp also has a space at the end.

    In Select Attributes, the regexp must match the whole attribute name. (Usually a regex only needs to match a part of the target; Select Attributes is different in this regard.)

     

    When developing regexes, it's best to start from a simple state and then build up on that, using RapidMiner's testing methods.

     

    If I understand your problem, the regex (\w+-){2}\w+ would be a simple representation of "word-word-word". You can start from this and build upon it. 

     

    Regards,

    Balázs
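The points in the answer above (whole-name matching, the trailing space, and character classes versus groups) can be checked with a short sketch. Python's `re` module is used here as a stand-in for the regex engine in RapidMiner, on the assumption that both behave the same for these simple patterns. The attribute names and the `select` helper are hypothetical and only mimic the whole-name matching described for Select Attributes.

```python
import re

# Hypothetical attribute names standing in for columns of the occurrence matrix.
attribute_names = ["ab", "der", "korpus", "wort-zu-wort", "2017", "text"]

# Pattern copied from the question, including its trailing space.
ORIGINAL = r"""[(0-9)+][-!"#$%&'()*+,./:;<=>?@\[\\\]_`{|}~][(0-9)+] [(a-z){3,}] """


def select(pattern, names):
    """Keep only the names that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Each [...] matches exactly one character, so the original pattern describes
# a fixed six-character shape such as "1-2 a " and rejects ordinary words.
print(select(ORIGINAL, attribute_names))
print(re.fullmatch(ORIGINAL, "1-2 a ") is not None)

# A character class followed by a quantifier keeps words of 3+ lowercase letters.
print(select(r"[a-z]{3,}", attribute_names))

# The pattern suggested in the answer for "word-word-word".
print(select(r"(\w+-){2}\w+", attribute_names))

# Partial matching succeeds where whole-name matching fails.
print(re.search(r"[a-z]{3,}", "wort-zu-wort") is not None)
print(re.fullmatch(r"[a-z]{3,}", "wort-zu-wort") is not None)

# A trailing space in the pattern must also be present in the name.
print(re.fullmatch(r"[a-z]{3,} ", "korpus") is not None)
```

Note that `[a-z]` does not cover umlauts or ß, so German tokens containing them would be dropped by this pattern.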
