"Generate Attributes - function expression OR regex for

sirhc Member Posts: 3 Learner I
edited May 2019 in Help
Hello everyone,

I have a nominal attribute title which contains a text description, and somewhere within that description is the year (4 digits). Sometimes there are other digits in the text as well. So I need to search for "4 digits within the text" and generate a new attribute for the year.

Example: 

title = "that is the 1st test attribute 2019 but not the last one."

Now I want to extract the year from the title attribute.

Year = 2019 

I first tried the Replace operator with the regex "\d{4}", but I could only replace the digits, not extract them into a new attribute.

Can someone please help me or give me an idea of how to solve this?

Thank you in advance. I am a newbie to RapidMiner.

Best, 
Chris 

Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    edited March 2019
    Hi @sirhc,

    For the example text, there are at least three operators to choose from:
    Extract Information
    Keep Document Parts
    Cut Document

    Could you try these operators with a regex?

    My example process uses two of them:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
            <parameter key="text" value="that is the 1st test attribute 2019 but not the last one"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
          <operator activated="true" class="text:keep_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Keep Document Parts" width="90" x="447" y="34">
            <parameter key="extraction_regex" value="\ \d{4}\ "/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="187">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="year" value="\ \d{4}\ "/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Keep Document Parts" to_port="document"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Extract Information" to_port="document"/>
          <connect from_op="Keep Document Parts" from_port="document" to_port="result 1"/>
          <connect from_op="Extract Information" from_port="document" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    YY
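
The regex used in both operators above can be sanity-checked outside RapidMiner. The following is only a minimal Python sketch, not part of the original process: it applies the pattern from the XML to the example text and compares it with a word-boundary variant, `\b\d{4}\b`, which is an assumption added here for illustration. RapidMiner uses Java regular expressions, so behaviour there may differ in edge cases.

```python
import re

title = "that is the 1st test attribute 2019 but not the last one"

# Pattern taken from the process above: a space, four digits, a space.
spaced = re.compile(r"\ \d{4}\ ")

# Assumed alternative (not in the original process): word boundaries
# instead of literal spaces, so the match holds only the digits.
bounded = re.compile(r"\b\d{4}\b")

spaced_match = spaced.search(title)
bounded_match = bounded.search(title)

print(repr(spaced_match.group()))   # the match includes the surrounding spaces
print(repr(bounded_match.group()))  # the match is the four digits only
```

The space-delimited pattern only matches a year that has a space on both sides, so a year at the very start or end of the text would be missed. The word-boundary variant does not have that restriction.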

  • sirhc Member Posts: 3 Learner I
    Hi @yyhuang

    Thanks, this worked, but not quite as expected, since the data had to be converted to documents first.

    Hi @IngoRM

    Thank you very much. This helped me a lot :smile:
    Is it possible to take only the year and, if there is no year in the title attribute, just leave the new attribute empty?
    I probably have to run another Replace operator and filter for something like [a-zA-Z], right?
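
The "year or empty" logic asked about here can be sketched in plain Python. This is only an illustration of the idea, not a RapidMiner process: the function name `extract_year` and the pattern `\b\d{4}\b` are assumptions made for the example, and a missing value is represented as `None`.

```python
import re
from typing import Optional

# Assumed pattern: exactly four digits delimited by word boundaries,
# so "1st" and longer digit runs such as "12345" are not matched.
YEAR_PATTERN = re.compile(r"\b\d{4}\b")


def extract_year(title: str) -> Optional[int]:
    """Return the first standalone 4-digit number in the title, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group()) if match else None


titles = [
    "that is the 1st test attribute 2019 but not the last one.",
    "a title without any year",
]
years = [extract_year(t) for t in titles]
print(years)
```

Searching for the year directly, instead of stripping letters with a second replace, leaves the result empty on its own when nothing matches.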

    In the next step I have to generate a new attribute age, calculated as today minus the year attribute extracted above. Is there a simple way to do that? Or just an idea?
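
The age calculation described here is a simple subtraction of the extracted year from the current year. The sketch below shows it in Python, purely as an illustration: the function name `age_in_years` and its optional `today` parameter are hypothetical, and rows without a year are assumed to stay empty (`None`).

```python
from datetime import date
from typing import Optional


def age_in_years(year: Optional[int], today: Optional[date] = None) -> Optional[int]:
    """Return the current year minus the given year, or None if no year exists."""
    if year is None:
        return None  # keep the age empty when no year was extracted
    reference = today or date.today()
    return reference.year - year


# A fixed reference date keeps the example reproducible.
print(age_in_years(2019, today=date(2024, 6, 1)))
print(age_in_years(None))
```

This compares calendar years only. If the title carried a full date, month and day would also have to be taken into account.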

    Thank you very much - you guys helped me a lot. 

    Best Chris 
  • sirhc Member Posts: 3 Learner I
    Hi Ingo, 

    Thank you very much, this worked perfectly.

    Have a nice weekend, 
    Chris 