Extracting text from a record

paul_balas Member Posts: 11 Contributor I
Hi,

Is there an easy control to use to extract the text from the following field:

{
  "data": {
    "translations": [
      {
        "translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}

I want to extract just the following text:  020114 - SECURITAS - Security - AE Menor - 14x7 - Van

Best Answers

  • paul_balas Posts: 11 Contributor I
    Solution Accepted
    Much easier!  Disappointing that some of these controls are so buggy.  This solved a problem I've been struggling with for about 4 hours.  Thank you!

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 290 Unicorn
    Hello,

    Please find this XML file:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="/Users/master/files/text.json"/>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
            <parameter key="query_type" value="JsonPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries">
              <parameter key="translated" value="$.data.translations[*].translatedText"/>
            </list>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
            <parameter key="text_attribute" value="text"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    This is how the process looks:

    It uses JsonPath to extract the information.

    You can use this site to explore and understand how to use JsonPath: http://jsonpath.com/
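    Outside RapidMiner, the same extraction can be sketched in plain Python; the dict/list navigation below mirrors what the JsonPath query `$.data.translations[*].translatedText` does (this is just an illustration, not part of the process above):

```python
import json

raw = """
{
  "data": {
    "translations": [
      {
        "translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}
"""

# The query $.data.translations[*].translatedText walks into "data", then into
# "translations" (an array), and collects "translatedText" from each element.
parsed = json.loads(raw)
texts = [t["translatedText"] for t in parsed["data"]["translations"]]
print(texts[0])  # 020114 - SECURITAS - Security - AE Menor - 14x7 - Van
```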

    Hope this helps,

    Rodrigo.
  • paul_balas Member Posts: 11 Contributor I
    edited February 12
    Thank you for the help!  The output in my sample above is an ExampleSet, not a document.  Do I need to convert the ExampleSet output into a document first?
  • paul_balas Member Posts: 11 Contributor I
    I've tried another way, but to no avail.  I'm using the 'Replace' operator.  I want to use a regex to extract text from a record like this.



    The goal is to extract the text 'Incident with Vehicle', excluding any of the other text before or after it.

    Here is my process, which is complaining that the attribute doesn't exist (but it clearly does).  And here is my regex:
    (?<=Text": ")(.*)(?=",) which correctly extracts the text I'm after from the above example.
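    For what it's worth, the same pattern behaves as intended in plain Python, which suggests the issue lies in how the Replace operator applies it rather than in the pattern itself (the sample record below is illustrative):

```python
import re

# Illustrative sample record; the real attribute value comes from the process.
record = '"translatedText": "Incident with Vehicle",'

# Lookbehind anchors just after 'Text": ', lookahead stops just before '",'.
pattern = r'(?<=Text": ")(.*)(?=",)'
match = re.search(pattern, record)
print(match.group(0))  # Incident with Vehicle
```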




    Here is the 'Extract Description' transform which precedes it showing that I can reference the attribute:


  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 290 Unicorn
    Ah, tricky thing.

    I made this for you; it involves some master tricks, though:

    The idea is to transform the data into documents with the operator that serves this purpose. You can then use Loop Collection to read all the documents generated from your data. This is how it looks. And that Recall operator is part of the magic trick.

    Inside the Loop Collection super-operator (meaning the kind of operator you can put other operators inside), you do what I told you in the first place, that is, Cut Document and convert it back to data, because you need the example set, no? Now, there is another super-operator named Branch. What does this thing do? First, make sure that the Loop Collection super-operator has the set iteration macro turned on.


    The Branch operator is an if/else statement, and it has two parts:



    The first part, with the "Remember First" operator, creates a storage item that saves the first example set. That is, when the iteration is 1.

    The second part uses Recall to retrieve the storage item, appends it to the input, and calls Remember again, so that in the next iteration it will recall the storage item, join it to the example set, and remember it, and so on, and on, and on...

    Your result is then stored in the storage item, so you need a Recall at the end of the process to get your results.

    It is a bit of sorcery, and I'm pretty sure there are simpler ways to do this, but this works.
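    The Remember/Append/Recall dance can be sketched in a few lines of Python; here `store` plays the role of the named storage item and each `batch` stands in for one iteration's example set (all names are illustrative):

```python
# 'store' emulates the named storage item behind Remember/Recall;
# each 'batch' stands in for one iteration's example set.
store = {}
batches = [["row 1"], ["row 2"], ["row 3"]]

for iteration, batch in enumerate(batches, start=1):
    if iteration == 1:
        store["Partial Example Set"] = batch            # Remember First
    else:
        partial = store.pop("Partial Example Set")      # Recall
        store["Partial Example Set"] = partial + batch  # Append + Remember (2)

result = store.pop("Partial Example Set")               # final Recall
print(result)  # ['row 1', 'row 2', 'row 3']
```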

    Here is the process.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="9.2.000" expanded="true" height="82" name="(Examples)" width="90" x="112" y="136">
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
    <parameter key="file" value="/Users/master/files/text.json"/>
    <parameter key="extract_text_only" value="false"/>
    <parameter key="use_file_extension_as_type" value="true"/>
    <parameter key="content_type" value="txt"/>
    <parameter key="encoding" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="45" y="136">
    <parameter key="file" value="/Users/master/files/text2.json"/>
    <parameter key="extract_text_only" value="false"/>
    <parameter key="use_file_extension_as_type" value="true"/>
    <parameter key="content_type" value="txt"/>
    <parameter key="encoding" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="103" name="Documents to Data (2)" width="90" x="179" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <connect from_op="Read Document" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 2"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="85">
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="9.2.000" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="85">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="macro_name" value="iteration"/>
    <parameter key="macro_start_value" value="1"/>
    <parameter key="unfold" value="false"/>
    <process expanded="true">
    <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="45" y="34">
    <parameter key="query_type" value="JsonPath"/>
    <list key="string_machting_queries"/>
    <parameter key="attribute_type" value="Nominal"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <parameter key="ignore_CDATA" value="true"/>
    <parameter key="assume_html" value="true"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries">
    <parameter key="translated" value="$.data.translations[*].translatedText"/>
    </list>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <operator activated="true" class="branch" compatibility="9.2.000" expanded="true" height="103" name="Branch" width="90" x="380" y="34">
    <parameter key="condition_type" value="expression"/>
    <parameter key="expression" value="%{iteration} == 1"/>
    <parameter key="io_object" value="ANOVAMatrix"/>
    <parameter key="return_inner_output" value="true"/>
    <process expanded="true">
    <operator activated="true" class="remember" compatibility="9.2.000" expanded="true" height="68" name="Remember First" width="90" x="179" y="34">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="store_which" value="1"/>
    <parameter key="remove_from_process" value="true"/>
    </operator>
    <connect from_port="input 1" to_op="Remember First" to_port="store"/>
    <connect from_op="Remember First" from_port="stored" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="recall" compatibility="9.2.000" expanded="true" height="68" name="Recall" width="90" x="45" y="34">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_store" value="true"/>
    </operator>
    <operator activated="true" class="append" compatibility="9.2.000" expanded="true" height="103" name="Append" width="90" x="246" y="85">
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="merge_type" value="all"/>
    </operator>
    <operator activated="true" class="remember" compatibility="9.2.000" expanded="true" height="68" name="Remember (2)" width="90" x="380" y="85">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="store_which" value="1"/>
    <parameter key="remove_from_process" value="true"/>
    </operator>
    <connect from_port="input 1" to_op="Append" to_port="example set 2"/>
    <connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Remember (2)" to_port="store"/>
    <connect from_op="Remember (2)" from_port="stored" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="84"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="single" to_op="Cut Document" to_port="document"/>
    <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Branch" to_port="input 1"/>
    <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="recall" compatibility="9.2.000" expanded="true" height="68" name="Recall (2)" width="90" x="581" y="85">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_store" value="true"/>
    </operator>
    <connect from_op="(Examples)" from_port="out 1" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Recall (2)" from_port="result" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="236" resized="true" width="277" x="10" y="10">I was too lazy to generate examples, so I made a subprocess here. Change to your process to read the actual exampleset.</description>
    </process>
    </operator>
    </process>

    Hope this helps. This process is not something a newcomer is expected to build, but it's also not rocket science, so don't despair, and mention me if you have more questions.

    All the best,

    Rodrigo.

  • paul_balas Member Posts: 11 Contributor I
    Thank you.  It took me a bit, but I understand what's happening here.  My only problem is that I'm now getting a 'Malformed JSON' error.  I put a breakpoint before the 'Cut Document' operator that is complaining, and the collection looks fine to me...  Any ideas on how to debug would be appreciated!
  • paul_balas Member Posts: 11 Contributor I
    edited February 13
    This works!  Thank you.  I'm a bit clueless on WHY the 'Process Object' control can act on the data in the 'Parse JSON from Data' control, as it has no 'input'.

    Also confusing is why, in the 'Process Object' control, I have another embedded 'Process Object' control, then the 'Process Array', and finally the controls to 'Extract Properties' (I'm still unsure what 'Commit Row' does as well).

    Also, a strange behavior is that after the 'Parse JSON from Data', I can't reduce the attributes passed through (I selected 'keep example set' which passes through all the attributes).


  • yyhuang Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 182 RM Data Scientist
    Hi @paul_balas, you would need a trial license from @land.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,522 Unicorn

    Let me briefly summarize why there are two phases in the parsing process of JSON. The first phase is when you design your parser using the Process Object, Process Array and Extract operators together with the Commit Row operator.
    This gives you a parser specification object that you can then apply to JSON using one of the Parse JSON operators. The idea is, as with models in RapidMiner, that you can take the construction of the parse specification offline, save it in the repository, and use it from there. This allows a very flexible setup in cases where you have dynamic JSON and want to configure the parser based on some data, using Process logic with loops, branches, macros, etc.
    That should explain why the Process Object does not get any input data: it just prepares the specification. This specification is then used internally by the Parse JSON operators, which configure a so-called push parser. The push parser is very fast and puts the results directly into the output. This adds some inconvenience and a new way of doing things (although it is very similar to training a model and applying it), but it is necessary for the speedup of roughly 450x compared to the standard operators...
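    The two-phase idea is easy to mimic outside RapidMiner: build a reusable specification once, then apply it to many documents, much like training a model and then applying it. A minimal Python sketch (the function names and the path format are invented for illustration, not the extension's actual API):

```python
import json
from functools import reduce

def build_spec(steps):
    """Phase 1: construct a reusable parse specification (a list of keys/indices)."""
    def apply_spec(raw_json):
        # Phase 2: apply the specification to any JSON document.
        return reduce(lambda node, step: node[step], steps, json.loads(raw_json))
    return apply_spec

# Build once, reuse across many documents.
spec = build_spec(["data", "translations", 0, "translatedText"])
raw = '{"data": {"translations": [{"translatedText": "020114 - SECURITAS"}]}}'
print(spec(raw))  # 020114 - SECURITAS
```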

    I strongly recommend reading our three blog posts about the extension:
    And of course we are very open for any feedback!

    With kind regards,
     Sebastian