Extracting text from a record

paul_balas Member Posts: 11 Contributor I
Hi,

Is there an easy control to use to extract the text from the following field:

{
  "data": {
    "translations": [
      {
        "translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}

I want to extract just the following text:  020114 - SECURITAS - Security - AE Menor - 14x7 - Van

Best Answers

  • paul_balas Posts: 11 Contributor I
    Solution Accepted
    Much easier!  Disappointing that some of these controls are so buggy.  This solved a problem I've been struggling with for about 4 hours.  Thank you!

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 290 Unicorn
    Hello,

    Please find this XML file:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="/Users/master/files/text.json"/>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
            <parameter key="query_type" value="JsonPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries">
              <parameter key="translated" value="$.data.translations[*].translatedText"/>
            </list>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
            <parameter key="text_attribute" value="text"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    This is how the process looks:

    It uses JsonPath to extract the information.

    You can use this site to explore and understand how to use JsonPath: http://jsonpath.com/
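    Outside RapidMiner, the same extraction can be sketched in plain Python; the dict/list navigation below mirrors what the JsonPath query `$.data.translations[*].translatedText` does (this is just an illustration, not part of the process above):

```python
import json

raw = """
{
  "data": {
    "translations": [
      {
        "translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}
"""

# The query $.data.translations[*].translatedText walks into "data", then into
# "translations" (an array), and collects "translatedText" from each element.
parsed = json.loads(raw)
texts = [t["translatedText"] for t in parsed["data"]["translations"]]
print(texts[0])  # 020114 - SECURITAS - Security - AE Menor - 14x7 - Van
```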

    Hope this helps,

    Rodrigo.
  • paul_balas Member Posts: 11 Contributor I
    edited February 12
    Thank you for the help!  The output in my sample above is an ExampleSet, not a document.  Do I need to convert the ExampleSet output into a document first?
  • paul_balas Member Posts: 11 Contributor I
    I've tried another way, but to no avail.  I'm using the 'Replace' operator.  I want to use a regex to extract text from a record like this.



    The goal is to extract the text 'Incident with Vehicle', excluding any of the other text before or after it.

    Here is my process, which is complaining that the attribute doesn't exist (but it clearly does).  And here is my regex:
    (?<=Text": ")(.*)(?=",) which correctly extracts the text I'm after from the above example.
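    For what it's worth, the same pattern behaves as intended in plain Python, which suggests the issue lies in how the Replace operator applies it rather than in the pattern itself (the sample record below is illustrative):

```python
import re

# Illustrative sample record; the real attribute value comes from the process.
record = '"translatedText": "Incident with Vehicle",'

# Lookbehind anchors just after 'Text": ', lookahead stops just before '",'.
pattern = r'(?<=Text": ")(.*)(?=",)'
match = re.search(pattern, record)
print(match.group(0))  # Incident with Vehicle
```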




    Here is the 'Extract Description' transform which precedes it showing that I can reference the attribute:


  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 290 Unicorn
    Ah, tricky thing.

    I made this for you; it involves some master tricks, though:

    The idea is to transform the data into documents with the operator that serves this purpose. You can then use Loop Collection to read all the documents generated from your data. This is how it looks. And that Recall operator is part of the magic trick.

    Inside the Loop Collection super-operator (meaning the kind of operator you can put other operators inside), you do what I told you in the first place, that is, Cut Document and convert it back to data, because you need the example set, no? Now, there is another super-operator named Branch. What does this thing do? First, make sure that the Loop Collection super-operator has the set iteration macro turned on.


    The Branch operator is an if/else statement, and it has two parts:



    The first part, with the "Remember First" operator, creates a storage item that saves the first example set. That is, when the iteration is 1.

    The second part uses Recall to retrieve the storage item, appends it to the input, and calls Remember again, so that in the next iteration it will recall the storage item, join it to the example set, and remember it, and so on, and on, and on...

    Your result is then stored in the storage item, so you need a Recall at the end of the process to get your results.

    It is a bit of sorcery, and I'm pretty sure there are simpler ways to do this, but this works.
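    The Remember/Append/Recall dance can be sketched in a few lines of Python; here `store` plays the role of the named storage item and each `batch` stands in for one iteration's example set (all names are illustrative):

```python
# 'store' emulates the named storage item behind Remember/Recall;
# each 'batch' stands in for one iteration's example set.
store = {}
batches = [["row 1"], ["row 2"], ["row 3"]]

for iteration, batch in enumerate(batches, start=1):
    if iteration == 1:
        store["Partial Example Set"] = batch            # Remember First
    else:
        partial = store.pop("Partial Example Set")      # Recall
        store["Partial Example Set"] = partial + batch  # Append + Remember (2)

result = store.pop("Partial Example Set")               # final Recall
print(result)  # ['row 1', 'row 2', 'row 3']
```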

    Here is the process.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="9.2.000" expanded="true" height="82" name="(Examples)" width="90" x="112" y="136">
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
    <parameter key="file" value="/Users/master/files/text.json"/>
    <parameter key="extract_text_only" value="false"/>
    <parameter key="use_file_extension_as_type" value="true"/>
    <parameter key="content_type" value="txt"/>
    <parameter key="encoding" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="45" y="136">
    <parameter key="file" value="/Users/master/files/text2.json"/>
    <parameter key="extract_text_only" value="false"/>
    <parameter key="use_file_extension_as_type" value="true"/>
    <parameter key="content_type" value="txt"/>
    <parameter key="encoding" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="103" name="Documents to Data (2)" width="90" x="179" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <connect from_op="Read Document" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 2"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="85">
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="9.2.000" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="85">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="macro_name" value="iteration"/>
    <parameter key="macro_start_value" value="1"/>
    <parameter key="unfold" value="false"/>
    <process expanded="true">
    <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="45" y="34">
    <parameter key="query_type" value="JsonPath"/>
    <list key="string_machting_queries"/>
    <parameter key="attribute_type" value="Nominal"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <parameter key="ignore_CDATA" value="true"/>
    <parameter key="assume_html" value="true"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries">
    <parameter key="translated" value="$.data.translations[*].translatedText"/>
    </list>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <operator activated="true" class="branch" compatibility="9.2.000" expanded="true" height="103" name="Branch" width="90" x="380" y="34">
    <parameter key="condition_type" value="expression"/>
    <parameter key="expression" value="%{iteration} == 1"/>
    <parameter key="io_object" value="ANOVAMatrix"/>
    <parameter key="return_inner_output" value="true"/>
    <process expanded="true">
    <operator activated="true" class="remember" compatibility="9.2.000" expanded="true" height="68" name="Remember First" width="90" x="179" y="34">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="store_which" value="1"/>
    <parameter key="remove_from_process" value="true"/>
    </operator>
    <connect from_port="input 1" to_op="Remember First" to_port="store"/>
    <connect from_op="Remember First" from_port="stored" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="recall" compatibility="9.2.000" expanded="true" height="68" name="Recall" width="90" x="45" y="34">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_store" value="true"/>
    </operator>
    <operator activated="true" class="append" compatibility="9.2.000" expanded="true" height="103" name="Append" width="90" x="246" y="85">
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="merge_type" value="all"/>
    </operator>
    <operator activated="true" class="remember" compatibility="9.2.000" expanded="true" height="68" name="Remember (2)" width="90" x="380" y="85">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="store_which" value="1"/>
    <parameter key="remove_from_process" value="true"/>
    </operator>
    <connect from_port="input 1" to_op="Append" to_port="example set 2"/>
    <connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Remember (2)" to_port="store"/>
    <connect from_op="Remember (2)" from_port="stored" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="84"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="single" to_op="Cut Document" to_port="document"/>
    <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Branch" to_port="input 1"/>
    <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="recall" compatibility="9.2.000" expanded="true" height="68" name="Recall (2)" width="90" x="581" y="85">
    <parameter key="name" value="Partial Example Set"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_store" value="true"/>
    </operator>
    <connect from_op="(Examples)" from_port="out 1" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Recall (2)" from_port="result" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="236" resized="true" width="277" x="10" y="10">I was too lazy to generate examples, so I made a subprocess here. Change to your process to read the actual exampleset.</description>
    </process>
    </operator>
    </process>

    Hope this helps. This process is not something a newcomer is expected to build, but it's also not rocket science, so don't despair, and mention me if you have more questions.

    All the best,

    Rodrigo.

  • paul_balas Member Posts: 11 Contributor I
    Thank you.  It took me a bit, but I understand what's happening here.  My only problem is that I'm now getting a 'Malformed JSON' error.  I put a breakpoint before the 'Cut Document' operator that is complaining, and the collection looks fine to me...  Any ideas on how to debug would be appreciated!
  • paul_balas Member Posts: 11 Contributor I
    edited February 13
    This works!  Thank you.  I'm a bit clueless on WHY the 'Process Object' control can act on the data in the 'Parse JSON from Data' control, as it has no 'input'.

    Also confusing is why, in the 'Process Object' control, I have another embedded 'Process Object' control, then the 'Process Array', and finally the controls to 'Extract Properties' (I'm still unsure what 'Commit Row' does as well).

    Also, a strange behavior is that after the 'Parse JSON from Data', I can't reduce the attributes passed through (I selected 'keep example set' which passes through all the attributes).


  • yyhuang Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 182 RM Data Scientist
    Hi @paul_balas, you would need a trial license from @land.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,522 Unicorn

    Let me briefly summarize why there are two phases in the parsing process of JSON. The first phase is when you design your parser using the Process Object, Process Array and Extract operators together with the Commit Row operator.
    This gives you a parser specification object that you can then apply to JSON using one of the Parse JSON operators. The idea is, as with models in RapidMiner, that you can take the construction of the parse specification offline, save it in the repository, and use it from there. This allows a very flexible setup in cases where you have dynamic JSON and want to configure the parser based on some data, using Process logic with loops, branches, macros, etc.
    That should explain why the Process Object does not get any input data: it just prepares the specification. This specification is then used internally by the Parse JSON operators, which configure a so-called push parser. The push parser is very fast and puts the results directly into the output. This adds some inconvenience and a new way of doing things (although it is very similar to training a model and applying it), but it is necessary for the speedup of roughly 450x compared to the standard operators...
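    The two-phase idea is easy to mimic outside RapidMiner: build a reusable specification once, then apply it to many documents, much like training a model and then applying it. A minimal Python sketch (the function names and the path format are invented for illustration, not the extension's actual API):

```python
import json
from functools import reduce

def build_spec(steps):
    """Phase 1: construct a reusable parse specification (a list of keys/indices)."""
    def apply_spec(raw_json):
        # Phase 2: apply the specification to any JSON document.
        return reduce(lambda node, step: node[step], steps, json.loads(raw_json))
    return apply_spec

# Build once, reuse across many documents.
spec = build_spec(["data", "translations", 0, "translatedText"])
raw = '{"data": {"translations": [{"translatedText": "020114 - SECURITAS"}]}}'
print(spec(raw))  # 020114 - SECURITAS
```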

    I strongly recommend reading our three blog posts about the extension:
    And of course we are very open for any feedback!

    With kind regards,
     Sebastian