
"How to get Meta Data with the Data to Documents and Process Documents operators"

dramhampton Member Posts: 9 Contributor II
edited May 2019 in Help
I'm performing text analytics and am struggling with Meta Data.

In the toy process below, meta data should be available to the Data to Documents operator. But if you want to specify weights and click on Edit List, the source attribute list doesn't populate, so you have to type each attribute name manually. That seems an unnecessary chore if there are several attributes to be listed. Is there a 'proper' way to do this?

Also, the Process Documents operator loses all the meta data (quite understandably because it is creating a bunch of new attributes from the text it is fed) - what is best practice for restoring the meta data so that subsequent operators can be set up easily?

Many thanks!

David


<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Root" origin="GENERATED_SAMPLE">
    <parameter key="logverbosity" value="warning"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../../data/Golf"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Best Answer

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I have had this metadata loss problem before. The best way I have found to handle it is to create an ID for each document (if you don't already have one), copy the data with Multiply, run Process Documents on one copy, and then merge the metadata back in from the other copy by ID.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
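
    The pattern described above could look roughly like the process below. This is a sketch under assumptions rather than a tested process. It uses the Golf sample data from the question. It swaps in Process Documents from Data (which takes an example set directly) in place of Process Documents. It adds a Nominal to Text step so that Outlook and Wind count as text. The operator names, layout coordinates, and parameter values are illustrative, and it assumes the id attribute survives text processing as a special attribute so that Join can match on it.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <!-- Give every row an id so the two copies can be matched up again later -->
          <operator activated="true" class="generate_id" compatibility="9.2.000" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34"/>
          <!-- Copy 1 goes through text processing, copy 2 keeps the original attributes -->
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
          <!-- Assumption: the nominal columns must be of type text to be picked up -->
          <operator activated="true" class="nominal_to_text" compatibility="9.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="34"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="prune_method" value="none"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
                <parameter key="mode" value="non letters"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
            </process>
          </operator>
          <!-- Merge the word vector (left) with the untouched copy (right) on the id -->
          <operator activated="true" class="join" compatibility="9.2.000" expanded="true" height="82" name="Join" width="90" x="715" y="34">
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="true"/>
            <list key="key_attributes"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
        </process>
      </operator>
    </process>
    ```

    If the id does not come through with its role intact in your version of the extension, matching on explicitly listed key attributes in Join may be an alternative.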
  • dramhampton Member Posts: 9 Contributor II
    Good suggestion on the way to handle the inevitable loss of meta data caused by Process Documents, many thanks Brian.

    Any ideas about why Data to Documents cannot use the Meta Data provided to it by the previous operator?
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @dramhampton - so both good questions.

    1. Regarding specifying attributes and weights in Data to Documents: any attribute of type text is used by default. Hence I rarely go in here, as I just convert the attributes to text ahead of time. Yes, the attribute list does not propagate into this list; that is likely a known bug in the Text Processing extension. I will investigate.

    2. (I think Brian answered this faster than I could! I was going to say the same thing... :smiley: )

    Scott
  • dramhampton Member Posts: 9 Contributor II
    Thank you both.  This is very encouraging, I can feel I am getting closer to the point where I realise that I am doing it all wrong.

    But here's the thing: I have created a toy process below that does 'Data to Documents' and 'Process Documents' on a simple dataset that just has two text columns.  The version pasted below has these two columns manually selected in the Data to Documents operator, and it works just fine - creates a document term matrix.  But if you now uncheck the 'select attributes and weights' box in Data to Documents (which should have the same result if I am reading Scott right), the Process Documents operator fails to produce a document term matrix.  So the only way I can get the process to work is to manually specify all the text attributes that I want to use - which to my original point is very clunky because the Data to Documents operator appears to ignore the meta data it has been presented...


    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="Outlook|Wind"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34">
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="Outlook" value="1.0"/>
              <parameter key="Wind" value="1.0"/>
            </list>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • dramhampton Member Posts: 9 Contributor II
    Bingo.  Thank you so much for taking the time to put this together Scott!

    For the benefit of those who will read this later, the key points are:
    - Before processing documents, make sure your text data is of type 'text' rather than 'polynominal'.  Text means it's a string of any number of words, such as reviews of products on a website, whereas polynominal means it is nominal data with a finite number of different values (even if it's a large number of them), such as names of items for sale on a website.  It would be reasonable to make a Pareto chart from polynominal data to see which values occur most often, but it makes no sense to do so with text. It's a subtle difference, and I will admit to having assumed that RapidMiner treated them the same.

    So Scott introduced a Nominal to Text operator, which forced the polynominal attributes to text.  That was why the Data to Documents operator was not seeing the meta data: it is looking for text attributes, not polynominal ones.

    And finally, to put the lid on it, he pointed out that it is actually easier to use just a single operator - Process Documents from Data - instead of Data to Documents and Process Documents.

    All great stuff, many thanks Scott!

    David
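
For readers who want something to import, the summary above might translate into a process along these lines. It is a minimal sketch and not the process Scott posted: the operator names, layout coordinates, and parameter values are assumptions, and it reuses the Golf sample data and the Outlook and Wind columns from earlier in the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Outlook|Wind"/>
      </operator>
      <!-- Force the polynominal columns to type text so they are treated as documents -->
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34"/>
      <!-- Single operator standing in for Data to Documents plus Process Documents -->
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="prune_method" value="none"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
            <parameter key="mode" value="non letters"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
        </process>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    </process>
  </operator>
</process>
```

Parameters left out here fall back to the operator defaults, so the result may differ slightly from the earlier toy process, which set pruning and data management options explicitly.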