
"text mining: process documents -

simon_knoll Member Posts: 40 Contributor II
edited June 2019 in Help
Hello all,
I want to cluster a set of XML documents based on content that I extract from them with XPath queries. Once the content has been extracted from the XML files, I want to cluster the documents with a given clustering algorithm (e.g. k-means).

This is how I extract the textual content:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="688" width="1007">
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="179" y="120">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <process expanded="true" height="688" width="1007">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (3)" width="90" x="458" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operation" value="//wsdl:operation/@name"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true" height="688" width="1007">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (3)" width="90" x="458" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="extraction" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="operation" value="//wsdl:operation/@name"/>
                </list>
                <list key="namespaces">
                  <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
                </list>
                <parameter key="ignore_CDATA" value="false"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>
With that, I get several examples per file, one for each piece of extracted content (several tags match my XPath query). If I now feed the whole example set to, for instance, a k-means algorithm, it clusters the individual extracted pieces (the single results of the XPath query) rather than the XML documents based on their extracted content.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="688" width="1007">
      <operator activated="false" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="715" y="750">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="file_pattern" value="*.wsdl"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="xml"/>
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="false" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="246" y="75">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operationName" value="//wsdl:operation/@name"/>
              <parameter key="documentation" value="//wsdl:documentation/text()"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="false" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="514" y="120">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="all" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (3)" width="90" x="380" y="75">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operation" value="//wsdl:operation/@name"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (3)" width="90" x="179" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="extraction" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="operation" value="//wsdl:operation/@name"/>
                </list>
                <list key="namespaces">
                  <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
                </list>
                <parameter key="ignore_CDATA" value="false"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" expanded="true" height="76" name="Clustering" width="90" x="581" y="120"/>
      <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Now my question is: what do I have to do so that I can cluster the individual documents based on their extracted content? I think there must be a way, since metadata about the filename is included.
I hope I have described my problem comprehensibly.

regards
simon knoll
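
To make the goal above concrete outside RapidMiner, here is a minimal Python sketch of "one example per file". It assumes the operation names are read with the standard library's ElementTree and that all matches from one file are collected into a single term-count vector keyed by filename, which plays a role similar to the filename metadata mentioned above. The sample WSDL strings and the helper names `operation_names` and `one_vector_per_file` are hypothetical, and this is not how the RapidMiner operators are implemented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Same namespace mapping as in the process above.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

def operation_names(wsdl_text):
    """Return the name attribute of every wsdl:operation element, in document order."""
    root = ET.fromstring(wsdl_text)
    # ElementTree's XPath subset cannot select attributes directly (no /@name),
    # so select the elements and read the attribute afterwards.
    return [op.get("name") for op in root.iterfind(".//wsdl:operation", WSDL_NS)
            if op.get("name")]

def one_vector_per_file(files):
    """Map each filename to a single term-count vector built from all of its matches."""
    return {filename: Counter(operation_names(text)) for filename, text in files.items()}

# Hypothetical in-memory stand-ins for two WSDL files.
files = {
    "weather.wsdl": """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:portType name="WeatherPort">
    <wsdl:operation name="getWeather"/>
    <wsdl:operation name="getForecast"/>
  </wsdl:portType>
</wsdl:definitions>""",
    "stock.wsdl": """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:portType name="StockPort">
    <wsdl:operation name="getQuote"/>
  </wsdl:portType>
</wsdl:definitions>""",
}

vectors = one_vector_per_file(files)
for filename, counts in vectors.items():
    print(filename, dict(counts))
```

Each file ends up as exactly one row, so a clustering algorithm fed with these vectors would group files rather than individual XPath matches.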

Answers

  • simon_knoll Member Posts: 40 Contributor II
    Maybe the following example makes clearer what I want to do. Here, instead of Cut Document and Extract Information, I just use Tokenize. It works as it should, because Tokenize generates attributes:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="688" width="1007">
          <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="wsdls" value="/home/simon/wsdls"/>
            </list>
            <process expanded="true" height="688" width="1007">
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="252" y="286"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
            <parameter key="name" value="metadata_file"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="k_means" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
            <parameter key="k" value="4"/>
          </operator>
          <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • TobiasMalbrecht Moderator, Employee, Member Posts: 295 RM Product Management
    Hi Simon,

    Have you already noticed the Combine Documents operator? I think it should be available in the Text Processing extension by now. It may be of use to you, e.g. if you put it in the flow after the Cut Document operator.

    Kind regards,
    Tobias
  • simon_knoll Member Posts: 40 Contributor II
    Hi Tobias,
    Thank you for your advice. This brings me directly to my next question.
    I am trying to classify different web services. For that, I have different types of documents related to each service. I extract information from the documents and weight it by the type of the document. I also have weight factors for the content within every single document; for instance, terms within longer text passages are more important than terms from shorter ones.

    So I mainly have these questions:
    * How can I combine different weightings (e.g. from the document level down to the within-document level) during the extraction process?
    * If I weight a text passage, how can I retain the weight for all the terms it contains when I tokenize them?
    * If I have different documents as sources of information, how can I tell the k-means operator that certain terms describe one example that is to be clustered (in my case, a service)?
      I tried this by setting the role to id, but that did not help.

    I really hope that this explanation of my problem is understandable.
    best regards
    simon
  • simon_knoll Member Posts: 40 Contributor II
    Hello all,
    Does anyone have advice for me regarding my last post?
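
The weighting questions raised above are left unanswered in the thread. As one possible way to think about them, here is a minimal Python sketch. It assumes that the document-type weight and the passage weight are combined by multiplication, that every token inherits the weight of its passage, and that contributions from all documents belonging to one service are summed into a single vector. The multiplication rule, the sample weights, and the function name `weighted_term_vectors` are assumptions made for illustration, not RapidMiner behavior.

```python
from collections import defaultdict

def weighted_term_vectors(passages, type_weights):
    """Accumulate term weights per service.

    Each passage is (service_id, doc_type, passage_weight, text). Every token
    inherits doc-type weight * passage weight (an assumed combination rule),
    and contributions from all documents of the same service are summed.
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for service_id, doc_type, passage_weight, text in passages:
        weight = type_weights[doc_type] * passage_weight
        for token in text.lower().split():
            vectors[service_id][token] += weight
    return {service: dict(terms) for service, terms in vectors.items()}

# Hypothetical weights and passages.
type_weights = {"wsdl": 2.0, "description": 1.0}
passages = [
    ("weather-service", "wsdl", 1.0, "get weather"),
    ("weather-service", "description", 0.5, "weather forecast"),
    ("stock-service", "wsdl", 1.0, "get quote"),
]

vectors = weighted_term_vectors(passages, type_weights)
for service, terms in vectors.items():
    print(service, terms)
```

Keying the vectors by service rather than by source document gives one example per service, which is the unit the clustering is meant to operate on.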