text mining: process documents -

simon_knoll · June 2010

Hello all,
I want to cluster a set of XML documents by extracting content from them with XPath queries. Once I have extracted the content from the XML files, I want to cluster the documents with a given clustering algorithm (e.g. k-means).

So I am extracting the textual content like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="688" width="1007">
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="179" y="120">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <process expanded="true" height="688" width="1007">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (3)" width="90" x="458" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operation" value="//wsdl:operation/@name"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true" height="688" width="1007">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (3)" width="90" x="458" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="extraction" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="operation" value="//wsdl:operation/@name"/>
                </list>
                <list key="namespaces">
                  <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
                </list>
                <parameter key="ignore_CDATA" value="false"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

With that I get several examples per file, one for each piece of extracted content (several tags match my XPath query). If I now feed the whole example set to, for instance, a k-means algorithm, it clusters the individual pieces of extracted content (the single results from the XPath query) and not the XML documents according to their extracted content.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="688" width="1007">
      <operator activated="false" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="715" y="750">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="file_pattern" value="*.wsdl"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="xml"/>
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="false" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="246" y="75">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operationName" value="//wsdl:operation/@name"/>
              <parameter key="documentation" value="//wsdl:documentation/text()"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="false" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="514" y="120">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="all" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/work/workspace/MasterThesis/wsdls"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (3)" width="90" x="380" y="75">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="operation" value="//wsdl:operation/@name"/>
            </list>
            <list key="namespaces">
              <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
            </list>
            <parameter key="ignore_CDATA" value="false"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (3)" width="90" x="179" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="extraction" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="operation" value="//wsdl:operation/@name"/>
                </list>
                <list key="namespaces">
                  <parameter key="wsdl" value="http://schemas.xmlsoap.org/wsdl/"/>
                </list>
                <parameter key="ignore_CDATA" value="false"/>
                <parameter key="assume_html" value="false"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" expanded="true" height="76" name="Clustering" width="90" x="581" y="120"/>
      <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Now my question is: what do I have to do so that I can cluster the individual documents according to the content extracted from them? I think there must be a way, since metadata about the file name is included.
I hope I have described my problem comprehensibly.

regards
simon knoll
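
To make the goal concrete outside of RapidMiner, here is a minimal Python sketch of the idea: run the same query per file, but keep all matches of one file together so that each document ends up as a single row. The file names and WSDL snippets are hypothetical stand-ins, and the helper names `operation_names` and `one_row_per_document` are made up for this illustration. This is not how the Text Processing extension does it internally.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Same namespace binding as in the process above.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}


def operation_names(wsdl_text):
    """Return the name attribute of every wsdl:operation element in one document."""
    root = ET.fromstring(wsdl_text)
    # ElementTree's limited XPath cannot select attributes as the last step,
    # so the elements are matched first and the attribute is read afterwards.
    return [
        op.get("name")
        for op in root.iterfind(".//wsdl:operation", WSDL_NS)
        if op.get("name")
    ]


def one_row_per_document(documents):
    """Map each file name to a bag of its extracted values (one row per file)."""
    return {name: Counter(operation_names(text)) for name, text in documents.items()}


# Hypothetical in-memory stand-ins for the files in the wsdls directory.
documents = {
    "a.wsdl": (
        '<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">'
        '<wsdl:portType><wsdl:operation name="getQuote"/>'
        '<wsdl:operation name="listQuotes"/></wsdl:portType>'
        "</wsdl:definitions>"
    ),
    "b.wsdl": (
        '<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">'
        '<wsdl:portType><wsdl:operation name="sendMail"/></wsdl:portType>'
        "</wsdl:definitions>"
    ),
}

rows = one_row_per_document(documents)
for file_name, bag in rows.items():
    print(file_name, dict(bag))
```

Each key in `rows` is one document and each counted value is one attribute of that document, which is the shape a clustering algorithm needs in order to group files rather than single matches.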

simon_knoll · June 2010

Maybe the following example makes it clearer what I want to do. Here, instead of Cut Document and Extract Information, I just use Tokenize. There it works as it should, because Tokenize generates attributes:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="688" width="1007">
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="wsdls" value="/home/simon/wsdls"/>
        </list>
        <process expanded="true" height="688" width="1007">
          <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="252" y="286"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="metadata_file"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="k_means" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
        <parameter key="k" value="4"/>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

TobiasMalbrecht · June 2010

Hi Simon,

did you already notice the Combine Documents operator? I think it should be available in the text processing extension right now. Maybe it is of use to you, e.g. if you put it in the flow after the Cut Document operator.

Kind regards,
Tobias

simon_knoll · June 2010

Hi Tobias,
Thank you for your advice. This brings me directly to my next question.
I am trying to classify different web services. For that I have different types of documents related to each service, so I extract information from the documents and weight it by the type of the document. I also have weight factors for the content within every single document; for instance, terms within longer text passages are more important than terms from shorter ones.

So I mainly have these questions:
* How can I combine different weightings (e.g. from the document level down to the within-document level) during the extraction process?
* If I weight a text passage, how can I retain that weight for all the terms it contains when I tokenize it?
* If I have different documents as sources of information, how can I tell the k-means operator that certain terms describe one example to be clustered (in my case, a service)?
I have tried this by setting the role to id, but that did not help.

I really hope that this explanation of my problem is understandable.
best regards
simon
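
The three questions above can be illustrated with a small Python sketch. It rests on assumptions that are not from this thread: the document-level and passage-level weights are combined by multiplication, the passage weight is simply the passage's token count, and all terms are summed into one vector per service. The names `passage_weight`, `weighted_term_vectors`, and the sample data are hypothetical.

```python
from collections import defaultdict


def passage_weight(tokens):
    """Assumed rule: longer passages count more; here simply the token count."""
    return float(len(tokens))


def weighted_term_vectors(passages, type_weights):
    """Build one weighted term vector per service.

    passages: iterable of (service_id, doc_type, text) tuples.
    type_weights: weight per document type (document-level weighting).
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for service_id, doc_type, text in passages:
        tokens = text.lower().split()
        # Every token inherits the document-level weight times the
        # passage-level weight, so the weight survives tokenization.
        weight = type_weights[doc_type] * passage_weight(tokens)
        for token in tokens:
            # Terms from all documents of one service land in the same vector.
            vectors[service_id][token] += weight
    return {service: dict(terms) for service, terms in vectors.items()}


# Hypothetical weights and passages, for illustration only.
type_weights = {"wsdl": 2.0, "description": 1.0}
passages = [
    ("quoteService", "wsdl", "get quote"),
    ("quoteService", "description", "returns a stock quote"),
    ("mailService", "wsdl", "send mail"),
]

vectors = weighted_term_vectors(passages, type_weights)
for service, terms in vectors.items():
    print(service, terms)
```

In this sketch the grouping key (`service_id`) decides what counts as one example, so a clustering step would receive one row per service regardless of how many source documents contributed to it.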

simon_knoll · July 2010

Hello all,
Does anyone have any advice for me regarding my last post?
