RapidMiner

Contributor II SebastianLoh

Re: Text-classification: Data from XML and multiple keywords

Hi tron42,

can you explain your intention again please? What I understood from

tron42 wrote:


2) I'm building a small csv-example for testing, something like:
   title;abstract;keyword;keyword;keyword;.....

As you can see, I have multiple columns, each with one keyword. Is it possible to mark more than one column as a label? I tried, but when I change the next column, the previous one changes back.


is, that each keyword is an indicator/label. For example keyword1 indicates the sentiment good/bad review for quality, keyword2 indicates good/bad review for service, keyword3 for....

So then you learn on the attribute "abstract" (which you need to process with the Text Processing operators, of course: Process Documents, and inside it at least tokenization and possibly a stopword filter and a filter by length), training one classification model for "quality", one for "service", and so on.

However, you seem to have something different in mind.

Ciao Sebastian
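
Editor's note: the preprocessing steps named above (tokenization, stopword filtering, filtering by length) can be sketched outside RapidMiner as follows. This is plain Python, not the RapidMiner operators themselves; the stopword list, the length limits, and the function name `preprocess` are assumptions made for illustration.

```python
import re

# Hypothetical stopword list for illustration; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "was", "is", "but"}

def preprocess(text, min_len=3, max_len=25):
    """Tokenize on non-letters, lowercase, drop stopwords, filter by token length."""
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [
        t for t in tokens
        if t and t not in STOPWORDS and min_len <= len(t) <= max_len
    ]

tokens = preprocess("The quality of the food was great, but service is slow.")
print(tokens)
```

The resulting token list is what a word-vector step would count; one classifier per label would then be trained on those counts.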
tron42

Re: Text-classification: Data from XML and multiple keywords

Sebastian Loh wrote:

Hi tron42,

can you explain your intention again please? What I understood from

is, that each keyword is an indicator/label. For example keyword1 indicates the sentiment good/bad review for quality, keyword2 indicates good/bad review for service, keyword3 for....



Hi Sebastian,

each keyword is not only an indicator; it describes the text. For example, I have a text about China, so the keywords are: china, asia, hongkong, north korea, ... and many more keywords that characterise the article. I want to learn the relationships between the texts and their keywords, so that I can predict possible keywords for an unknown text.

Regards,
David
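
Editor's note: predicting several keywords per text is a multi-label problem. One common way to handle it (often called binary relevance) is to turn every distinct keyword into its own yes/no label and train one model per keyword, which also sidesteps the limit of one label column. The sketch below only shows the data reshaping, in plain Python; the sample rows and the function name `to_binary_labels` are hypothetical, and nothing here is claimed to be how RapidMiner handles it internally.

```python
import csv
import io

# Hypothetical sample in the title;abstract;keyword;keyword;... layout from the thread.
SAMPLE = """title;abstract;keyword;keyword;keyword
Trade report;Exports from Hong Kong rose;china;asia;hongkong
Summit news;Talks on the peninsula resumed;asia;north korea;
"""

def to_binary_labels(csv_text):
    """Return (rows, vocabulary); each row holds one True/False flag per keyword."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    next(reader)  # skip the header row
    docs = []
    for fields in reader:
        if not fields:
            continue
        # Columns after title and abstract are keywords; ignore empty cells.
        keywords = {k.strip() for k in fields[2:] if k.strip()}
        docs.append((fields[0], fields[1], keywords))
    # Every distinct keyword becomes one binary label column.
    vocabulary = sorted(set().union(*(kw for _, _, kw in docs)))
    rows = [
        {"title": title, "abstract": abstract,
         "labels": {k: (k in kw) for k in vocabulary}}
        for title, abstract, kw in docs
    ]
    return rows, vocabulary

rows, vocabulary = to_binary_labels(SAMPLE)
print(vocabulary)
print(rows[0]["labels"])
```

Each entry of `vocabulary` then serves as the label for one binary classifier, and the keywords predicted for an unknown text are those whose classifier answers yes.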
Contributor II rakirk

Re: Text-classification: Data from XML and multiple keywords

I've been doing something similar to tron42: I want to process XML using XPath and the Extract Information operator. My XPath query should match every title node, but it only returns the first result. I want to extract all of the matching elements from a document, and it seems that Extract Information stops after the first element that matches the XPath query. Below is the process being used, followed by a simple example XML file.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.002">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.1.002" expanded="true" name="Process">
   <process expanded="true" height="431" width="547">
     <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files (3)" width="90" x="179" y="255">
       <list key="text_directories">
         <parameter key="SGML" value="C:\Users\Kirk\Desktop\tests"/>
       </list>
       <parameter key="extract_text_only" value="false"/>
       <parameter key="encoding" value="UTF-8"/>
       <parameter key="create_word_vector" value="false"/>
       <parameter key="prune_below_absolute" value="5"/>
       <parameter key="prune_above_absolute" value="1000000"/>
       <process expanded="true" height="650" width="710">
         <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information (2)" width="90" x="45" y="210">
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries">
             <parameter key="intro_m/d" value="&lt;intro_m\.*&gt;.&lt;/intro\.*&gt;"/>
           </list>
           <list key="regular_expression_queries"/>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="Move 1" value="//title"/>
           </list>
           <list key="namespaces"/>
           <parameter key="assume_html" value="false"/>
           <list key="index_queries"/>
         </operator>
         <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
         <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Process Documents from Files (3)" from_port="example set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>


Here is the XML example:

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book category="COOKING">
 <title lang="en">Everyday Italian</title>
 <author>Giada De Laurentiis</author>
 <year>2005</year>
 <price>30.00</price>
</book>

<book category="CHILDREN">
 <title lang="en">Harry Potter</title>
 <author>J K. Rowling</author>
 <year>2005</year>
 <price>29.99</price>
</book>

<book category="WEB">
 <title lang="en">XQuery Kick Start</title>
 <author>James McGovern</author>
 <author>Per Bothner</author>
 <author>Kurt Cagle</author>
 <author>James Linn</author>
 <author>Vaidyanathan Nagarajan</author>
 <year>2003</year>
 <price>49.99</price>
</book>

<book category="WEB">
 <title lang="en">Learning XML</title>
 <author>Erik T. Ray</author>
 <year>2003</year>
 <price>39.95</price>
</book>

</bookstore>


Results:
<title lang="en">Everyday Italian</title>

Desired results: (4 separate examples)
<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
RM Certified Expert

Re: Text-classification: Data from XML and multiple keywords

Hi,
you will have to use the Cut Document operator together with the XPath query to get all matches as separate documents in the inner subprocess of Cut Document.

Greetings,
  Sebastian
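
Editor's note: the difference between getting only the first match and getting every match as its own item can be illustrated outside RapidMiner with Python's standard `xml.etree.ElementTree`. This is not the Cut Document operator, only a sketch of the same idea on a shortened copy of the bookstore example; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore example from the thread (titles only).
BOOKSTORE = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title></book>
  <book category="WEB"><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)

# find() stops at the first match, like the behaviour described for Extract Information.
first_title = root.find(".//title")

# findall() returns every matching node, one item per title.
all_titles = root.findall(".//title")
titles = [t.text for t in all_titles]

print(first_title.text)
print(titles)
```

The same `//title` query yields one result or four depending on whether the caller asks for a single node or the whole node set, which is why an operator that emits one document per match is needed here.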