XPATH returns no results
Hi,
I am trying to grab the text of abstracts from a journal using XPath in Cut Document. I downloaded a test set, saved it as HTML, and am using Process Documents from Files with a Cut Document operator nested inside. The site I am testing is here:
http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9221.2010.00797.x/abstract
Using Firebug in Firefox, I inspected the element and determined that the XPath is both:
/html/body/div[3]/div/div[5]/div[4]/div[3]/div/div[2]/p
and,
//div[@class='para']
I simplified the first one to: /div/div/div/div/div/div/div[2]/p. I tested both XPath queries online using Google Docs and the extraction worked fine. However, I have not been able to successfully replicate the result in RapidMiner. Am I missing something in the namespace? I have tried various versions of the XPath syntax and the namespace settings. Note that I have run an Extract Content sequence etc. in parallel with a port multiplier and have had no problems getting the text tokenized, turned into word vectors, etc. Here is the XML for just a simple Cut Document inside a Process Documents from Files chain.
My XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="655" width="918">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="149" y="116">
        <list key="text_directories">
          <parameter key="all-pp" value="/Users/williamfchiu/Desktop/politicalpsych_test"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <process expanded="true" height="637" width="867">
          <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="112" y="210">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="fulltext" value="//div/div/div/div/div/div/div[2]/p"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="655" width="919">
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
William
Answers
Try this: //h:div[2]/h:p
The h: prefix refers to the HTML namespace.
regards
Andrew
awchisholm is right: you have to use the h: prefix (//h:...).
Try this XPath command:
//h:div[@class="para"]/text()
This should give you the text in the abstract box of your site.
Remember that you can't just use the XPath commands produced by Firebug or an XPath generator in Firefox as they are. You need to add some things, such as the namespace prefix.
Now you should be able to get some results using our examples.
Greetings
Miguel
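For reference, in the process XML from the original post this suggestion would come down to changing only the query value in the xpath_queries list. This is a minimal sketch, not a tested process: it assumes the h: prefix is resolved by the operator without an explicit entry in the namespaces list, as the replies here imply. The double quotes around para are written as single quotes so the query fits inside the XML attribute.

```xml
<!-- Sketch: same list as in the original post, only the query value differs -->
<list key="xpath_queries">
  <!-- h: is the HTML namespace prefix suggested in the replies above (assumed to be predefined) -->
  <parameter key="fulltext" value="//h:div[@class='para']/text()"/>
</list>
```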
I tried Andrew's suggestion and my first few tries didn't seem to work. I will try the second version and see what happens.
Also, do you (or does anyone) happen to know why I can't connect the output of a Cut Document to the Extract Content or Tokenize operators? The output seems to be "doc", but the latter two report an error message saying that an IOObjectCollection was delivered rather than a Document.
William
If you left the process as you posted it above, there will be no results even with correct XPath queries: you forgot to connect the inner ports of the "Cut Document" operator. Once you connect them, every document part selected by the XPath query is delivered to the results collection. Since every part is treated as a separate document, "Cut Document" produces a collection of documents overall. Operators like "Extract Content" or "Tokenize" only work on a single document. To use them, you can either place them inside the "Cut Document" operator to process each document part, or loop over the elements of the collection afterwards.
Regards
Matthias
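As a rough illustration of the missing connection, the inner process of "Cut Document" might look like the fragment below once the ports are wired. This is a sketch under assumptions: the port names "segment" and "document 1" are inferred from the portSpacing entries in the original post (source_segment, sink_document 1), layout attributes are omitted, and the query is the one suggested earlier in this thread.

```xml
<operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" name="Cut Document">
  <parameter key="query_type" value="XPath"/>
  <list key="xpath_queries">
    <parameter key="fulltext" value="//h:div[@class='para']/text()"/>
  </list>
  <process expanded="true">
    <!-- Assumed port names: passes each matched segment straight to the output collection -->
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
  </process>
</operator>
```

Operators such as "Extract Content" or "Tokenize" would sit between those two inner ports if each segment should be processed individually.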
I actually discovered what you said on my own and have been successful in extracting content and performing text processing. In fact, I ran the process (modified from what I posted at the beginning) successfully last night on a dataset of 500 records. I think I am ready to scale it up some more. Thanks to all for your help.
Best regards,
William