How to use fields created from xpath extraction?

luiz_vidal Member Posts: 14 Contributor II
edited November 2018 in Help

Guys,

I need your help again. This seems simple to build, but I haven't figured out how to do it yet.

For context: I crawled some web pages and extracted the exact pieces of information I wanted via XPath.

As the screenshot below shows, Preço (price) and Descrição (description) were extracted from the pages I crawled.

[Screenshot: Crawl.JPG]

Now I want to use both fields for my data mining process, such as k-NN, random forest, etc.

However, right after the Process Documents from Files operator I want to apply Set Role, Nominal to Text, and so on, but the fields mentioned above do not show up there.

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="112" y="34">
<parameter key="iteration_macro" value="page"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
<parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="cerveja"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\"/>
</operator>
<operator activated="true" class="rename_file" compatibility="8.0.001" expanded="true" height="82" name="Rename File" width="90" x="514" y="238">
<parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\0.txt"/>
<parameter key="new_name" value="%{page}.txt"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
<list key="function_descriptions">
<parameter key="page" value="%{page} + 1"/>
</list>
</operator>
<connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Crawl Web" from_port="example set" to_op="Rename File" to_port="through 1"/>
<connect from_op="Rename File" from_port="through 1" to_port="output 2"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="238">
<list key="text_directories">
<parameter key="cerveja" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
</list>
<parameter key="encoding" value="ISO-8859-1"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Div" value="(//*[@data-trackcheckoutcontainer=&quot;true&quot;])"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Price and Name" width="90" x="380" y="85">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Preço" value="//*[@name=&quot;priceProduct&quot;]/@value"/>
<parameter key="Descrição" value="//*[@name=&quot;productName&quot;]/@value"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Price and Name" to_port="document"/>
<connect from_op="Extract Price and Name" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="340">
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="false" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="340">
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="false" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="581" y="340">
<parameter key="attribute_name" value="metadata_path"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="false" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="715" y="340">
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="8.0.001" expanded="true" height="82" name="k-NN" width="90" x="179" y="34"/>
<connect from_port="training set" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Would someone give me a little help?

Thanks in advance.

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Just toggle on the "Keep Text" parameter in the second Process Documents operator.
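
    In the process XML, that toggle would likely look like the fragment below. This is only a sketch: it assumes the Text Processing extension stores the option under the parameter key keep_text, and it reuses the operator name "Process Documents from Data (2)" from the process posted later in this thread. The inner Tokenize subprocess is left out.

    ```xml
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" name="Process Documents from Data (2)">
      <!-- Assumed key for the "Keep Text" checkbox; keeps the original text attribute in the output -->
      <parameter key="keep_text" value="true"/>
      <list key="specify_weights"/>
      <!-- inner Tokenize subprocess unchanged and omitted here -->
    </operator>
    ```

    With the text kept, the Descrição column should still be present after tokenizing, so later operators can refer to it by name.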

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    From your screenshot, I'm assuming that this is the crawled content going into the Process Documents from Data operator. Will your label attribute remain the same, or will you set it to the Preço attribute?

    Either way, you'll need to use a Set Role operator to turn Preço and Descrição into regular attributes and then apply Nominal to Text to them before you can use them in the Process Documents operator.
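
    As a rough sketch of that step (not a tested process), the two operators could be configured as below. The parameter keys are the ones that appear in the process XML in this thread; using "regular" as the role for both attributes and converting only Descrição to text are assumptions made for illustration.

    ```xml
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" name="Set Role">
      <!-- First attribute and its role -->
      <parameter key="attribute_name" value="Preço"/>
      <parameter key="target_role" value="regular"/>
      <!-- Further attributes go into the additional-roles list as name/role pairs -->
      <list key="set_additional_roles">
        <parameter key="Descrição" value="regular"/>
      </list>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" name="Nominal to Text">
      <!-- Convert only the description column to the text type -->
      <parameter key="attribute_filter_type" value="single"/>
      <parameter key="attribute" value="Descrição"/>
      <parameter key="include_special_attributes" value="true"/>
    </operator>
    ```

    If the attribute names are not offered in the drop-down lists, they can be typed in by hand, exactly as spelled here.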

  • luiz_vidal Member Posts: 14 Contributor II

    Otto,

     

    This is exactly what I tried, but after the Process Documents from Files operator I do not see Preço and Descrição as fields, so I cannot set a role for them. See the screenshot below:

    [Screenshot: SetRole.jpg]

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You probably don't see them because the metadata isn't propagating through. This could be for several reasons, but you can just type in the name of the attribute you want.

    If your label is meant to be the Preço field, then the following process works.

    Update: Just make sure to select the Descrição field in the Nominal to Text operator; I accidentally left it as 'All'.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="45" y="34">
    <parameter key="iteration_macro" value="page"/>
    <process expanded="true">
    <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="136">
    <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
    <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value="cerveja"/>
    </list>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="add_content_as_attribute" value="true"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="output_dir" value="C:\temp"/>
    </operator>
    <operator activated="false" class="rename_file" compatibility="8.0.001" expanded="true" height="68" name="Rename File" width="90" x="447" y="136">
    <parameter key="file" value="C:\temp\0.txt"/>
    <parameter key="new_name" value="%{page}.txt"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="246" y="34">
    <list key="function_descriptions">
    <parameter key="page" value="%{page} + 1"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="82" name="Append" width="90" x="179" y="34"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
    <parameter key="create_word_vector" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:cut_document" compatibility="7.5.000" expanded="true" height="68" name="Cut Document (2)" width="90" x="179" y="34">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="Div" value="(//*[@data-trackcheckoutcontainer=&quot;true&quot;])"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    <process expanded="true">
    <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Price and Name (2)" width="90" x="380" y="34">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="Preço" value="//*[@name=&quot;priceProduct&quot;]/@value"/>
    <parameter key="Descrição" value="//*[@name=&quot;productName&quot;]/@value"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_port="segment" to_op="Extract Price and Name (2)" to_port="document"/>
    <connect from_op="Extract Price and Name (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="document" to_op="Cut Document (2)" to_port="document"/>
    <connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="parse_numbers" compatibility="8.0.001" expanded="true" height="82" name="Parse Numbers" width="90" x="447" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Preço"/>
    </operator>
    <operator activated="false" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="313" y="187">
    <list key="text_directories">
    <parameter key="cerveja" value="C:\temp"/>
    </list>
    <parameter key="encoding" value="ISO-8859-1"/>
    <parameter key="create_word_vector" value="false"/>
    <process expanded="true">
    <operator activated="true" class="text:cut_document" compatibility="7.5.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="34">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="Div" value="(//*[@data-trackcheckoutcontainer=&quot;true&quot;])"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    <process expanded="true">
    <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Price and Name" width="90" x="380" y="34">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="Preço" value="//*[@name=&quot;priceProduct&quot;]/@value"/>
    <parameter key="Descrição" value="//*[@name=&quot;productName&quot;]/@value"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_port="segment" to_op="Extract Price and Name" to_port="document"/>
    <connect from_op="Extract Price and Name" from_port="document" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="document" to_op="Cut Document" to_port="document"/>
    <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="340">
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
    <parameter key="attribute_name" value="Preço"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Descrição"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="782" y="34">
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="916" y="34">
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="8.0.001" expanded="true" height="82" name="k-NN" width="90" x="179" y="34">
    <parameter key="k" value="10"/>
    </operator>
    <connect from_port="training set" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <connect from_op="Performance" from_port="example set" to_port="test set results"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Loop" from_port="output 2" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Parse Numbers" to_port="example set input"/>
    <connect from_op="Parse Numbers" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
    <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

  • luiz_vidal Member Posts: 14 Contributor II

    Thomas, 

     

    Thanks for taking the time to analyze it. I noticed that your process is slightly different from mine.

    I am still having some trouble, though: I need to "see" those fields as actual fields, because I want to take a substring of the product description (Descrição).

    When I set the roles manually and then try to generate a new attribute from, say, a substring of the description, it doesn't work because the field is not recognized. =(

    Is there a workaround for this?

    Thanks

     

  • luiz_vidal Member Posts: 14 Contributor II

    Thomas, 

     

    Yeah that worked out, thanks a lot for your help!!
