Web Mining

newbierapid Member Posts: 6 Contributor II
edited November 2018 in Help
Hi all,

I am new to RM and am currently using version 5.0. My objective is to crawl the web with the Crawl Web operator. By specifying regular-expression crawling rules, I am able to save the URLs to an Excel file. The problem is that I cannot see the content belonging to each URL. After getting the content, I need to strip the HTML markup from each page.

Can anyone suggest how to proceed? It would be great if someone could explain this with the operator names in process order.

Thanks

Answers

  • colo Member Posts: 236 Maven
    Hi,

    I'm not really sure where the problem lies, since the description is a bit vague. You are using the "Crawl Web" operator and get URLs but no contents? Then enable the "add pages as attribute" parameter and you will get both. But I have no clue how regular expressions and an Excel file relate to this... Could you provide some more details about what you have done (ideally post your process XML) and where you got stuck?
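
    As a rough sketch, this is how that setting would appear in the process XML. Only the relevant parameter is shown and all other Crawl Web parameters are omitted; the key name is assumed to match the "add pages as attribute" checkbox in the GUI:

    <operator activated="true" class="web:crawl_web" name="Crawl Web">
      <!-- assumed key for the "add pages as attribute" option: keeps each crawled page's content alongside its URL -->
      <parameter key="add_pages_as_attribute" value="true"/>
    </operator>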

    Regards
    Matthias
  • newbierapid Member Posts: 6 Contributor II
    Hi Matthias,

    Sorry for providing so little information. I used the Crawl Web operator to crawl a website. Following your suggestion, I am now able to get the URLs listed. I would like to see the content of each URL; kindly excuse me if this is a silly question. After getting the content, I have to remove all tags from each page and do further processing. Here is my process XML.

    Thanks in advance

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
        <process expanded="true" height="605" width="692">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.003" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
            <parameter key="url" value="http://www.asklaila.com/search/Bangalore/-/shopping malls/?searchNearby=false&amp;ac=true"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_text" value=".*Shopping Malls.*"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Sudheendra\Desktop\b"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="4"/>
            <parameter key="user_agent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 1.1.4322)"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • colo Member Posts: 236 Maven
    Hi,

    Sorry, I'm a bit confused. Both the link and the website content are already there (the latter is stored in the attribute Page). If you want to get rid of the HTML markup, you could do something like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
       <process expanded="true" height="605" width="692">
         <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
           <parameter key="url" value="http://www.asklaila.com/search/Bangalore/-/shopping malls/?searchNearby=false&amp;ac=true"/>
           <list key="crawling_rules">
             <parameter key="follow_link_with_matching_text" value=".*Shopping Malls.*"/>
           </list>
           <parameter key="write_pages_into_files" value="false"/>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="output_dir" value="C:\Documents and Settings\Sudheendra\Desktop\b"/>
           <parameter key="extension" value="html"/>
           <parameter key="max_pages" value="4"/>
           <parameter key="user_agent" value="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 1.1.4322)"/>
         </operator>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="30">
           <parameter key="create_word_vector" value="false"/>
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true" height="607" width="763">
             <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
             <connect from_port="document" to_op="Extract Content" to_port="document"/>
             <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
         <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards
    Matthias
  • newbierapid Member Posts: 6 Contributor II
    Thanks, Matthias.