Web Crawling guide - help much needed



Hi there,


I am new to RapidMiner, but I have a deadline coming up soon and would like some help with web crawling.


I'm doing a crowdsourcing assignment where I need to 'crawl' a website to find detailed information, which I can then subject to further processing. However, I am having trouble running my initial analysis. I've downloaded both the Web Mining and Text Mining extensions, put in the URL to crawl, and tried to add parameters so that the results returned match my URL and the links contain the name of the site itself. I've followed some tutorials and set RapidMiner to save the results to a directory in .txt format.

I'm not sure how 'max crawl depth' translates to actually 'going through' links and pages in my given URL. I want to search through user suggestions in a crowdsourcing project, but there is no way to specify a time window for these results. I set the max depth to 400. I've selected 'add content as attribute' and chosen to write pages to disk. I also put in my user agent before running the analysis.

In one instance, I did manage to get 60 or so text files pertaining to the analysis saved to my directory. Whilst some of these were links I wanted, a lot weren't, and the dates were too recent anyway. I wasn't sure how to further systematise my search criteria.

It is frustrating because I have a whole design set up, but no way to 1) get the data into RapidMiner, or even 2) review the text files reliably and go through them whilst specifying that I want user reviews posted from a certain date. I also don't know how I would include user metadata, such as past voting and commenting history, in the analysis, or whether this is done afterwards. All this information is available on the website itself: when you click on a given idea, the website shows how many ideas this user has submitted, how many votes and comments they've made, etc. I could do this by hand, but I need hundreds, if not over 1,000, different links for a reliable analysis.

If anyone could provide further guidance I would be wholly appreciative. I have a deadline but not much time.





Re: Web Crawling guide - help much needed

Hi milkshake_luva,


Web crawling depends heavily on the structure of the website. This makes a general answer to your problem really difficult ;-)

Could you therefore please provide the URL (in this forum or by PM) and more details about the information you want to retrieve, so I can create a process myself?


Best regards,



Re: Web Crawling guide - help much needed


Hi Edin,


Thanks very much. The assignment is looking at crowdsourcing and the implementation success of 'ideators'. I also want to look at past user (ideator) activity in terms of ideas submitted previously, general voting behaviour, and commenting behaviour.


The idea is essentially to build a networked or 'bundled' state of creativity and subject this to a test: was the idea useful and implemented by an organisation, or not? The website, by the way, is Dell's IdeaStorm, where implementation data is publicly available for each idea. I already have software for subjecting user info to sentiment analysis; I just need the text itself and, hopefully, some kind of organisation to this text. On that point, in terms of a time window, I'd also like to exclude all ideas from the most recent 4 months. So maybe user activity from http://www.ideastorm.com/ in the way of votes, with metadata about prior activity, between, say, summer 2013 and summer 2015.


I'd love to be able to do this stuff myself; I've been reading the RapidMiner manual and it would be great to get some practice. It's just that my deadline is not far away at all, so it's a matter of need above all else.


I hope that was sufficient information for you - I've made this public should any other contributors have success tips for me.  





Re: Web Crawling guide - help much needed

This might get you started:


It takes one page, looks at the content, and stores the content of interest in an example set for further analysis.


What you still need to do is set up the actual crawl logic, and modify the process where needed if you want more / less / other data from the page, but the principle remains the same.
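
On the crawl logic: it may help to see what a 'max crawl depth' setting usually means. The snippet below is only a rough Python illustration, not RapidMiner's implementation. It assumes depth is counted in the usual way (number of link hops from the start URL) and uses a made-up in-memory link graph (`LINKS`) and a hypothetical `follow` rule in place of real HTTP requests and crawling rules.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> links found on it.
LINKS = {
    "/": ["/ideas?page=1", "/about"],
    "/ideas?page=1": ["/idea/1", "/idea/2", "/ideas?page=2"],
    "/ideas?page=2": ["/idea/3"],
    "/idea/1": ["/user/a"],
    "/idea/2": [],
    "/idea/3": [],
    "/about": [],
    "/user/a": [],
}


def crawl(start, max_depth, follow=lambda url: True):
    """Breadth-first crawl; depth is the number of link hops from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in LINKS.get(url, []):
            # `follow` plays the role of a "follow link if it matches" rule
            if link not in seen and follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # Only follow idea listings and idea pages, at most two hops from the start.
    for url, depth in crawl("/", 2, follow=lambda u: u.startswith("/idea")):
        print(depth, url)
```

Under that reading, a depth of 2 or 3 combined with a strict follow rule is normally enough to reach listing pages and the ideas linked from them; a very large depth mostly lets the crawl wander into unrelated pages. The single-page RapidMiner process follows.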


<?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="get page" width="90" x="45" y="34">
        <parameter key="url" value="http://www.ideastorm.com/idea2ExploreMore?v=1487149161389&amp;Type=TrendingIdeas"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="connection_timeout" value="50000"/>
        <parameter key="read_timeout" value="50000"/>
        <parameter key="accept_cookies" value="all"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="340">
        <parameter key="text" value="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&#10;&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&gt;&#10;&lt;xsl:output method=&quot;xml&quot; version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;&#10;&lt;xsl:template match=&quot;/&quot;&gt;&#10;&lt;root&gt;&#10;&lt;xsl:for-each select=&quot;//article[@class='search-result']&quot;&gt;&#10;&lt;xsl:variable name=&quot;idea&quot; select=&quot;h3/a&quot;/&gt;&#10;&lt;xsl:variable name=&quot;added&quot; select=&quot;div[1]/p[@class='date']&quot;/&gt;&#10;&lt;xsl:variable name=&quot;votes&quot; select=&quot;div[1]/p[@class='votes']/em&quot;/&gt;&#10;&lt;xsl:variable name=&quot;details&quot; select=&quot;normalize-space(p[@class='truncatedBody'])&quot;/&gt;&#10;&lt;row idea=&quot;{$idea}&quot; added=&quot;{$added}&quot; votes=&quot;{$votes}&quot; details=&quot;{$details}&quot;/&gt;&#10;&lt;/xsl:for-each&gt;&#10;&lt;/root&gt;&#10;&lt;/xsl:template&gt;&#10;&lt;/xsl:stylesheet&gt;"/>
      </operator>
      <operator activated="true" class="text:html_to_xml" compatibility="7.3.000" expanded="true" height="68" name="HTML to XML" width="90" x="179" y="34"/>
      <operator activated="true" class="text:replace_tokens" compatibility="7.3.000" expanded="true" height="68" name="Replace Tokens" width="90" x="313" y="34">
        <list key="replace_dictionary">
          <parameter key="&lt;html[^&gt;]+&gt;" value="&lt;html&gt;"/>
          <parameter key="(?sm)&lt;!DOCTYPE[^&gt;]+&gt;(.)" value="$1"/>
        </list>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="7.3.000" expanded="true" height="82" name="Combine Documents" width="90" x="45" y="238"/>
      <operator activated="true" class="text:process_xslt" compatibility="7.3.000" expanded="true" height="82" name="Process Xslt" width="90" x="179" y="238"/>
      <operator activated="true" class="text:cut_document" compatibility="7.3.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="646">
        <parameter key="query_type" value="Regular Region"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries">
          <parameter key="row" value="&lt;row./&gt;"/>
        </list>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.3.000" expanded="true" height="68" name="Extract Information" width="90" x="112" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="idea" value="//@idea"/>
              <parameter key="added" value="//@added"/>
              <parameter key="votes" value="//@votes"/>
              <parameter key="details" value="//@details"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="segment" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="646">
        <parameter key="text_attribute" value="pages"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="646">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="pages|query_key"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <connect from_op="get page" from_port="output" to_op="HTML to XML" to_port="document"/>
      <connect from_op="Create Document" from_port="output" to_op="Process Xslt" to_port="xslt document"/>
      <connect from_op="HTML to XML" from_port="document" to_op="Replace Tokens" to_port="document"/>
      <connect from_op="Replace Tokens" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
      <connect from_op="Combine Documents" from_port="document" to_op="Process Xslt" to_port="document"/>
      <connect from_op="Process Xslt" from_port="document" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
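
If it helps to see what the process above pulls out of a page, here is a small Python sketch of the same idea outside RapidMiner. It reuses the selectors from the XSLT (`article[@class='search-result']`, `h3/a`, `p[@class='date']`, `p[@class='votes']/em`, `p[@class='truncatedBody']`) and then applies the kind of time window asked about earlier. Please treat it as an illustration under assumptions: the sample markup, the `YYYY-MM-DD` date format, and the exact window boundaries are all made up for the example, and the real page may well differ.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Made-up, well-formed sample shaped to match the selectors in the XSLT above.
SAMPLE = """<html><body>
<article class="search-result">
  <h3><a href="/idea/1">Backlit keyboards</a></h3>
  <div><p class="date">2014-03-02</p><p class="votes"><em>120</em> votes</p></div>
  <p class="truncatedBody">  Please offer   backlit keyboards. </p>
</article>
<article class="search-result">
  <h3><a href="/idea/2">Linux preinstalled</a></h3>
  <div><p class="date">2016-11-20</p><p class="votes"><em>45</em> votes</p></div>
  <p class="truncatedBody">Ship laptops with Linux.</p>
</article>
<article class="search-result">
  <h3><a href="/idea/3">Matte screens</a></h3>
  <div><p class="date">2013-07-15</p><p class="votes"><em>300</em> votes</p></div>
  <p class="truncatedBody">Offer a matte display option.</p>
</article>
</body></html>"""


def extract_rows(xhtml):
    """Return one dict per idea, mirroring the <row .../> elements of the XSLT."""
    root = ET.fromstring(xhtml)
    rows = []
    for article in root.iterfind(".//article[@class='search-result']"):
        div = article.find("div")  # first child div, like div[1] in the XSLT
        added_text = div.findtext("p[@class='date']").strip()
        rows.append({
            "idea": article.findtext("h3/a", default="").strip(),
            # Assumed date format; adjust to whatever the site really shows.
            "added": datetime.strptime(added_text, "%Y-%m-%d").date(),
            "votes": int(div.findtext("p[@class='votes']/em")),
            # Collapse whitespace, like normalize-space() in the XSLT.
            "details": " ".join(
                article.findtext("p[@class='truncatedBody']", default="").split()
            ),
        })
    return rows


def in_window(rows, start, end):
    """Keep only rows whose 'added' date lies in [start, end]."""
    return [row for row in rows if start <= row["added"] <= end]


if __name__ == "__main__":
    # "Summer 2013 to summer 2015", with boundaries chosen arbitrarily here.
    for row in in_window(extract_rows(SAMPLE), date(2013, 6, 1), date(2015, 8, 31)):
        print(row["added"], row["votes"], row["idea"])
```

Inside RapidMiner, the equivalent of `in_window` would be a filter on the `added` attribute once it has been converted to a date; the Python version is only meant to make the logic explicit.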