Webcrawling - problem with storing sites
Hey,
I'm working on a web crawling project to analyse various crowdfunding sites' projects via text mining in RapidMiner 5. I have already built a working text analyser, but I'm stuck at the web crawling part. The problem is that the web crawler does crawl through the requested sites, but it doesn't store them. I have tried experimenting with page size, depth and the like, but the program still just skips those sites. The problem is probably with my storing rules. When crawling Kickstarter's sites, they look like the following:
Follow with matching URL:
.+kickstarter.+

Store with matching URL:
https://www\.kickstarter\.com\/projects.+
http://www\.kickstarter\.com\/projects.+
(?i)http.*://www\.kickstarter\.com\/projects.+

An example URL that would need to be stored is (no advertising intended):
http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
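Since RapidMiner runs on the JVM, the store/follow rules should behave like standard Java regular expressions matched against the whole URL, so one way to check the patterns outside RapidMiner is a small Java sketch (a sketch assuming java.util.regex full-match semantics; the patterns and URL are the ones quoted above):

import java.util.regex.Pattern;

public class StoreRuleCheck {
    public static void main(String[] args) {
        // Example project URL that should be stored (quoted above)
        String url = "http://www.kickstarter.com/projects/corvuse/"
                + "bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight";

        // The follow rule and the attempted store rules, exactly as quoted above
        String[] rules = {
            ".+kickstarter.+",
            "https://www\\.kickstarter\\.com\\/projects.+",
            "http://www\\.kickstarter\\.com\\/projects.+",
            "(?i)http.*://www\\.kickstarter\\.com\\/projects.+"
        };

        for (String rule : rules) {
            // Pattern.matches() requires the pattern to cover the entire URL,
            // which is the interpretation assumed here for a "matching URL" rule
            System.out.printf("%-50s -> %s%n", rule, Pattern.matches(rule, url));
        }
    }
}

Any pattern that prints false here cannot match that page in the crawler either (under the same assumption); note in particular that the https:// variant cannot match an http:// URL.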
And the log looks like the following:

Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.

As you can see, the crawler follows these links but simply skips them without storing anything, and it never says that they failed to match the filter rules and were therefore discarded, so I'm not even sure it compares these links against the rules at all. The log contains many lines starting with "Following link..." but very few starting with "Discarded page...". Does that mean it only checks a few pages, or just that it doesn't notify me about every discarded page?
Thanks in advance!
Cheers
Answers
My parameters:
url: http://connect.jems.com/profiles/blog/list?tag=EMS
store with url, follow with url: .+blog.+
output directory: C:\Program Files\Rapid-I\myfiles\webcrawl
extension: html
max pages: 20
max depth: 20
domain: web
delay: 500
max threads: 2
max page size: 500
obey robot exclusion: T
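As a quick sanity check of the .+blog.+ rule above (a sketch assuming the rule is evaluated as a standard Java regular expression against the full URL), you can test it against the start URL to confirm the rule itself accepts the seed page:

import java.util.regex.Pattern;

public class BlogRuleCheck {
    public static void main(String[] args) {
        // Store/follow rule and start URL from the parameters above
        String rule = ".+blog.+";
        String startUrl = "http://connect.jems.com/profiles/blog/list?tag=EMS";

        // Full-match test, the interpretation assumed for a "matching URL" rule
        System.out.println(rule + " matches start URL: " + Pattern.matches(rule, startUrl));
    }
}

This prints true, so the regular expression itself does accept the seed URL.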
The output only contains the first page. This is contrary to the SimaFore instructions and example
http://www.simafore.com/blog/bid/112223/text-mining-how-to-fine-tune-job-searches-using-web-crawling
which state that the last file is stored.
I also tried to follow the Vancouver blogspot example
https://www.youtube.com/watch?v=zMyrw0HsREg#t=13
and duplicate its result. In every run, only the first page is stored, although the log shows that the crawler obeys and follows the follow-link rule.
Any help would be greatly appreciated! I am getting really frustrated with this.
My code is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
        <parameter key="url" value="http://connect.jems.com/profiles/blog/list?tag=EMS"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+blog.+"/>
          <parameter key="follow_link_with_matching_url" value=".+blog.+"/>
        </list>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Program Files\Rapid-I\myfiles\webcrawl"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="20"/>
        <parameter key="max_depth" value="20"/>
        <parameter key="delay" value="500"/>
        <parameter key="max_threads" value="2"/>
        <parameter key="max_page_size" value="500"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>