"[URGENT] RE: Web Mining/Crawl Web"

majorproject · July 2011

Hello,

I have been very impressed with RapidMiner ever since I came across this program. So far my experience with RapidMiner has been superb and things went smoothly, until I got stuck on this problem.

I'm currently working on an assignment in which we have to analyse trends in the IT job market (e.g. which IT jobs are in high demand in the industry).

I have read through some of the posts on the forum but couldn't find the answer to my question. The problem is that I wasn't able to crawl all the IT-related jobs from job recruiting websites such as www.jobstreet.com.sg.

With the parameters below, the most I could crawl was roughly 100 txt files containing page source code, but most of the time only about 60 of them are what I'm looking for, namely job information pages (e.g. http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J). The rest are search result pages (e.g. http://job-search.jobstreet.com.sg/singapore/job-opening.php?area=1&option=1&specialization=192&job-posted=0&src=19&sort=1&order=0&classified=1&job-source=64&src=19&srcr=2).

This is how I set the parameters.

For the depth, I've set it as: 2

I've set the URL as: http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/

I've also set the rules as:

store_with_matching_content: .*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*

follow_link_with_matching_text: .*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*

This is the XML code of the process:

<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="190" width="145">
<operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="75">
<parameter key="url" value="http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_text" value=".*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*"/>
<parameter key="store_with_matching_content" value=".*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*"/>
</list>
<parameter key="output_dir" value="C:\Users\student\Desktop\CRAWLED RESULT"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

It would be awesome if anyone could share their knowledge and give us a few tips, or a step-by-step guide, on how to obtain the information we want. We would very much appreciate any help.

Thanks,

Miguel_B_scher · July 2011

Hi.

If I understand correctly, you want to get all the jobs (pages) that are listed at http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/.
That also means all the jobs on the next XXX result pages.

If you really want to get all pages and jobs, you will have to use more operators and your process will get more complex.
I would do it this way:

First of all, you will have to get all pages with job links, meaning all the result pages that are listed at the bottom of the page (1, 2, 3, 4 ... Next).
Do this with a Loop operator that goes through all the links like:
http://job-search.jobstreet.com.sg/singapore/job-opening.php?area=1&option=1&specialization=191%2C192%2C193&job-source=1%2C64&classified=1&job-posted=0&sort=1&order=0&pg=1&src=16&srcr=33

increasing the number in the pg parameter (pg=1 in the URL above) on each iteration.

This should be very easy. You just have to read out the total number of jobs (currently 3,137) with an XPath query or a regular expression, divide it by 20, and round up. That gives you the last page number: 157 in our case.
Then you will have all the pages with jobs on them.
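
The page arithmetic can be sketched outside RapidMiner. The Python snippet below is only an illustration, not a RapidMiner process: it assumes 20 jobs per result page and that the `pg` query parameter selects the page, as in the URL above. The function names `last_page` and `page_urls` are made up for this example, and nothing is downloaded.

```python
import math
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

JOBS_PER_PAGE = 20  # assumption from the post: total jobs divided by 20


def last_page(total_jobs, per_page=JOBS_PER_PAGE):
    """Number of result pages needed to list total_jobs entries (rounded up)."""
    return math.ceil(total_jobs / per_page)


def page_urls(first_url, total_jobs):
    """Return one URL per result page by rewriting the pg query parameter."""
    parts = urlsplit(first_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(1, last_page(total_jobs) + 1):
        # Keep every parameter as it is, except pg, which gets the page number.
        new_query = [(k, str(page) if k == "pg" else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


first = ("http://job-search.jobstreet.com.sg/singapore/job-opening.php"
         "?area=1&option=1&specialization=191%2C192%2C193&job-source=1%2C64"
         "&classified=1&job-posted=0&sort=1&order=0&pg=1&src=16&srcr=33")
urls = page_urls(first, 3137)
print(len(urls))   # 157
print(urls[-1])    # the same URL with pg=157
```

In RapidMiner the same list of URLs would be produced by the loop, with the page number supplied by the loop's iteration macro.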
After that you can use a regular expression in the "Cut Document" operator to get all job links. That's a far better method than searching for links by words. You can be pretty sure to get only links to real jobs and not advertisements etc.
You can just make a regular expression that searches for links like:
http://www.jobstreet.com.sg/jobs/2011/ in the pages that you crawled before.

After that you should have all the job links from all pages, which you can save or crawl directly.
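
As a rough illustration of such a link pattern, here is a small Python sketch. The pattern is an assumption based on the single example job URL in this thread, the sample HTML is invented, and `extract_job_links` is a hypothetical helper. Inside RapidMiner the regular expression would go into the operator's query instead.

```python
import re

# Assumed shape of a job link, based on the example URL in this thread:
# http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J
# The character class stops the match at quotes, whitespace or angle brackets.
JOB_LINK = re.compile(r"http://www\.jobstreet\.com\.sg/jobs/2011/[^\"'\s<>]+")


def extract_job_links(html):
    """Return the distinct job links found in html, in order of appearance."""
    return list(dict.fromkeys(JOB_LINK.findall(html)))


# Invented sample: two links to the same job and one pagination link.
sample = """
<a href="http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J">Software Engineer</a>
<a href="http://job-search.jobstreet.com.sg/singapore/job-opening.php?pg=2">Next</a>
<a href="http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J">Apply</a>
"""
print(extract_job_links(sample))
```

Only the job link is returned, once; the pagination link is ignored because it does not start with the job URL prefix.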

It will be almost impossible to get all your pages with all the "real" job links by only using the Get Page operator.
So take a look at the operators "Process Documents", "Loop Examples", "Cut Document" etc.

Hope I could help. If something is not clear, just ask.

Cheers
Miguel

majorproject · July 2011

Hi Miguel,

Thanks for the almost immediate help, really appreciate it.

I'm new to this and am not really sure what to do with the operators. Could you kindly give me a few examples of how to use them? I can't seem to find tutorials on the net for operators such as "Process Documents", "Loop Examples" and "Cut Document".

Firstly, you said that I should list all the pages with job links and loop over those links.
So, am I right to assume that I'm supposed to use the Loop Examples operator and set the iteration macro attribute as: pg= or pg=157
If I'm not wrong, this will allow the crawler to go through every page of available jobs. Is that right?

Secondly, how does Cut Document work and how did you use it? If possible, could you give me a short example of how I am supposed to use it?

I've tried setting the attribute name as: Requirements
and I've also tried setting the query expression as: <p align='justify'> and: </p>
If my guess is correct, it should take everything between the opening and closing tags, giving me only the text within the requirement tags.
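
To show what that pair of tags is meant to capture, here is a small Python sketch of the same idea outside RapidMiner. It is an illustration only and does not show how Cut Document works internally. The sample HTML is invented, `requirement_blocks` is a hypothetical helper, and the pattern assumes the page writes the attribute with single quotes exactly as above.

```python
import re

# Non-greedy match between the opening and closing tag;
# re.DOTALL lets "." also match line breaks inside the paragraph.
REQUIREMENTS = re.compile(r"<p align='justify'>(.*?)</p>", re.DOTALL)


def requirement_blocks(html):
    """Return the text of every <p align='justify'> ... </p> region."""
    return [block.strip() for block in REQUIREMENTS.findall(html)]


# Invented sample page fragment.
sample = """<p align='justify'>
Degree in Computer Science. 2 years of experience.
</p>
<p>Unrelated footer</p>"""
print(requirement_blocks(sample))
```

The match is non-greedy so that it stops at the first closing tag; a greedy `.*` would run on to the last `</p>` on the page.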

The next question: am I supposed to use 'Process Documents from Web' or 'Process Documents', or is it all right to continue with the 'Crawl Web' operator? I'm seriously lost here, because from what I can see 'Process Documents from Web' seems like a mixture of both the 'Process Documents' and 'Crawl Web' operators. And if there's a need to use Process Documents, can you please advise me on how it works?

Lastly, you mentioned making a regular expression that searches for links like http://www.jobstreet.com.sg/jobs/2011/ in the pages crawled before.

So is it right to say that I can now set the URL as ' http://www.jobstreet.com.sg/jobs/2011/ ' instead of ' http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/ '?

Thanks for the reply, you were a great help. More than I could have asked for. ;D
Looking forward to your reply.

Grateful
majorproject
