RapidMiner

Newbie lukei_11

crawling rules for "store_with_matching_content" without regular expression

Hi everyone,

I'm using RapidMiner Studio and I have a problem with the "store_with_matching_content" crawling rule in the "Crawl Web" operator. I want to collect the contact information from several websites with different URL structures. Because of these different URL structures and the large number of sites, I use the store_with_matching_content rule to get to the contact page of each site and save it. Unfortunately, the crawler saves every page of a website on which it finds the pattern "contact", even when the term is only the label of a link in the site structure, instead of saving only the page that actually holds the contact information.

So my question is: is there a way to limit the matched content to a specific position in the HTML file? In other words, can I set up a rule like "when you find 'contact' between p, h1, h2 or h3 tags, save the page; when you find 'contact' between a tags, don't save the page"?

I know how to do it with regular expressions, but the store_with_matching_content rule doesn't seem to accept regular expressions, only a fixed term.

Do you have any idea how to solve this issue? I would be really grateful.

Thank you.

Lukei_11

1 REPLY
RM Certified Expert

Re: crawling rules for "store_with_matching_content" without regular expression

I don't think you can do this directly with the Crawl Web operator, but you should be able to do it after you store the full pages, by using either Cut Document or Extract Content inside the subprocess of a Process Documents from Data operator.
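
For illustration only, the same "store everything, filter afterwards" idea can be sketched outside RapidMiner in plain Python. This is not a RapidMiner operator or API; the names `ContactFinder` and `is_contact_page` are hypothetical, and the sketch assumes reasonably well-formed HTML in which p, h1, h2, h3 and a tags are explicitly closed. It keeps a page only if the term appears as text inside a p/h1/h2/h3 element and not inside a link.

```python
from html.parser import HTMLParser

# Tags whose text counts as real page content (taken from the question above).
CONTENT_TAGS = {"p", "h1", "h2", "h3"}


class ContactFinder(HTMLParser):
    """Hypothetical helper: flags a term found in content tags but outside links."""

    def __init__(self, term="contact"):
        super().__init__()
        self.term = term.lower()
        self.content_depth = 0  # how many p/h1/h2/h3 elements are currently open
        self.link_depth = 0     # how many <a> elements are currently open
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Count the term only inside a content tag and outside any link.
        in_content = self.content_depth > 0 and self.link_depth == 0
        if in_content and self.term in data.lower():
            self.found = True


def is_contact_page(html, term="contact"):
    """Return True if `term` occurs in p/h1/h2/h3 text that is not link text."""
    finder = ContactFinder(term)
    finder.feed(html)
    finder.close()
    return finder.found
```

Running `is_contact_page` over each stored page and discarding the pages for which it returns False would leave only the pages where "contact" appears as a heading or in body text. Pages that rely on implicitly closed p tags would need a more forgiving parser than this sketch.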

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts