RapidMiner

crawling rules for "store_with_matching_content" without regular expression

SOLVED
Learner III lukei_11

Hi everyone,

I'm using RapidMiner Studio and have a problem with the "store_with_matching_content" crawling rule in the "Crawl Web" operator. I want to collect the contact information from several websites with different URL structures. Because of these different URL structures and the large number of sites, I use the store_with_matching_content rule to find the contact page of each site and save it. Unfortunately, the crawler saves every page of a website on which it finds the pattern "contact", even when the word is only the label of a link in the site navigation, instead of saving only the page that actually holds the contact information.

So my question is: is there a way to limit the matched content to a specific position in the HTML file? That is, can I set up a rule like "when you find 'contact' between p, h1, h2, or h3 tags, save the page; when you find 'contact' between a tags, don't save the page"?

I know how to do it with regular expressions, but the store_with_matching_content rule doesn't accept regular expressions, only a fixed term.

Do you have any idea how to solve this issue? I would be really grateful.

Thank you.

Lukei_11

1 REPLY
Unicorn
Solution

Re: crawling rules for "store_with_matching_content" without regular expression

I don't think you can do this directly in the Crawl Web operator, but you should be able to do it after storing the full pages, by using either Cut Document or Extract Content inside the subprocess of a Process Documents from Data operator.
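
To illustrate the "store everything first, filter afterwards" idea outside of RapidMiner, here is a rough Python sketch of the rule from the question: keep a page only if the term appears inside p, h1, h2, or h3 text and not inside a link. It uses only the standard library's html.parser. The names ContactTextFinder and is_contact_page are made up for this example, and it assumes reasonably well-formed HTML with closed p tags, so treat it as a starting point rather than a finished filter.

```python
from html.parser import HTMLParser


class ContactTextFinder(HTMLParser):
    """Hypothetical helper: flags a term found in p/h1/h2/h3 text outside <a>."""

    CONTENT_TAGS = {"p", "h1", "h2", "h3"}

    def __init__(self, term="contact"):
        super().__init__()
        self.term = term.lower()
        self.content_depth = 0  # how many p/h1/h2/h3 elements are open
        self.link_depth = 0     # how many <a> elements are open
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.content_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Count the term only in content text that is not part of a link label.
        in_content = self.content_depth > 0 and self.link_depth == 0
        if in_content and self.term in data.lower():
            self.found = True


def is_contact_page(html, term="contact"):
    """Return True if the stored page should be kept under the rule above."""
    finder = ContactTextFinder(term)
    finder.feed(html)
    finder.close()
    return finder.found
```

You would run is_contact_page over the HTML of each page the crawler stored and keep only those for which it returns True. The matching is a case-insensitive substring check, so a word like "contacts" also counts.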

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting and Training by Certified RapidMiner Experts