RapidMiner

Using regex for exclusion of facebook share url in web crawler operator

Hi,
I want to crawl the site:

https://www.bmwgroup.com/en.html
as well as the sites of some other companies.

Therefore I used the regex .+bmwgroup.+en.+
I use "en" because I only want to crawl the English-language pages, and intentionally not "/en" because some URLs contain "en" without a leading slash.
The problem is that the crawler also follows all the social media share links, so crawling takes forever: the share links of Facebook and co. contain the page URL and therefore match the regex too.
How can I exclude Facebook, LinkedIn, Twitter and co.?
I tried something like .+(?!facebook)bmwgroup.+en.+ but without success.
Do you have any ideas? I should add that I can't use a regex like https\:\/\/www\.bmwgroup.+en.+ to skip everything that doesn't start with https://www.bmwgroup, because other links on this site use plain http or begin with http://w3.bmwgroup, so those pages would be ignored. I want to crawl all of those links, just not the social media links.
Could you please help?

1 REPLY
Re: Using regex for exclusion of facebook share url in web crawler operator

You probably do need to pin down the start of the URL, so try something like

https?:\/\/(www|w3)\.bmwgroup.+en.+

 

This lets you crawl both http and https for the www and w3 subdomains, followed by bmwgroup. Because the pattern fixes the start of the URL, other domains are not crawled, while the ones of interest are.
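
If you want to check the patterns outside RapidMiner first, here is a rough sketch in Python. It assumes the crawler rule has to match the whole URL, which is why it uses re.fullmatch. The sample URLs and the helper name should_crawl are made up for illustration, and the last pattern (a negative lookahead placed at the very start) is only a possible alternative for the case where you don't want to list the subdomains. It also shows why the attempt with (?!facebook) directly in front of bmwgroup cannot work: at that position the next characters are always "bmwgroup", so the lookahead never fails.

```python
import re

# Pattern from the reply: the scheme and subdomain are fixed at the start.
ANCHORED = re.compile(r"https?://(www|w3)\.bmwgroup.+en.+")

# Patterns from the question, kept for comparison.
ORIGINAL = re.compile(r".+bmwgroup.+en.+")
MISPLACED_LOOKAHEAD = re.compile(r".+(?!facebook)bmwgroup.+en.+")

# Possible alternative (an assumption, not from the thread): reject any URL
# that contains one of the listed names anywhere, checked from position 0.
EXCLUDE_SOCIAL = re.compile(r"(?!.*(facebook|twitter|linkedin)).+bmwgroup.+en.+")


def should_crawl(url: str, pattern: re.Pattern = ANCHORED) -> bool:
    """Return True if the whole URL matches the given rule."""
    return pattern.fullmatch(url) is not None


# Hypothetical sample URLs, for illustration only.
urls = [
    "https://www.bmwgroup.com/en.html",
    "http://w3.bmwgroup.com/en/news.html",
    "https://www.facebook.com/sharer.php?u=https://www.bmwgroup.com/en.html",
    "https://twitter.com/share?url=https://www.bmwgroup.com/en.html",
]

if __name__ == "__main__":
    for url in urls:
        print(should_crawl(url), should_crawl(url, EXCLUDE_SOCIAL), url)
```

With these sample URLs, both the anchored pattern and the lookahead-at-start pattern accept the two bmwgroup pages and reject the two share links, while the two patterns from the question accept the Facebook link as well.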