"Using Crawl Web Operator"

Quorn Member Posts: 3 Contributor I
edited May 2019 in Help
Hello,

I have been very impressed with Rapidminer and I can see how powerful it is.

The difficulty I have with it - and I can see this is not unusual - is that I do not have the technical knowledge to use it fully, nor do I have even a basic understanding of the terms you use to explain how to use it.

At the moment I am mainly interested in the Web Mining and Text Processing operators.

I have tried using Crawl Web, and my attempt was successful. But of course, if I allow the depth to be more than about 2 I begin to crawl all sorts of sites I am not interested in so I need to restrict it.

Unfortunately I do not understand how to apply the rules.

store_with_matching_url
store_with_matching_content
follow_link_with_matching_url
follow_link_with_matching_text

I have tried playing with each of them but when I do I get no results.

For example, I want to crawl the business listings site http://www.domainname.com
So I set that as the URL parameter, and I open the Crawling Rules.

I set: follow_link_with_matching_url with the value http://www.domainname.com because I only want to follow onsite links. But when I do that and press 'Run' it goes to the http://www.domainname.com address and finishes.

So I tried using the 'set regular expression' dialog box and added a variety of the constructs and shortcuts suggested there. But each time I get no results. I tried all kinds of different arrangements, including http://www.domainname.com* and http://www.domainname.com/*, and tried the period and most of the others in some arrangement or other, but never got any results for any of them.

So I am left with using the Crawl Web operator at about 2 or 3 depth, then extracting the relevant URLs, then searching again on each of them to get to the necessary depth. This is proving very slow and laborious, and I am certain that all I need is for someone to say 'use this rule = xx' or whatever, and I'll be able to use it properly.

I can see this software wasn't created for beginners. It is clear that a user really needs to understand the technical language you use (I've read your manual all the way through and watched the vids I can understand, but in everything there is a basic assumption of knowledge that I simply don't have).

A simple step by step for each of the operators, just to get them functioning, would be really useful here. But first of all, can someone please tell me what rule I need to type so I don't crawl the whole web?

Thanks,

Answers

  • colo Member Posts: 236 Maven
    Hi Quorn,

    You can avoid leaving a specific domain via the "domain" parameter (set it to server or subtree). The crawling rules allow additional constraints, but you have to use regular expressions there. The expressions you posted don't make sense as regular expressions: if you want to allow any characters, you have to use .* instead of the asterisk alone (as known from wildcards). Regarding your examples, one possible expression could read as follows:
    http://www\.domainname\.com.*
    (you should escape all 'real' dots and other meta characters with a backslash). In most cases follow_link_with_matching_url should do the job as an additional constraint; checking the whole content via store_with_matching_content can take a really long time.
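
    If it helps, here is a minimal Java sketch of how such a rule behaves (RapidMiner runs on Java, so I am assuming the rule values follow java.util.regex syntax; the URLs are placeholders from your post):

    import java.util.regex.Pattern;

    public class CrawlRuleDemo {
        public static void main(String[] args) {
            String rule = "http://www\\.domainname\\.com.*";
            // the escaped dots match literal dots; .* matches anything (or nothing) after the host
            System.out.println(Pattern.matches(rule, "http://www.domainname.com/listings?page=2")); // true
            System.out.println(Pattern.matches(rule, "http://www.othersite.com/about"));            // false
            // an unescaped dot matches ANY single character, which is why escaping matters:
            System.out.println(Pattern.matches("http://www.domainname\\.com.*", "http://wwwXdomainname.com/x")); // true
        }
    }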

    It seems there is some issue with the crawling rules in the current version, so I cannot check the behavior right now.

    Regards,
    Matthias
  • Quorn Member Posts: 3 Contributor I
    Thank you Matthias, I am very grateful for your post.
    The domain attribute does exactly what I needed.

    I am really very pleased!

    You discuss the crawling rules, also - but I don't quite understand what you are saying.
    In fact, I am not clear about what a 'regular expression' is. I thought it was a text string, but it looks like I am wrong.

    You write:
    "If you want to allow any characters you have to use .*
    (that looks like a period followed by an asterisk)

    Can you tell me how this would affect my results? For example, if I set the URL parameter to the main domainnamedotcom, and I set the domain parameter to server, would the results be different in any way if I were to add .* to the crawling rules?

    You go on to suggest:
    http://www\.domainname\.com.*
    is this an alternative to using simply .* or would it produce different results?

    To me, this looks like a repeat of setting the URL to the main domain and setting the domain parameter to 'subtree'. Perhaps you could just explain in simple terms what this would do?

    I am asking additional questions now, so I just want to say thanks for helping with my first question. It is a simple thing but has helped considerably. I have now got a copy of the book 'Data Mining: Practical Machine Learning Tools and Techniques' so I can begin to learn a little more about this topic - although I fear that I will really only get started once I have also taught myself Java.

    Thanks again.
  • colo Member Posts: 236 Maven
    Hello again,

    I'm glad that I could provide some help. I will try to go on with that ;)

    Regular expressions are more than simple text strings. The basic concept is quite simple, but regular expressions are very powerful and can grow really confusing. They allow you to match characters or strings in a given text. The dot, for example, matches any single character, and the asterisk is a quantifier that allows the preceding element to occur any number of times (or not at all) and still produce a match.

    For example, the regular expression (regex) h.llo would match hello as well as hallo, but not heello. The regex hell.* would match hello, hellooo, or just hell.
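
    You can verify these basics with a few lines of Java (just a sketch; Java's regex engine behaves the same way for these examples):

    import java.util.regex.Pattern;

    public class RegexBasics {
        public static void main(String[] args) {
            // the dot stands for exactly one arbitrary character
            System.out.println(Pattern.matches("h.llo", "hello"));  // true
            System.out.println(Pattern.matches("h.llo", "hallo"));  // true
            System.out.println(Pattern.matches("h.llo", "heello")); // false - two characters between h and llo
            // .* allows zero or more arbitrary characters
            System.out.println(Pattern.matches("hell.*", "hello"));   // true
            System.out.println(Pattern.matches("hell.*", "hellooo")); // true
            System.out.println(Pattern.matches("hell.*", "hell"));    // true - .* may also match nothing
        }
    }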

    For more information you should refer to the many widely available sources (a few examples: http://en.wikipedia.org/wiki/Regular_expression, http://www.regular-expressions.info/, http://download.oracle.com/javase/tutorial/essential/regex/index.html).


    As you already assumed, the expression http://www\.domainname\.com.* would do exactly the same as the subtree option for the domain parameter: it allows all links that begin with the respective URL, followed by nothing or anything. I would prefer using the domain option, but I just wanted to give you a clue about writing a regex that fits your intention. If you simply used .*, this would allow every link that is found (leaving out the rule entirely does the same, as far as I know).
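
    In code terms the difference looks like this (again only a sketch with placeholder URLs):

    import java.util.regex.Pattern;

    public class RuleComparison {
        public static void main(String[] args) {
            String anyLink  = ".*";                              // allows every link, no restriction
            String siteOnly = "http://www\\.domainname\\.com.*"; // behaves like the subtree option

            System.out.println(Pattern.matches(anyLink,  "http://www.elsewhere.org/ad"));          // true
            System.out.println(Pattern.matches(siteOnly, "http://www.domainname.com/listings/1")); // true
            System.out.println(Pattern.matches(siteOnly, "http://www.elsewhere.org/ad"));          // false
        }
    }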

    I hope this helps to clarify a bit of the confusion that may have arisen from my last post, and I wish you a good start in your data mining career :)

    Regards,
    Matthias
  • Quorn Member Posts: 3 Contributor I
    Hello Matthias,

    Many thanks again. You have succeeded in explaining this matter very clearly, and I thank you very much for doing so. I now understand what you meant in your first post AND your second. That is great progress for me!

    And thank you for outlining an example of the effects of the .* rule. That was very useful indeed.

    Two further clarifications, if you are able: if I wanted to search a domain for, e.g., the Playstation 3, but I wanted to ensure that I 'caught' both the terms Playstation 3 and PS3, would I create two URL matches, one for each, or would I create just one with two rules?

    Secondly, what if I wanted to search domainnamedotcom but I only wanted to crawl the pages under the playstation category - would I put domainnamedotcom/playstation in the URL field,
    or would I instead leave the URL parameter as domainnamedotcom and create a rule such as
    domainnamedotcom/playstation.*

    or simply:
    playstation.* or playstation

    Thank you again for your great help in all these matters.
  • colo Member Posts: 236 Maven
    Hi Quorn,

    Please excuse the late response - I couldn't visit here during the last week. This seems to be our private thread, btw ;)

    In general, if you want to check several terms that might occur in the same place, regular expressions allow you to use the vertical bar | as an OR operator. If you want to match both 'Playstation 3' and 'PS3' you could write (Playstation 3|PS3): just put the desired terms inside parentheses and the vertical bar between them.
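
    Here is a small Java sketch of the alternation (I am assuming the rule has to match the whole URL or text, which is why the group is wrapped in .* on both sides):

    import java.util.regex.Pattern;

    public class AlternationDemo {
        public static void main(String[] args) {
            // the vertical bar inside the parentheses means OR
            String rule = ".*(Playstation 3|PS3).*";
            System.out.println(Pattern.matches(rule, "http://shop.example.com/PS3-console"));     // true
            System.out.println(Pattern.matches(rule, "Reviews of the Playstation 3 and others")); // true
            System.out.println(Pattern.matches(rule, "http://shop.example.com/xbox"));            // false
        }
    }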

    In your second question you refer to the playstation category. Does the website use the folder notation in URLs to show categories (for example somedomain.com/category), or might something like somedomain.com/category_x_some_topic.html also occur?

    In the first case you could use domain.com/playstation as the entry URL for the crawler (if this resolves to a valid document). But even there you will find links leaving this category, so an additional crawling rule is necessary anyway.

    How the rule should look depends on the structure of the crawled site. If everything about playstation is arranged in the domain.com/playstation/... substructure, you could use the suggested rule domain.com/playstation.* (this would allow anything following the category name). If you want to capture all pages containing 'playstation' in their URL, you could use .*playstation.*, or .*(Playstation|PS3).* if you want to allow different terms.

    If you put (?i) at the start of a rule, the terms are checked case-insensitively (if you write Playstation it will also match playstation - without this option it would not).
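
    To make the case-insensitivity concrete, one more sketch (placeholder URLs again):

    import java.util.regex.Pattern;

    public class CaseInsensitiveDemo {
        public static void main(String[] args) {
            // without (?i) the match is case-sensitive
            System.out.println(Pattern.matches(".*playstation.*",     "http://domain.com/Playstation/news")); // false
            // (?i) at the start switches the whole expression to case-insensitive
            System.out.println(Pattern.matches("(?i).*playstation.*", "http://domain.com/Playstation/news")); // true
            System.out.println(Pattern.matches("(?i).*(playstation|ps3).*", "http://domain.com/PS3-games"));  // true
        }
    }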

    Hope this again helps a bit - if you didn't find this out by yourself in the meantime.

    Regards,
    Matthias
  • rajbanokhan Member Posts: 29 Maven

    How do I write the web crawling rules?

    There are two parts: one is the rule application and the other is the rule value. Can you tell me how to write the rule value?
