"embedded crawler (websphinx) and RegEx"

tsschmidt Member Posts: 2 Contributor I
edited May 2019 in Help
(How) can I use RegEx within that crawler? It did not work...

I tried this several times as follows (see also attachment):
visit_content: ^water$
or
visit_content: \<water\>
or
visit_content: (?s)\<water\>
...

(I don't want waterfall...)

Please don't suggest HTTRACK. As far as I know, HTTRACK cannot filter on page content, only on URLs.

[attachment deleted by admin]

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the crawler does not support regular expressions. These are the only condition types supported for specifying which links to follow:
    follow_url: A link is only followed if the target URL contains all terms stated in this parameter.
    link_text: A link is only followed if the link text contains all terms stated in this parameter.

    The conditions that determine whether a page is stored accept the following parameters:
    visit_url: A page is only stored if its URL contains all terms stated in this parameter.
    visit_content: A page is only stored if its content contains all terms stated in this parameter.
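
    Since the conditions above only test whether terms are contained, one possible workaround is to let the crawler store every page that contains "water" and then filter the stored pages in a separate step. The following is only a rough sketch of that idea in Python, not part of the crawler: the names contains_all_terms, has_whole_word_water and the pages dictionary are made up for illustration, and it assumes that "contains" means plain case-sensitive substring matching, which the parameter descriptions do not state explicitly.

```python
import re
from typing import Iterable

def contains_all_terms(content: str, terms: Iterable[str]) -> bool:
    """Mimic the 'contains all terms' rule (ASSUMPTION: plain substring match)."""
    return all(term in content for term in terms)

# Whole-word pattern. Python uses \b for word boundaries;
# the \< and \> syntax from the question is specific to GNU tools.
WHOLE_WORD_WATER = re.compile(r"\bwater\b", re.IGNORECASE)

def has_whole_word_water(content: str) -> bool:
    """Return True only if 'water' occurs as a separate word."""
    return WHOLE_WORD_WATER.search(content) is not None

# Hypothetical page contents standing in for what the crawler stored.
pages = {
    "a.html": "The waterfall is loud.",
    "b.html": "Fresh water is scarce.",
}

# Step 1: the crawler-style condition keeps both pages.
stored = [name for name, text in pages.items()
          if contains_all_terms(text, ["water"])]

# Step 2: the regex post-filter drops the 'waterfall' page.
kept = [name for name in stored if has_whole_word_water(pages[name])]

print(stored)  # ['a.html', 'b.html']
print(kept)    # ['b.html']
```

    With this two-step approach the term condition only acts as a coarse pre-filter, and the whole-word check happens outside the crawler.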

    Further information can be found at http://nemoz.org/joomla/content/view/64/53/lang,de/

    Greetings,
    Sebastian