Web page selection.

ratheesan Member Posts: 68 Maven
Hi,
How can I select the contents of a particular web page using RM? I tried it with the crawler, but I am getting more pages than I specified.

Thanks,
Ratheesan

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    The question is unclear. What exactly do you mean by "contents"? Do you want to fetch only a specific web page (or list of pages), or do you want to extract information from the web page?
    Please specify.

    Cheers,
    Simon
  • ratheesan Member Posts: 68 Maven
    Hi Simon,
    I want to extract information from a web page. If I can copy the contents of the web page into a text file, I can then apply text mining algorithms. So for now I need to copy the web page into a text file.

    Thanks
    Ratheesan.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I guess you could set the "max_depth" parameter to zero. The crawler then shouldn't follow any links.

    With RapidMiner 5 there will soon be a web mining extension that makes this easier.

    Greetings,
    Sebastian
  • ratheesan Member Posts: 68 Maven
    Hi,

    I tried the method above and saved the result as a text file. The saved text contains HTML tags, image URLs, etc. Is there any way to save only the text (that is, the text a user sees when opening the web page)?

    Thanks,
    Ratheesan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    With 5.0 this would be easy. In 4.x the only option is to set the TextInput content type to html, so that all tags are filtered out.

    Greetings,
    Sebastian
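
For readers who want to see the two suggestions in this thread outside RapidMiner (fetch a single page without following any links, then filter out the tags), here is a minimal sketch in plain Python using only the standard library. It is not the RapidMiner crawler or the TextInput operator; the names `VisibleTextExtractor`, `visible_text` and `save_page_text` are hypothetical and chosen only for illustration, and the sketch assumes that dropping `script`, `style` and `noscript` content is a good enough approximation of "the text a user sees".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts and style sheets."""

    # Elements whose text content is never shown to the reader (assumption).
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text outside skipped elements; drop pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def save_page_text(url, path):
    """Fetch exactly one URL (no links are followed) and save its text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    with open(path, "w", encoding="utf-8") as out:
        out.write(visible_text(html))
```

Calling `save_page_text(url, "page.txt")` with a URL of your choice writes one text file for that single page, which can then be fed to a text mining process. Attribute values such as image URLs are discarded along with the tags that carry them.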