Extracting data from web pages

sijusony Member Posts: 5 Contributor II
Hi,

I am trying to extract data from HTML pages. I tried both regular expressions and XPath queries.

I was able to extract some details using XPath queries, but the HTML page I am extracting from is so complex that I cannot make out the tag hierarchy. That makes it very difficult to specify XPath queries for all the data.

Is there any other method to find out the hierarchy of the HTML, so that I can extract the details using XPath queries?

Regards,
siju sony mathew

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you might try to solve your problem by using tags with certain attribute values as anchors for your XPath query, for example div tags with a class, id, or name attribute.
    For easier orientation in the DOM tree, you could use a DOM explorer, available for every browser. It shows the DOM tree in an explorer-like fashion. Some even support selecting tags by clicking the corresponding area of the web page itself.

    Greetings,
      Sebastian
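
A minimal sketch of the attribute-anchor idea, written in Python with only the standard library rather than in RapidMiner. The sample page, its class and id values, and the helper name `outline` are all made up for illustration. `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so treat this as a way to see the principle, not as a drop-in solution.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real pages will differ.
PAGE = """<html>
  <body>
    <div id="header"><h1>News</h1></div>
    <div class="story">
      <h2>First headline</h2>
      <p class="summary">First summary</p>
    </div>
    <div class="story">
      <h2>Second headline</h2>
      <p class="summary">Second summary</p>
    </div>
  </body>
</html>"""


def outline(element, depth=0):
    """Return the tag hierarchy as indented lines, showing id/class anchors."""
    label = element.tag
    for attr in ("id", "class"):
        if attr in element.attrib:
            label += f"[@{attr}='{element.attrib[attr]}']"
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAGE)

# Print the hierarchy, similar to what a browser's DOM explorer shows.
print("\n".join(outline(root)))

# Anchor on the class attribute instead of spelling out the full path.
headlines = [h2.text for h2 in root.findall(".//div[@class='story']/h2")]
summaries = [p.text for p in root.findall(".//p[@class='summary']")]
print(headlines)
print(summaries)
```

The queries start with `.//`, so they keep working even if the wrapper elements above the anchored `div` change.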
  • sijusony Member Posts: 5 Contributor II
    Hi,

    Thank you for your suggestion. I was able to extract data from some intranet RSS feeds.
    But I am having two problems now:
    1) With the user agent I am using (i.e. RapidMiner's default user agent), I am not able to crawl internet RSS feeds. Is there any user agent with which we would be able to crawl sites? I am trying to crawl www.ndtv.com, but I cannot do so with RapidMiner's default user agent. Is there any method to find out which user agents a website supports?
    2) If the web page does not have well-formed HTML, is there any way to extract the data, given that XPath queries work only with well-formed HTML pages?

    Greetings,
    Siju
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Siju,
    most sites should support one of the most common browsers, especially Internet Explorer, so sending such a browser's user agent string should work. If it does not, the site might exclude crawlers in its robots.txt.
    If XPath does not work, you could use regular expressions to specify the regions of interest.
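
A minimal sketch of the regular-expression fallback, in Python with the standard `re` module rather than in RapidMiner. The malformed fragment and the patterns are invented for illustration. Patterns like these are brittle and would need adjusting to the actual page.

```python
import re

# Hypothetical malformed fragment: unclosed <li> and <b> tags and an
# unquoted attribute value, which a strict XML/XPath parser would reject.
RAW = """<ul class=stories>
<li><a href="/news/1">First <b>headline</a>
<li><a href="/news/2">Second headline</a>
</ul>"""

# Capture each link's href and inner text; DOTALL lets the text span lines.
LINK = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  re.IGNORECASE | re.DOTALL)
# Used to strip any leftover tags from the captured text.
TAG = re.compile(r"<[^>]+>")

links = [(href, TAG.sub("", text).strip()) for href, text in LINK.findall(RAW)]
print(links)
```

The pattern only looks at the region between `<a ...>` and `</a>`, so the broken markup around it does not matter.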

    Greetings,
      Sebastian
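
A minimal sketch of the two checks mentioned above, in Python with only the standard library. It tests whether a robots.txt excludes a given user agent, then builds a request with a browser-like User-Agent header. The robots.txt content, the example.com URLs, and both user agent strings are placeholders, not taken from any real site. Nothing is sent over the network here.

```python
from urllib import request, robotparser

# Hypothetical robots.txt content; a real crawler would download the
# site's own /robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if the parsed robots.txt permits this agent to fetch url."""
    return rules.can_fetch(user_agent, url)


FEED_URL = "http://www.example.com/rss/feed.xml"  # placeholder URL
BROWSER_UA = "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"  # placeholder

print(allowed("BadBot", FEED_URL))
print(allowed(BROWSER_UA, FEED_URL))
print(allowed(BROWSER_UA, "http://www.example.com/private/page.html"))

# Build (but do not send) a request that carries the chosen User-Agent.
req = request.Request(FEED_URL, headers={"User-Agent": BROWSER_UA})
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual fetch. Whether a given site accepts the header still depends on that site's own rules.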