
Extract Information

carl Member Posts: 30 Maven
edited November 2018 in Help

Hi - I'd like to extract only the company names from this web page, https://www.digitalmarketplace.service.gov.uk/g-cloud/search?q=, i.e. the second piece of text in each block. Are Process Documents from the Web and Extract Information the most efficient way to do this? I'm new to RapidMiner and XPath, so could anyone advise on the right XPath query expression to extract only the company name?
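
For readers wondering what such an XPath expression might look like, here is a minimal sketch in Python rather than RapidMiner. The markup, class names (`search-result`, `supplier`), and company names are all hypothetical, since the real page's HTML is not shown in this thread; the expression would need adapting to the actual structure. Note also that Python's `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the search results page.
# The element and class names are assumptions, not the real page structure.
SAMPLE = """<html><body>
<div class="search-result">
  <h2>Cloud hosting service</h2>
  <p class="supplier">Example Hosting Ltd</p>
</div>
<div class="search-result">
  <h2>Data analytics platform</h2>
  <p class="supplier">Sample Analytics plc</p>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Select the supplier paragraph inside each result block.
XPATH = ".//div[@class='search-result']/p[@class='supplier']"
names = [p.text.strip() for p in root.findall(XPATH)]

print(names)
```

The idea is to anchor the query on the repeating block first and then pick the child element that holds the company name, so that other text in the block is ignored.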

Best Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    I can't help with the XPath query (I find XPath to be very finicky), but the attached process using simple string matching should do the trick. You can also do it with RegEx if you prefer.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
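
The attached process is not reproduced here, but the string-matching idea it describes (cutting out whatever sits between a fixed start marker and a fixed end marker) can be sketched in Python. This is an illustration under assumptions, not the RapidMiner process itself: the sample markup, the `supplier` class, and the marker strings are hypothetical.

```python
import re

# Hypothetical markup; the real page's HTML is not shown in this thread.
SAMPLE_HTML = """
<div class="search-result">
  <h2>Cloud hosting service</h2>
  <p class="supplier">Example Hosting Ltd</p>
</div>
<div class="search-result">
  <h2>Data analytics platform</h2>
  <p class="supplier">Sample Analytics plc</p>
</div>
"""


def cut_between(text, start, end):
    """Return every substring found between a start marker and the next end marker."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break
        results.append(text[begin:stop].strip())
        pos = stop + len(end)
    return results


def cut_with_regex(text):
    """Same extraction using a non-greedy regular expression."""
    pattern = r'<p class="supplier">(.*?)</p>'
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]


names = cut_between(SAMPLE_HTML, '<p class="supplier">', "</p>")
print(names)
print(cut_with_regex(SAMPLE_HTML))
```

Both functions return the same list; the plain string version tends to be easier to debug, while the regular expression is more compact.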
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Yes. Because Get Pages returns an ExampleSet rather than a document collection, you may need to add nominal to text and extract document operators as well, and put the subsequent processing inside a loop to iterate over multiple pages.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • carl Member Posts: 30 Maven

    Fantastic! Thank you for sending the sample process. The string approach is so much easier.

    If I wanted to do that for multiple pages, would I replace Get Page with Read Excel + Get Pages? I then seem to get an error when connecting it to Cut Document.

  • carl Member Posts: 30 Maven

    Thank you. I struggled with the loop approach (I will save that for when I've become more adept with RapidMiner), but I have a working solution using Read Excel > Get Pages > Data to Documents > Combine Documents > Cut Document.
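
The shape of that chain (read a list of URLs, fetch each page, merge the pages into one document, then cut out the names) can be sketched in Python as a rough analogy. Everything concrete here is an assumption for illustration: the URLs, the page contents, the `supplier` marker, and the function names are hypothetical, and the pages are faked with a dictionary so the sketch runs without network access.

```python
import re

# Hypothetical stand-ins for downloaded pages. The keys play the role of
# the URL list that Read Excel would supply.
FAKE_PAGES = {
    "https://example.com/search?page=1": '<p class="supplier">Example Hosting Ltd</p>',
    "https://example.com/search?page=2": '<p class="supplier">Sample Analytics plc</p>',
}


def get_pages(urls, fetch):
    """Fetch each URL in order (the role of Get Pages)."""
    return [fetch(url) for url in urls]


def combine_documents(documents):
    """Join all pages into one text (the role of Combine Documents)."""
    return "\n".join(documents)


def cut_document(text, pattern=r'<p class="supplier">(.*?)</p>'):
    """Extract every match of the pattern (the role of Cut Document)."""
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]


urls = list(FAKE_PAGES)                           # URL list
pages = get_pages(urls, FAKE_PAGES.__getitem__)   # one text per URL
names = cut_document(combine_documents(pages))    # merge, then cut

print(names)
```

Combining the pages before cutting avoids an explicit loop, which matches the approach described above; the trade-off is that the result no longer records which page each name came from.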
