
Website-Content into one cell

ds139 Member Posts: 1 Contributor I
edited November 2018 in Help

Hello everyone,

I want to apply text mining methods to song lyrics taken from a website.

What I have now is:

                                                                               

 Artist        Song             Lyrics
 The Killers   Mr. Brightside   http://lyrics.html

 

What I want is:

 Artist        Song             Lyrics
 The Killers   Mr. Brightside   Coming out of my cage and I'm doing just fine...

 

In other words, the lyrics sit inside a <p></p> element on the page, and I want the whole string in one single cell.

I know that I need "Retrieve", "Get Pages" and "Process Documents to Data" (with "Extract Content" inside it), but I don't know how to continue from there.

 

Which operator puts the content of the <p> element into one cell?

I hope someone can help, because I need the lyrics for further processing.

Thank you

 

Answers

    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think you want "Cut Document" rather than (or in addition to) "Extract Content" in this case. After you have retrieved the pages with "Get Pages" and created your text documents with "Data to Documents", you can use "Cut Document" to specify the region of the HTML that you want to extract, using either XPath (if the lyrics are in a named element) or some kind of regex query.
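
    To make the "extract one region of the HTML" idea concrete, here is a minimal sketch in plain Python. It is not a RapidMiner process and not how "Cut Document" works internally. It only shows the same logic: keep the text inside <p> elements and join it into one string for a single cell. The names ParagraphText, lyrics_from_html and sample_html are hypothetical, and the sample page uses placeholder text rather than real lyrics.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect only the text that appears inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while the parser is inside a <p>
        self.chunks = []  # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        elif tag == "br" and self.depth:
            # Treat a line break inside the paragraph as a space.
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            # Separate consecutive paragraphs with a space.
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def lyrics_from_html(html):
    """Return all <p> text of one page as a single whitespace-normalized string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Assumed stand-in for a page fetched from the URL in the Lyrics column.
sample_html = (
    "<html><body><h1>Mr. Brightside</h1>"
    "<p>First line<br>second line</p><p>third line</p>"
    "</body></html>"
)

row = {
    "Artist": "The Killers",
    "Song": "Mr. Brightside",
    "Lyrics": lyrics_from_html(sample_html),
}
print(row["Lyrics"])
```

    If the page has other <p> elements besides the lyrics, the selection would need to be narrower (for example, restricted to a specific container element), which is what a more specific XPath or regex query would do.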

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts