Input file format for Process Documents From File operator

ccricha Member Posts: 9 Contributor II
edited November 2018 in Help

Does anyone know what text structure is expected or can be parsed using the Process Documents from Files operator? I am working on Ch 15 of the book written by Markus Hofmann and Ralf Klinkenberg. They use the Process Documents from Files operator to loop over a bunch of text files containing hotel rating data. An entry for a single hotel looks like this:

 

<Author>everywhereman2
<Content>Truncated for brevity....
<Date>Jan 6, 2009
<Rating>5 5 5 5 5 5 5 5

 

What irks me is that there is absolutely nothing in the documentation for this operator telling me that this is an acceptable text structure that can be parsed. Does anyone happen to know more about this operator?
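For anyone who wants to inspect the same files outside RapidMiner, here is a minimal Python sketch that splits the tagged layout shown above into fields. It assumes each field sits on its own line with a `<Tag>` prefix and that entries are separated by blank lines; the function name `parse_reviews` is made up for illustration. It says nothing about how the Process Documents from Files operator itself treats the file.

```python
import re

# Assumed layout: one "<Tag>value" pair per line, blank line between entries.
TAG_LINE = re.compile(r"^<(\w+)>(.*)$")


def parse_reviews(text):
    """Split tagged review text into a list of dicts, one per entry."""
    reviews, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current entry (assumption).
            if current:
                reviews.append(current)
                current = {}
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            current[tag] = value.strip()
    if current:
        reviews.append(current)
    return reviews


sample = """<Author>everywhereman2
<Content>Truncated for brevity....
<Date>Jan 6, 2009
<Rating>5 5 5 5 5 5 5 5
"""

reviews = parse_reviews(sample)
# The rating line holds several space-separated scores.
ratings = [int(score) for score in reviews[0]["Rating"].split()]
```

Each entry comes back as a dict keyed by tag name, so `reviews[0]["Author"]` gives the author and `ratings` holds the individual scores as integers.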

 

Best Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    The operator reference for the Text Processing extension is a bit sparse.

    What I would do is review the Text Analytics KB and watch these videos on how to properly load and parse text data and build models from it.

     

    I will be recording a very detailed and updated Text Mining in RapidMiner video over the next few weeks.

  • ccricha Member Posts: 9 Contributor II
    Solution Accepted

    Are there plans to update the documentation for this extension? Even just some JavaDoc would be better than nothing.
