
Process document from data

barthosbarthos Member Posts: 20 Contributor II
edited January 2020 in Help
I'm a complete beginner with RapidMiner and am following the video tutorials on text mining.
I have a problem at a very basic level.
I want to use the "Process Documents from Data" operator to compute binary word vectors.
To do so, I load an Excel file with the built-in "Read Excel" operator. My file has a single column with 500 rows, each containing text data. I then send this to the "exa" input of the "Process Documents from Data" operator. Inside the operator, I do some basic processing (tokenize, single case, word filter and token filter). Then I connect the "exa" output of the operator to the results connector.
The problem is that I don't get vectors but only a two-column table: the first column holds the row numbers (1, 2, etc.) and the second is titled "text" but has empty cells. The description of the data is: ExampleSet (437 examples, 1 special attribute, 0 regular attributes). What can I do?

When I put a breakpoint after the "Read Excel" operator, I get (in the results) a two-column table, the first column with Row No. and the second with the rows from my Excel file. So it looks like the file is read properly...


  • Options
    colocolo Member Posts: 236 Maven
    Hi Barthélémy,

    you have to tell the "Process Documents from Data" operator which attribute should be treated as text. With the similar operators for files or documents this is clear: the document or file body is used as text. With an example set, however, many attributes could potentially contain the text, so you have to set this before the processing starts (even if you only have a single attribute). To do so, use the "Nominal to Text" operator after "Read Excel". The attribute of type text is then used as the document content for the processing inside the following operator.
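
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of what binary word vectors look like once the text attribute has been named explicitly. This is not RapidMiner code and not how the operator is implemented; the function names, the `text_attribute` parameter, the tiny stop word list and the `min_len` filter are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOPWORDS = {"the", "a", "is", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def binary_word_vectors(rows, text_attribute, min_len=2):
    """Build binary term occurrence vectors from one named text column.

    rows           -- list of dicts, one per example (like rows of a table)
    text_attribute -- name of the column that holds the text; it must be
                      given explicitly, because a table can have many columns
    min_len        -- tokens shorter than this are dropped
    """
    docs = []
    for row in rows:
        tokens = tokenize(row[text_attribute])
        # Keep each surviving token once per document (binary occurrence).
        kept = {t for t in tokens if t not in STOPWORDS and len(t) >= min_len}
        docs.append(kept)

    # The vocabulary is the sorted union of all surviving tokens.
    vocabulary = sorted(set().union(*docs))
    vectors = [[1 if word in doc else 0 for word in vocabulary] for doc in docs]
    return vocabulary, vectors


rows = [
    {"text": "The cat is on the mat"},
    {"text": "A dog and a cat"},
]
vocabulary, vectors = binary_word_vectors(rows, text_attribute="text")
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row becomes one vector with a 1 for every vocabulary word it contains. If no column is identified as text, there is nothing to tokenize, which matches the empty result described above.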

    Best regards
  • Options
    barthosbarthos Member Posts: 20 Contributor II
    Thanks a lot Mathias, you saved me about two days of work!
    I'd like to offer you a beer. I'm in Paris, what about you?
    Thanks again,
  • Options
    mrfabrittziomrfabrittzio Member Posts: 1 Contributor I

    You sir are brilliant, thx so much!

  • Options
    laurahajnalkalaurahajnalka Member Posts: 3 Contributor I

    Dear Matthias,


    I have a similar problem, but not the same one. I have CSV files with two columns: one contains the words of a document, the other contains the number of occurrences of each word. I would like to filter out the rows I do not need for my model. I used the "Nominal to Text" operator, but I still cannot filter the stopwords, because the "Process Documents from Data" operator does not seem to be working. Whatever I put inside it, the result has 0 rows. What can I do?


    Thank you in advance!

