"text mining"

ghina84 Member Posts: 5 Contributor II
edited May 23 in Help

Hello everybody,

Which operator should I use to load a series of text files (.txt or .xml)?

Thank you,

laura

Answers

  • emolano Member Posts: 13 Contributor II
    The text plugin comes with some examples. They use TextInput, which points to a directory of files.
    You can also use ExampleSource followed by StringTextInput. I learned from the examples :)
  • ghina84 Member Posts: 5 Contributor II

    I tried TextInput too, following the example, but the output of the operator is not what it should be:

    instead of a table with documents as rows and terms as columns, I get a table with COLUMNS=DOCUMENTS
  • emolano Member Posts: 13 Contributor II
    I do not quite understand what you are trying to accomplish. Could you explain your process a bit?
  • ghina84 Member Posts: 5 Contributor II

    Sure :)

    My goal is to analyse a series of articles in .txt format.

    To do this I have to load the .txt files using, for example, TextInput.

    According to this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:

    -ROWS: articles
    -COLUMNS: terms

    (This is stated right after the second image on the page I linked.)
    This matrix, usually called a document-term matrix, tells you which words (columns) each document (row) contains, so it is a sparse matrix of binary values, and it is used in the next steps of the analysis.

    BUT... instead of this, I get a table like this:

    -ROWS: progressive id of the article
    -COLUMNS: article (i.e. the full text of each article becomes the name of an attribute!)

    ...and I don't know:

    1) whether this is correct (I don't think so)

    2) how to solve the problem

    I hope I have explained myself better. Thank you for the reply and for the help!

    ciao,

    laura

  • emolano Member Posts: 13 Contributor II
    Ciao,
    It should be:
    -ROWS: article id
    -COLUMNS: terms
    You see an id number in the rows because id_attribute_type is set to number.
    If you change id_attribute_type to short or long, you get the file name or the file name plus path of the article instead. The idea is that you do not get the whole article, just a reference id to it.
    You should get
    -COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput. Those operators act as filters that produce a cleaner output)
    e
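
For readers who want to see the expected table shape outside RapidMiner, here is a minimal Python sketch of a binary document-term matrix (rows = article ids, columns = terms). It is an illustration only, not RapidMiner's implementation: the function names `load_texts`, `document_term_matrix` and `demo` are hypothetical, the tokenizer is a simple assumed regex, and the `id_type` argument merely mimics the number/short/long behaviour of id_attribute_type described above.

```python
import os
import re
import tempfile


def load_texts(directory, id_type="number"):
    """Read every .txt file in `directory` and return {document_id: text}.

    id_type mimics the three choices described in the thread (assumption):
    "number" -> running number, "short" -> file name, "long" -> full path.
    """
    texts = {}
    names = sorted(n for n in os.listdir(directory) if n.endswith(".txt"))
    for number, name in enumerate(names, start=1):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        if id_type == "number":
            doc_id = number
        elif id_type == "short":
            doc_id = name
        elif id_type == "long":
            doc_id = path
        else:
            raise ValueError(f"unknown id_type: {id_type!r}")
        texts[doc_id] = text
    return texts


def document_term_matrix(texts):
    """Return (doc_ids, terms, rows) with one binary row per document."""
    # Lowercase each text and keep the set of alphanumeric tokens it contains.
    tokens = {doc_id: set(re.findall(r"[a-z0-9]+", text.lower()))
              for doc_id, text in texts.items()}
    doc_ids = list(texts)
    terms = sorted(set().union(*tokens.values()))
    # 1 if the term occurs in the document, 0 otherwise.
    rows = [[1 if term in tokens[doc_id] else 0 for term in terms]
            for doc_id in doc_ids]
    return doc_ids, terms, rows


def demo():
    """Write two tiny articles to a temporary folder and build the matrix."""
    with tempfile.TemporaryDirectory() as folder:
        for name, text in [("a.txt", "Data mining"), ("b.txt", "text mining")]:
            with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        return document_term_matrix(load_texts(folder, id_type="short"))
```

Calling `demo()` returns the ids `a.txt` and `b.txt` as rows and the terms `data`, `mining`, `text` as columns. Each article is referenced only by its id, and its words are spread across the term columns rather than stored as one attribute.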


