
"text mining"

ghina84ghina84 Member Posts: 5 Contributor II
edited May 2019 in Help

Hello everybody,

Which operator should I use to load a series of text files (.txt or .xml)?

thank you,

laura

Answers

  • emolanoemolano Member Posts: 13 Contributor II
    The Text plugin comes with some examples. They use TextInput, which points to a directory of files.
    You can also use ExampleSource and then StringTextInput... I learned from the examples :)
  • ghina84ghina84 Member Posts: 5 Contributor II

    I tried TextInput too, following the example, but the output of the operator is not what it should be:

    instead of a table with documents as rows and terms as columns, I get a table with COLUMNS=DOCUMENTS
  • emolanoemolano Member Posts: 13 Contributor II
    I do not quite understand what you are trying to accomplish. Could you explain your process a bit?
  • ghina84ghina84 Member Posts: 5 Contributor II

    sure  :)

    My goal is to analyse a series of articles in .txt format.

    To do this I have to load the .txt files, for example with TextInput.

    Looking at this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:

    -ROWS: articles
    -COLUMNS: terms

    (This is stated right after the second image on the page I linked.)
    This matrix, usually called a document-term matrix, tells you which words (columns) each document (row) contains. It is a sparse matrix of binary values, and it is used in the next steps of the analysis.

    BUT...instead of this, I get a table like this:

    -ROWS: progressive id of the article
    -COLUMNS: article (i.e. all the text of each article is the label of an attribute!!!)

    ...and I don't know:

    1) if this is correct...but I don't think so

    2) how to solve the problem

    I hope I have explained myself better... thank you for the reply and for the help!!

    ciao,

    laura

  • emolanoemolano Member Posts: 13 Contributor II
    Ciao,
    It should be:
    -ROWS: article id 
    -COLUMNS: terms
    You see -ROWS: id number because you defined id_attribute_type as number.
    If you change id_attribute_type to short or long instead, you will get the filename or the filename plus path of the article. The idea here is that you do not get the whole article, just a reference id to the article.
    You should get
    -COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput. Those operators act as filters to get a better output)
    e
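
For readers who want to see the table shape discussed above outside of RapidMiner, here is a minimal sketch in plain Python. It is not how TextInput works internally. The function names (`tokenize`, `document_term_matrix`, `load_text_files`) and the simple lowercase tokenizer are assumptions made for illustration only. It builds a binary document-term matrix with one row per document, keyed by filename, and one column per term.

```python
import re
from pathlib import Path


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def document_term_matrix(docs):
    """Build a binary document-term matrix.

    docs maps a document id (e.g. a filename) to its raw text.
    Returns (terms, rows): terms is the sorted vocabulary (the columns),
    rows maps each document id to a list of 0/1 values, one per term.
    """
    tokens = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    terms = sorted(set().union(*tokens.values())) if tokens else []
    rows = {
        doc_id: [1 if term in doc_tokens else 0 for term in terms]
        for doc_id, doc_tokens in tokens.items()
    }
    return terms, rows


def load_text_files(directory):
    # Read every .txt file in the directory, using the filename as the document id.
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


if __name__ == "__main__":
    docs = {"a.txt": "Text mining is fun", "b.txt": "Mining data"}
    terms, rows = document_term_matrix(docs)
    print(terms)
    for doc_id, row in rows.items():
        print(doc_id, row)
```

To run it on real articles, pass the result of `load_text_files("some_directory")` to `document_term_matrix`. Each row is then identified by a filename rather than by the article text, which is the same idea as using a short id instead of a number.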


