"text mining"

ghina84 · May 2009

Hello everybody,

Which operator should I use to load a series of text files (.txt or .xml)?

thank you,

laura

emolano · May 2009

The text plugin comes with some examples. They use TextInput, which points to a directory of files.
You can also use ExampleSource and then StringTextInput. I learned from the examples.

ghina84 · May 2009

I tried TextInput too, following the example, but the output of the node is not what it should be:

instead of a table with documents as rows and terms as columns, I get a table with COLUMNS=DOCUMENTS.

emolano · May 2009

I do not quite understand what you are trying to accomplish. Could you explain your process a bit?

ghina84 · May 2009

sure

My goal is to analyse a series of articles in .txt format.

To do this I have to load the .txt files using, for example, TextInput.

According to this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:

-ROWS: articles
-COLUMNS: terms

(This is written right after the second image on the page I linked.)
This matrix, usually called a document-term matrix, tells you which words (columns) each document (row) contains, so it is a sparse matrix of binary values, and it is used in the next steps of the analysis.
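To make the expected shape concrete, here is a minimal sketch in plain Python (not RapidMiner) of a binary document-term matrix. The function names and the simple letters-only tokenizer are assumptions for illustration; the text plugin's own tokenization and filtering will differ.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_document_term_matrix(documents):
    """Return (terms, rows): the sorted vocabulary and one 0/1 row per document.

    Hypothetical helper for illustration only, not a RapidMiner API.
    """
    token_sets = [set(tokenize(text)) for text in documents]
    terms = sorted(set().union(*token_sets))
    # A cell is 1 if the term occurs in the document, 0 otherwise.
    rows = [[1 if term in tokens else 0 for term in terms] for tokens in token_sets]
    return terms, rows

docs = ["The cat sat.", "The dog sat down."]
terms, rows = binary_document_term_matrix(docs)
for doc_id, row in enumerate(rows, start=1):
    print(doc_id, dict(zip(terms, row)))
```

Each row is one article and each column is one term, which is the layout described on the linked page.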

BUT... instead of this, I get a table like this:

-ROWS: progressive id of the article
-COLUMNS: article (i.e. all the text of each article is the label of an attribute!!!)

...and I don't know:

1) whether this is correct (I don't think so)

2) how to solve the problem

I hope I explained myself better this time. Thank you for the reply and for the help!!

ciao,

laura

emolano · June 2009

Ciao,
It should be:
-ROWS: article id
-COLUMNS: terms
You see an id number in the rows because you defined id_attribute_type as number.
If you change id_attribute_type to short or long instead, you will get the filename or the filename plus path of the article. The idea is that each row carries only a reference id to the article, not the whole article.
For the columns you should get
-COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput. Those operators act as filters to get a better output)
e
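As a rough illustration of the three id styles mentioned above, here is a small Python sketch (again not RapidMiner code). The function name `document_ids`, the 1-based numbering, and the mapping of short to filename and long to filename plus path are assumptions drawn from the description in this post.

```python
from pathlib import PurePosixPath

def document_ids(paths, id_type="number"):
    """Return one row id per document path.

    Hypothetical helper that mimics the idea of id_attribute_type:
    "number" gives a running index (1-based here, an assumption),
    "short" gives the filename, "long" gives the filename with its path.
    """
    if id_type == "number":
        return list(range(1, len(paths) + 1))
    if id_type == "short":
        return [PurePosixPath(p).name for p in paths]
    if id_type == "long":
        return [str(PurePosixPath(p)) for p in paths]
    raise ValueError(f"unknown id_type: {id_type!r}")

paths = ["corpus/article_a.txt", "corpus/article_b.txt"]
for id_type in ("number", "short", "long"):
    print(id_type, document_ids(paths, id_type))
```

In every case the id only points back to the article; the article text itself ends up in the term columns.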

Altair RapidMiner Community