GUI Field Question

jaskiemr · December 2009

Hi, I'm doing some simple text analysis and to get started, I'm reading in a number of HTML pages,

jaskiemr · December 2009

Oops, hit enter too soon.

Anyhow, I'm trying to do some text analysis and I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words.
My question is, in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percent of times a word appears in the total set of words or is it the percent of documents that a word appears in?
Also, what is the range column?

I didn't see the answer in the GUI tutorial.

Any help is appreciated. Thanks,
mj

land · December 2009

Hi,
it's quite simple: these columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, of all numerical attributes. If you have loaded your text in TF-IDF representation, they show you the mean and standard deviation of the TF-IDF values. The same goes for the range, whose name is quite self-explanatory, I think...

Greetings,
Sebastian
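
To make the reply above concrete, here is a minimal Python sketch of per-attribute statistics computed over document vectors. It is not RapidMiner code: the `rows` data, the `summarize` helper, and the weights are hypothetical, and it assumes the sample standard deviation, which the thread does not confirm.

```python
from statistics import mean, stdev

# Hypothetical document vectors: one row per document, one weight per term.
# The weights are made up for illustration (e.g. TF-IDF-like values).
rows = [
    {"hello": 0.000, "world": 0.004},
    {"hello": 0.003, "world": 0.000},
    {"hello": 0.003, "world": 0.002},
]

def summarize(rows):
    """Return mean, standard deviation, and (min, max) range per attribute."""
    summary = {}
    for name in rows[0]:
        # Collect this attribute's value from every document.
        values = [row[name] for row in rows]
        summary[name] = {
            "mean": mean(values),
            "stdev": stdev(values),  # sample standard deviation (assumption)
            "range": (min(values), max(values)),
        }
    return summary

summary = summarize(rows)
for name, stats in summary.items():
    print(name, stats)
```

Each statistic is taken over the documents, so the range is simply the smallest and largest weight the term received in any document. That is why a discrete word can still show a continuous range.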

jaskiemr · December 2009

Sebastian, thank you for your reply.

I understand range mathematically; however, what does it mean in the text mining domain? If I have a range of the word "Hello" from 0 to .003 and a mean of .002 (I'm making this up), the discrete nature of the word doesn't fit the definition of the range in my head.

Forgive my empty head.
Again, thanks.
mj

jaskiemr · December 2009

I got home so I can put in a concrete example.
I see that "html" has a value type of "real", average of 0.088 +/- 0.073, range of 0.003 to 0.0530.
mj

jaskiemr · December 2009

Okay, I think I figured part of it out. I've got 2 documents, one w/ "hello world" and another with "hello". Vector_creation is "term occurrences". Mean comes out to 1 for hello since it's in both documents and std dev of 0. World mean is .5 because it's in half of the documents. Can't figure out std dev yet.

land · December 2009

Hi,
why not? It's just the standard deviation of the values of this attribute, regardless of whether they are numbers of occurrences, a TF-IDF representation, or simply temperatures. Where's the problem in calculating a standard deviation from two values?

Greetings,
Sebastian
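
As a rough sketch of the two-document example discussed above ("hello world" and "hello" with term occurrences), the following Python snippet builds the count columns and computes their mean and standard deviation. The thread does not say whether the GUI reports the population or the sample standard deviation, so both are shown. The variable names are illustrative only.

```python
from statistics import mean, pstdev, stdev

# The two documents from the example in this thread.
docs = ["hello world", "hello"]

# Vocabulary and term-occurrence columns: one count per document for each term.
vocab = sorted({token for doc in docs for token in doc.split()})
counts = {term: [doc.split().count(term) for doc in docs] for term in vocab}

for term, values in counts.items():
    print(
        term,
        "values:", values,
        "mean:", mean(values),
        "population std dev:", pstdev(values),
        "sample std dev:", stdev(values),
    )
```

For "world" the column is [1, 0], so the mean is 0.5. The population standard deviation is 0.5, and the sample standard deviation (dividing by n - 1) is about 0.707. For "hello" the column is [1, 1], so both variants give 0.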


Altair RapidMiner Community
