Anyhow, I'm trying to do some text analysis and I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words. My question is: in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percentage of times a word appears in the total set of words, or is it the percentage of documents that a word appears in? Also, what is the range column?

Hi, it's quite simple: These columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, for all numerical attributes. If you have loaded your text in TF-IDF representation, they show you the mean and standard deviation of the TF-IDF values. The same goes for the range, whose name is quite self-explanatory I think...

I understand range mathematically; however, what does it mean in the text mining domain? If I have a range of the word "Hello" from 0 to .003 and a mean of .002 (I'm making this up), the discrete nature of the word doesn't fit in the definition of the range in my head.
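One way to picture this: the word itself is discrete, but each document gets its own numeric value for that word, so the column for "hello" is just a list of numbers, one per document. Here is a minimal sketch in Python, assuming four documents with made-up weights chosen to match the numbers in the question (`hello_weights` and `column_summary` are hypothetical names, not anything from RapidMiner):

```python
from statistics import mean

# Hypothetical per-document weights for the term "hello" (made-up values,
# one entry per document; 0.0 means the term does not occur in that document).
hello_weights = [0.0, 0.002, 0.003, 0.003]


def column_summary(values):
    """Return (minimum, maximum, mean) of one attribute column."""
    return min(values), max(values), mean(values)


low, high, avg = column_summary(hello_weights)
print(f"range: [{low}, {high}]  mean: {avg:.3f}")
```

Read this way, the range is simply the smallest and largest value the term takes in any single document.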

Okay, I think I figured part of it out. I've got 2 documents, one with "hello world" and another with "hello". Vector_creation is "term occurrences". The mean comes out to 1 for "hello", since it occurs once in both documents, with a std dev of 0. The mean for "world" is .5 because it occurs in only one of the two documents. Can't figure out std dev yet.

Hi, why not? It's just the standard deviation of the values of this attribute, regardless of whether they are numbers of occurrences, a TF-IDF representation, or simply temperatures. Where's the problem in calculating a standard deviation from two values?
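For the two-document example from this thread, the calculation can be sketched as follows. This is only an illustration in plain Python: the `docs` list and the `summarize` helper are hypothetical, and the thread does not say whether the statistics view uses the population formula (divide by n) or the sample formula (divide by n - 1), so both are shown.

```python
from statistics import mean, pstdev, stdev

# Term-occurrence vectors for the two documents described in the thread:
# document 1 is "hello world", document 2 is "hello".
docs = [
    {"hello": 1, "world": 1},
    {"hello": 1, "world": 0},
]


def summarize(term):
    """Collect the term's value in every document and summarize that column."""
    values = [doc[term] for doc in docs]
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "pstdev": pstdev(values),  # population formula: divide by n
        "stdev": stdev(values),    # sample formula: divide by n - 1
    }


for term in ("hello", "world"):
    print(term, summarize(term))
```

For "hello" both deviations are 0 because the two values are identical. For "world" the values are 1 and 0, so the mean is 0.5, the population standard deviation is 0.5, and the sample standard deviation is about 0.707.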

## Answers

Anyhow, I'm trying to do some text analysis and I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words.

My question is, in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percent of times a word appears in the total set of words or is it the percent of documents that a word appears in?

Also, what is the range column?

I didn't see the answer in the GUI tutorial.

Any help is appreciated. Thanks,

mj

It's quite simple: These columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, for all numerical attributes. If you have loaded your text in TF-IDF representation, they show you the mean and standard deviation of the TF-IDF values. The same goes for the range, whose name is quite self-explanatory I think...

Greetings,

Sebastian

I understand range mathematically; however, what does it mean in the text mining domain? If I have a range of the word "Hello" from 0 to .003 and a mean of .002 (I'm making this up), the discrete nature of the word doesn't fit in the definition of the range in my head.

Forgive my empty head.

Again, thanks.

mj

I see that "html" has a value type of "real", average of 0.088 +/- 0.073, range of 0.003 to 0.0530.

mj

Why not? It's just the standard deviation of the values of this attribute, regardless of whether they are numbers of occurrences, a TF-IDF representation, or simply temperatures. Where's the problem in calculating a standard deviation from two values?

Greetings,

Sebastian