GUI Field Question

jaskiemr Member Posts: 8 Contributor II
edited November 2018 in Help
Hi, I'm doing some simple text analysis and to get started, I'm reading in a number of HTML pages, 

Answers

  • jaskiemr Member Posts: 8 Contributor II
    Oops, hit enter too soon.

    Anyhow, I'm trying to do some text analysis: I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words.
    My question is, in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percentage of all words that a given word accounts for, or the percentage of documents that the word appears in?
    Also, what is the range column?

    I didn't see the answer in the GUI tutorial.

    Any help is appreciated. Thanks,
    mj
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    it's quite simple: these columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, for all numerical attributes. If you have loaded your text in a TF-IDF representation, they show the mean and standard deviation of the TF-IDF values. The same goes for the range (the minimum and maximum value of the attribute), whose name is quite self-explanatory, I think...

    Greetings,
      Sebastian
  • jaskiemr Member Posts: 8 Contributor II
    Sebastian, thank you for your reply.

    I understand range mathematically, but what does it mean in the text mining domain? If the word "Hello" has a range from 0 to .003 and a mean of .002 (I'm making this up), the discrete nature of the word doesn't fit the definition of range I have in my head.

    Forgive my empty head.
    Again, thanks.
              mj
  • jaskiemr Member Posts: 8 Contributor II
    I got home so I can put in a concrete example.
    I see that "html" has a value type of "real", average of 0.088 +/- 0.073, range of 0.003 to 0.0530.
            mj
  • jaskiemr Member Posts: 8 Contributor II
    Okay, I think I figured part of it out. I've got 2 documents, one w/ "hello world" and another with "hello". Vector_creation is "term occurrences". For hello, the mean comes out to 1, since it's in both documents, with a std dev of 0. For world, the mean is .5 because it's in half of the documents. Can't figure out std dev yet.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    why not? It's just the standard deviation of the values of this attribute, regardless of whether those values are numbers of occurrences, a TF-IDF representation, or simply a temperature. Where's the problem in calculating a standard deviation from two values?

    Greetings,
      Sebastian
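
For anyone who wants to check the numbers by hand, here is a minimal Python sketch of the two-document example above ("hello world" and "hello") with term occurrences as the vector values. It is not RapidMiner code: the helper names `term_occurrences` and `column_stats` are made up for illustration, and the whitespace tokenizer is an assumption. The thread does not say whether the statistics view uses the population or the sample standard deviation, so both are computed.

```python
from statistics import mean, pstdev, stdev

# The two example documents from the thread.
docs = ["hello world", "hello"]


def term_occurrences(documents):
    """Build one column per term: how often the term occurs in each document.

    Assumption: lowercasing plus whitespace splitting stands in for the
    real tokenizer.
    """
    tokenized = [d.lower().split() for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    return {term: [doc.count(term) for doc in tokenized] for term in vocab}


def column_stats(values):
    """Plain column statistics; they do not care what the numbers mean."""
    return {
        "mean": mean(values),
        "pstdev": pstdev(values),  # population standard deviation (divide by n)
        "stdev": stdev(values),    # sample standard deviation (divide by n - 1)
        "range": (min(values), max(values)),
    }


columns = term_occurrences(docs)
stats = {term: column_stats(vals) for term, vals in columns.items()}

for term, s in stats.items():
    print(term, columns[term], s)
```

With these inputs the "hello" column is [1, 1] (mean 1, standard deviation 0, range 1 to 1) and the "world" column is [1, 0] (mean 0.5, range 0 to 1). The standard deviation of [1, 0] is 0.5 with the population formula and about 0.707 with the sample formula, so comparing those two values with what the statistics view shows should tell you which one is in use.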