"Bigram Document Vector"

D_M Member Posts: 15 Maven
edited June 2019 in Help
Hi,

I want to create a document vector consisting only of bigrams.

To do this, I first save the word list using the following operators:

TextInput
  StringTokenizer

and then I am using

TextInput
    StringTokenizer
    TermNgramGenerator
    StopWordFilterFile (using the previously saved wordlist.)

Is there a better way of doing this?

Answers

  • D_M Member Posts: 15 Maven
    :)
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I guess there is. Why don't you just use it this way:
    TextInput
        StringTokenizer
        TermNgramGenerator

    The resulting vector will contain only the bigrams, since the vector is built from the tokens produced by the inner operators. If a single word is not among those tokens, it will not be part of the vector.
    Or did I misunderstand you completely?

    Greetings,
      Sebastian
  • D_M Member Posts: 15 Maven
    Thanks Sebastian for replying.

    I tried using only

    TextInput
      StringTokenizer
      TermNGramGenerator

    The problem is that unigrams end up in the document vector along with the bigrams. I want only bigrams, not unigrams, so I have to use the StopWordFilter to remove the unigrams.

    Please let me know if there is a better way to achieve this.
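
For readers following along, the two-pass workaround described above can be sketched in plain Python. This is only an illustration, not RapidMiner code: the function names `terms_with_ngrams` and `drop_listed_terms` are hypothetical, and it assumes, as the post reports, that the n-gram step emits unigrams together with bigrams joined by an underscore.

```python
def terms_with_ngrams(tokens, max_length=2):
    """Emit every n-gram of length 1..max_length, joined by '_'.

    Assumption: this mimics the behaviour reported in the post,
    where unigrams appear alongside the bigrams.
    """
    terms = []
    for n in range(1, max_length + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


def drop_listed_terms(terms, wordlist):
    """Remove every term found in the saved word list (stop-word style filter)."""
    blocked = set(wordlist)
    return [term for term in terms if term not in blocked]


tokens = "the dog smelled like a skunk".split()
wordlist = set(tokens)             # first pass: the saved unigram word list
terms = terms_with_ngrams(tokens)  # second pass: unigrams plus bigrams
print(drop_listed_terms(terms, wordlist))
# ['the_dog', 'dog_smelled', 'smelled_like', 'like_a', 'a_skunk']
```

The cost of this approach is that the text has to be processed twice, once to build the word list and once to build the filtered vector.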
  • haddock Member Posts: 849 Maven
    What Ho D.M !

    Not really understanding much about anything, I looked on Wikipedia to find out what a bigram was, and found the following:
    An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs according to the application.
    Here http://en.wikipedia.org/wiki/N-gram

    What is the context of your application?
  • D_M Member Posts: 15 Maven
    Sorry if my question was not clear.

    For me, a bigram should be composed of a sequence of two words.

    E.g. for "the dog smelled like a skunk" the bigrams should be xx_the, the_dog, dog_smelled, smelled_like, like_a, a_skunk, skunk_xx.
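
As a rough illustration of the word-level bigrams described above, here is a minimal Python sketch. The function name `word_bigrams` is hypothetical, and the lowercasing and whitespace tokenisation are assumptions; the `xx` boundary token and the underscore separator follow the example in the post.

```python
def word_bigrams(text, pad="xx"):
    """Return word-level bigrams joined by '_', padded at both ends.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; a real tokenizer may behave differently.
    """
    tokens = text.lower().split()
    if not tokens:
        return []
    padded = [pad] + tokens + [pad]
    # Pair every token with its right-hand neighbour.
    return [f"{left}_{right}" for left, right in zip(padded, padded[1:])]


print(word_bigrams("the dog smelled like a skunk"))
# ['xx_the', 'the_dog', 'dog_smelled', 'smelled_like', 'like_a', 'a_skunk', 'skunk_xx']
```

A sentence of n words yields n + 1 bigrams here because of the two boundary tokens.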
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    With RapidMiner 5.0 the result should look exactly like what you are expecting. Otherwise there is no way to change this, but you could filter out the undesired results using the example filter.
    You might specify a regular expression that filters the attributes according to their names.

    Greetings,
      Sebastian
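
The idea of filtering attributes by name with a regular expression can be sketched outside RapidMiner as follows. This is a plain Python illustration, not the RapidMiner operator itself: the names `BIGRAM_NAME` and `keep_bigram_attributes` are hypothetical, and it assumes bigram attributes are named as two words joined by a single underscore, as in the example earlier in the thread.

```python
import re

# Assumption: a bigram attribute name is exactly two underscore-free
# parts joined by one underscore, e.g. "the_dog".
BIGRAM_NAME = re.compile(r"[^_]+_[^_]+")


def keep_bigram_attributes(vector):
    """Keep only the entries whose attribute name looks like a bigram."""
    return {
        name: value
        for name, value in vector.items()
        if BIGRAM_NAME.fullmatch(name)
    }


document_vector = {"the": 1, "dog": 1, "the_dog": 1, "dog_smelled": 1}
print(keep_bigram_attributes(document_vector))
# {'the_dog': 1, 'dog_smelled': 1}
```

One caveat of a name-based filter like this: a single token that itself contains an underscore would be mistaken for a bigram.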
  • D_M Member Posts: 15 Maven
    Thanks, Sebastian, for replying.