ExampleSet2AttributeWeights in RapidMiner 5?

thomas0221 Member Posts: 4 Contributor I
edited November 2018 in Help
Dear RapidMiner experts,

I have a question about how to make the word vector representations of two text datasets consistent (i.e., share the same feature set) in text processing. I have two text datasets. After text processing in RapidMiner, the first dataset generates word vectors with 1000 features. The second dataset may have 2000 unique words, but I want to use the same feature set of 1000 words from the first one. I can do this in RapidMiner 4.6, but I have trouble doing the same thing in RapidMiner 5. Could you give me some advice on how to do this in RapidMiner 5?

One way to make the word vector representations of two text datasets consistent is to use attribute weights. I can do this in RapidMiner 4.6, but I do not know how to do it in RapidMiner 5.

In RapidMiner 4.6, I can use ExampleSet2AttributeWeights to process a dataset, assign a weight of 1 to all attributes, and then write the result to an attribute weight file. Later I can load the weight file via AttributeWeightsLoader to filter the new text dataset. This approach is used in RapidMiner 4.6 in \01_Input\02_TextFromParameterListAndTermList.xml
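
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of what I am trying to achieve. It is not RapidMiner code: the function names (`tokenize`, `build_weights`, `vectorize`) and the toy documents are hypothetical, and the whitespace tokenizer is only a rough stand-in for the tokenizer and length filter in the process below.

```python
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and keep tokens with at least 3 characters
    # (a rough stand-in for StringTokenizer + TokenLengthFilter with min_chars=3).
    return [t for t in text.lower().split() if len(t) >= 3]


def build_weights(docs):
    # Assign weight 1.0 to every term that occurs in the first (reference) dataset.
    return {term: 1.0 for doc in docs for term in tokenize(doc)}


def vectorize(doc, weights):
    # Term-frequency vector restricted to terms with weight > 0,
    # in a fixed (sorted) order so all datasets share the same columns.
    features = sorted(t for t, w in weights.items() if w > 0)
    counts = Counter(tokenize(doc))
    return [counts[t] for t in features]


# Hypothetical toy data: the first dataset defines the feature set.
first_dataset = ["graphics card driver", "hardware driver update"]
weights = build_weights(first_dataset)

# A document from the second dataset is projected onto the same features;
# words unseen in the first dataset ("new", "and", "monitor") are dropped.
vec = vectorize("new graphics hardware and monitor", weights)
```

The weights dictionary plays the role of the attribute weight file: it is built once from the first dataset and then reused to filter every later dataset.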

The problem is that in RapidMiner 5 I cannot find ExampleSet2AttributeWeights, so I cannot generate attribute weights (all 1s for the words appearing in the first text dataset) from an example set.

Could anyone help me with how to do this in RapidMiner 5?

Thanks!
Thomas
<?xml version="1.0" encoding="UTF-8"?>
<process version="4.0">

  <operator name="Root" class="Process">
      <description text="#ylt#h3#ygt#Using a predefined term list#ylt#/h3#ygt##ylt#p#ygt#The TextInput operator determines the dimensions of the vector space (and thus the features used for learning) automatically by default by scanning all texts and possibly pruning terms that occur too often or too seldom. In some applications the user rather wants to provide a predefined list of terms that should be used for this purpose.#ylt#/p#ygt#
#ylt#p#ygt#This can be achieved by passing an AttributeWeights IOObject to the operator. Only features that have a weight higher than zero are used as dimensions of the vector space. The operator for this purpose is called AttributeWeightSelection.#ylt#/p#ygt#
#ylt#p#ygt##ylt#b#ygt#Hint:#ylt#/b#ygt# If you use stemming, the term list must also contain the stemmed version of the terms, otherwise both will not match.#ylt#/p#ygt#"/>
      <operator name="AttributeWeightsLoader" class="AttributeWeightsLoader">
          <parameter key="attribute_weights_file" value="../data/attribute_weights.xml"/>
      </operator>
      <operator name="TextInput" class="TextInput">
          <parameter key="default_content_language" value="english"/>
          <list key="namespaces">
          </list>
          <list key="texts">
            <parameter key="graphics" value="../data/newsgroup/graphics"/>
            <parameter key="hardware" value="../data/newsgroup/hardware"/>
          </list>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars" value="3"/>
          </operator>
      </operator>
  </operator>

</process>