Question on large number of attributes

jngai Member Posts: 7 Contributor II
edited November 2018 in Help
I am new to RM.

I would like to initiate a project to produce a neural network.

In the training data, each instance has 10 parameters, and each parameter takes its value from a pool of 500 non-English phrases. There would be thousands of instances, with one instance per row in Excel.

My first thought is to convert these into 500 true/false variables, each showing whether a given phrase is present.

I am not sure this is the correct way of thinking, and I am wondering whether RM can handle this many parameters. Also, does RM support non-English text? I believe it is in Unicode (I am not very familiar with this either).


I would appreciate it if anyone could point me in the right direction or answer my concerns.

Thanks in advance

Answers

  • SebastianLoh Member Posts: 99 Contributor II

    > I would like to initiate a project to produce a neural network.
    Are you sure a neural network is the method you need? For text mining, Naive Bayes or an SVM may perform better.
    > In the training data, each instance has 10 parameters, and each parameter takes its value from a pool of 500 non-English phrases. There would be thousands of instances, with one instance per row in Excel.
    >
    > My first thought is to convert these into 500 true/false variables, each showing whether a given phrase is present.
    This sounds like a good idea. The following process shows how to do that:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="505" width="949">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="number_examples" value="1000"/>
          </operator>
          <operator activated="true" class="discretize_by_frequency" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
            <parameter key="number_of_bins" value="100"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" expanded="true" height="94" name="Nominal to Binominal" width="90" x="380" y="30"/>
          <connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
          <connect from_op="Discretize" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
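
For readers who want to see the encoding outside RapidMiner, here is a minimal Python sketch of the true/false idea from the question. It is not part of the process above and not a RapidMiner API: the function names `build_vocabulary` and `to_indicators` and the sample phrases are hypothetical, chosen only for illustration. It assumes each row is simply a list of phrases and that one column per distinct phrase is wanted, regardless of which of the 10 parameters the phrase came from.

```python
from typing import List


def build_vocabulary(rows: List[List[str]]) -> List[str]:
    """Collect every distinct phrase, sorted so the column order is stable."""
    return sorted({phrase for row in rows for phrase in row})


def to_indicators(rows: List[List[str]], vocabulary: List[str]) -> List[List[bool]]:
    """Turn each row of phrases into one True/False flag per vocabulary entry."""
    index = {phrase: i for i, phrase in enumerate(vocabulary)}
    encoded = []
    for row in rows:
        flags = [False] * len(vocabulary)
        for phrase in row:
            flags[index[phrase]] = True  # the phrase occurs in this row
        encoded.append(flags)
    return encoded


# Hypothetical sample data: two instances with three phrases each.
rows = [
    ["你好", "谢谢", "再见"],
    ["谢谢", "早上好", "你好"],
]
vocabulary = build_vocabulary(rows)
indicators = to_indicators(rows, vocabulary)

for row, flags in zip(rows, indicators):
    print(row, flags)
```

With a pool of 500 phrases this gives 500 boolean columns per instance. Nominal to Binominal, as far as I know, instead creates one column per attribute/value pair, so with 10 nominal attributes the column count can be larger than 500.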

    > I am not sure this is the correct way of thinking, and I am wondering whether RM can handle this many parameters. Also, does RM support non-English text? I believe it is in Unicode (I am not very familiar with this either).
    RM supports different encodings. You can set the encoding with the "encoding" parameter available in many Read operators.
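
To illustrate why the encoding setting matters, here is a small Python analogy (not RapidMiner code). It writes a few non-English phrases to a temporary CSV file as UTF-8, then reads the file back twice: once with the matching encoding and once with a deliberately wrong one. The file name `phrases.csv` and the sample phrases are assumptions made for the example.

```python
import csv
import os
import tempfile

# Hypothetical sample phrases; any Unicode text would do.
phrases = ["你好", "谢谢", "再见"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phrases.csv")

    # Write the phrases as one CSV row, encoded as UTF-8.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerow(phrases)

    # Reading with the same encoding recovers the original text.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        loaded = next(csv.reader(handle))

    # Reading with the wrong encoding does not fail here; it silently garbles the text.
    with open(path, "r", encoding="latin-1", newline="") as handle:
        garbled = next(csv.reader(handle))

print(loaded == phrases)   # the round trip preserved the phrases
print(garbled == phrases)  # the mismatched encoding did not
```

The same principle applies when importing data into any tool: the encoding chosen at read time has to match the one the file was saved with, otherwise the phrases will no longer match each other.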

    I hope this helps,

    Ciao Sebastian
