"Text Classification with different terms"

dport · April 2010

I would like to classify an example set using a classification model generated from a related but different example set. The terms in the two sets will not be identical. Is it reasonable to supply the word list from the model's training data to the example set I wish to classify?

The process I am experimenting with is listed below. It seems to give pretty decent results, but I have yet to give it a full check (this would require a lot of data preparation).

Any feedback appreciated!


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="448" width="748">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="team_x_risks_no_dups"/>
      </operator>
      <operator activated="true" class="nominal_to_text" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="risk_title risk_desc_risk_keywords_risk_factor_description"/>
        <parameter key="attributes" value="risk_all"/>
      </operator>
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="300">
        <parameter key="repository_entry" value="team_x_risk_cats"/>
      </operator>
      <operator activated="true" class="nominal_to_text" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="300"/>
      <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="300">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="45" y="75"/>
          <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="210"/>
          <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="300"/>
          <operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="458" y="288">
            <parameter key="min_chars" value="3"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="715" y="120"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_nn" expanded="true" height="76" name="k-NN" width="90" x="447" y="300">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="75">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="75"/>
          <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="210"/>
          <operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="120">
            <parameter key="min_chars" value="3"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="30"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="581" y="165">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="k-NN" to_port="training set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="k-NN" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
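
For readers without RapidMiner at hand, the token chain inside each Process Documents operator above (Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length with min_chars=3, Generate n-Grams) can be roughly sketched in Python. This is only an illustration, not the operators' actual implementation: the `preprocess` function, the tiny `STOPWORDS` set, splitting on non-letters, the maximum n-gram length of 2, and the underscore joiner are all assumptions made for the sketch.

```python
import re

# Tiny illustrative stopword list (an assumption; the real operator
# ships with a much larger built-in English list).
STOPWORDS = {"a", "an", "and", "be", "for", "in", "is", "may", "not", "of", "the", "to", "too"}


def preprocess(text, min_chars=3, max_ngram=2):
    """Mimic the operator chain: tokenize, lower-case, filter, build n-grams."""
    # Tokenize: split on anything that is not a letter (assumed mode).
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    # Transform Cases: lower-case every token.
    tokens = [t.lower() for t in tokens]
    # Filter Stopwords (English).
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Filter Tokens (by Length): keep tokens with at least min_chars characters.
    tokens = [t for t in tokens if len(t) >= min_chars]
    # Generate n-Grams (Terms): keep the single terms and append joined n-grams.
    terms = list(tokens)
    for n in range(2, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


print(preprocess("Cruise too short for anomaly resolution"))
```

The example sentence is taken from the team_x_risks_no_dups sample further down the thread.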

Jepse · April 2010

Hey,

can you provide those two input files (team_x_*)?

dport · April 2010

I cannot supply the entire repositories, but here are some samples of the team_x_risk_cats repository in CSV:

subsystem,class,subclass,label,risk
Systems,General Systems Risks,Organizational ,Systems General Systems Risks Organizational ,Multiple collaborating implementing organizations
Systems,General Systems Risks,Organizational ,Systems General Systems Risks Organizational ,Multiple sponsors
Systems,General Systems Risks,Organizational ,Systems General Systems Risks Organizational ,Geographic distribution of collaborating organizations
Systems,General Systems Risks,Organizational ,Systems General Systems Risks Organizational ,International collaborators (ITAR and business model considerations)
Systems,General Systems Risks,Organizational ,Systems General Systems Risks Organizational ,New partner or sponsor
Systems,General Systems Risks,Programmatics ,Systems General Systems Risks Programmatics ,Long lead items
Systems,General Systems Risks,Programmatics ,Systems General Systems Risks Programmatics ,"Schedule constraints (arising from NASA AO, or other reasons why any phase is of unusual length) "
Systems,General Systems Risks,Programmatics ,Systems General Systems Risks Programmatics ,Highly constrained launch date (e.g. for certain planetary trajectories)
Systems,General Systems Risks,Technology Development and Heritage ,Systems General Systems Risks Technology Development and Heritage ,Low TRL /New Technology

*************

And samples of the team_x_risks_no_dups repository in CSV:

ID,study_title,risk_all
46,Titan-Aero,SOI burn failure
53,Titan-Aero,Keeping the temperature in range for the length of mission
55,Titan-Aero,Long lead items
58,Titan-Aero,"Solar Power Since the array is so huge, there could be a risk of losing part of the array, either to damage or some other malfunction. This could be drastic for the trajectory "
59,MER-Viking,Miss target deposit Targetting ellipse is about 2km major axis. Remote sensing spatial resolution may be ~cm. (Neutron spectrometer surface spatial resolution =~ altitude). May not be able to deliver sufficiently close to desired target
67,Heavy Payload to L1,architecture/geometry not flexible enough for all forms of payload.
72,Heavy Payload to L1,Cruise too short for anomaly resolution
76,Heavy Payload to L1,Uncertainty - might have missed System-level issues at Payload interface
80,Heavy Payload to L1,Availability of solar arrays Lightweight solar array technology for high power may not be vailable
83,Heavy Payload-FH,Uncertainty - estimates incorrect for new techno
90,Heavy Payload-FH,Cruise too short for anomaly resolution
94,Heavy Payload-FH,Uncertainty - might have missed System-level issues at Payload interface
102,Heavy-Payload-CRYO,architecture/geometry not flexible enough for all forms of payload.
107,Heavy-Payload-CRYO,Cruise too short for anomaly resolution
111,Heavy-Payload-CRYO,Uncertainty - might have missed System-level issues at Payload interface
113,Heavy-Payload-CRYO,No driver This is pilotless!
115,Heavy-Payload-CRYO,Availability of solar arrays Lightweight solar array technology for high power may not be vailable

land · April 2010

Hi,
if you are going to apply a model to new, unseen data, you MUST use the old word list, i.e. the one created from the training data. Otherwise the attributes won't be the same and the model can't work correctly.
But if you want to train a new model, you should create a new word list, since it is better fitted to the data and might better capture its properties.

Greetings,
Sebastian

dport · April 2010

Thanks for the response.

So I see, the word list output of ProcessDocuments from my unseen data example set I wish to apply the model on MUST be in the input for ProcessDocuments for the training data example set. I'm not sure what you mean by creating a new word list. Do you mean create a word list from ProcessDocuments for the training data example set then use this as input for the ProcessDocuments from my unseen data example set? If so, I thought that was my original question! Can you explain what you mean here that is different than this?

TobiasMalbrecht · April 2010

Hi,

you are right, Sebastian meant you should use the word list output in training as input for the Process Documents operator in the application phase.

Best regards,
Tobias
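
The point made in the answers above, that the word list built from the training data fixes the attributes and must be reused when vectorizing unseen data, can be sketched outside RapidMiner in plain Python. This is a minimal sketch under stated assumptions, not what the operators do internally: the function names (`build_word_list`, `vectorize`, `classify`) are hypothetical, raw term counts stand in for the operator's term weighting, k is fixed at 1, and the subclass column of the sample rows is used as the label for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lower-case runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_word_list(train_docs):
    # The word list is derived from the TRAINING documents only.
    return sorted({tok for doc in train_docs for tok in tokenize(doc)})


def vectorize(doc, word_list):
    # One attribute per word-list entry; terms missing from the word list
    # are dropped, so every vector has the same attributes as the model.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in word_list]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, train_vectors, train_labels, word_list):
    # 1-nearest-neighbour by cosine similarity (k=1 is an assumption).
    vec = vectorize(doc, word_list)
    sims = [cosine(vec, tv) for tv in train_vectors]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_labels[best]


# Training rows taken from the team_x_risk_cats sample in the thread.
train_docs = ["Long lead items", "Multiple sponsors", "Low TRL /New Technology"]
train_labels = ["Programmatics", "Organizational", "Technology Development and Heritage"]

word_list = build_word_list(train_docs)
train_vectors = [vectorize(d, word_list) for d in train_docs]

# Unseen rows from the team_x_risks_no_dups sample, vectorized with the
# training word list rather than a word list of their own.
for risk in ["Long lead items", "Uncertainty - estimates incorrect for new techno"]:
    print(risk, "->", classify(risk, train_vectors, train_labels, word_list))
```

Building a second word list from the unseen data would produce vectors with different attributes, and the distances computed by the model would no longer be meaningful.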
