Parsing attributes

adamf Member Posts: 34 Contributor I
edited December 2018 in Help



Perhaps this is a simple question with a simple answer.


I am building a predictive model.  As input I have several attributes, two of which are actually lists of words.  For example, one attribute is called "keywords", and it contains a variable number of key terms.  


I'm wondering if this attribute, which is really a list of terms, is being treated as a single text string/blob, rather than being parsed into individual words/tokens.  RapidMiner's Auto Model suggests that this attribute is NOT helpful to the predictive modeling process, but I think that is because it is treating this attribute - which is actually a list of terms - as a single text string.


Thus, my questions are:


1) I assume that most/all models will treat a field like this quite differently depending on whether it is presented as a single text string or as a list of individual keywords?


2) I don't know how to parse/tokenize this attribute so that what the model sees is a list of individual keywords rather than a single text string/blob.


Thanks in advance for any assistance or clarification.


- Adam



  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 363 RM Data Scientist

    Hi @adamf, have you tried text processing?


    You can leverage the term frequencies from TF-IDF for the predictive model.





  • adamf Member Posts: 34 Contributor I

    Hello YY,


    I am familiar with the text processing techniques that are described in your linked PDF file.  However, I don't think that fully answers the question.


    The text fields/attributes in question add information about each item/row in the example set.  For example, one of the text field columns contains a list of "categories" (classifications) into which each example in the example set falls.  Based on the class labels of my training data, it appears that many of the examples labeled "Fraudulent" (vs. "Legitimate") mention "Extreme Graphic/Explicit Language" in the Categories column.  However, additional categories may also appear in an example's Categories list, such as "Non-Standard Content".  So the field is a list of one or more categories and may look like this: "Extreme Graphic/Explicit Language Non-Standard Content".


    Thus, my question is multi-part:


    1) My hypothesis is that a predictive model might take advantage of this "Categories" column by, for example, realizing that many examples that have "Extreme Graphic/Explicit Language" mentioned in the Categories column have class label of "Fraudulent". 

    2) However, since the Categories column is currently a concatenation of one or more categories, I am not sure that the data is parsed and processed as I intended.

    3) I am also not sure which (if any) predictive models can take advantage of textual attributes such as my "Categories" attribute.





  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 363 RM Data Scientist

    If the categories in the text column are clean and separated by some delimiter, you can use "Split" to parse them into separate columns, one per category. Otherwise, you can still manually define binary codes (1/0, true/false) for each separate category.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="generate_data_user_specification" compatibility="8.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="34">
        <list key="attribute_values">
          <parameter key="categories" value="&quot;Extreme Graphic/Explicit Language Non-Standard Content&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
    </process>

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
        <list key="function_descriptions">
          <parameter key="CAT1" value="finds(categories,&quot;Extreme Graphic&quot;)"/>
          <parameter key="CAT2" value="finds(categories,&quot;Explicit Language&quot;)"/>
          <parameter key="CAT3" value="finds(categories,&quot;Non-Standard Content&quot;)"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
    </process>
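
For readers who want to see the binary-flag idea outside RapidMiner, here is a minimal Python sketch. It is only an illustration, not part of the process above: the function name `category_flags` and the `KNOWN_CATEGORIES` list are assumptions, and the two phrases are taken from the example row described earlier in this thread. Because category names contain spaces, the sketch searches for known phrases instead of splitting on whitespace.

```python
import re

# Hypothetical list of known category phrases (taken from the example row in this thread).
KNOWN_CATEGORIES = [
    "Extreme Graphic/Explicit Language",
    "Non-Standard Content",
]


def category_flags(text, categories=KNOWN_CATEGORIES):
    """Return {category: 0 or 1}, where 1 means the phrase occurs in the text."""
    result = {}
    for category in categories:
        # re.escape treats the phrase as literal text, so characters such as
        # "/" or "-" are never interpreted as regular-expression syntax.
        found = re.search(re.escape(category), text, flags=re.IGNORECASE) is not None
        result[category] = int(found)
    return result


row = "Extreme Graphic/Explicit Language Non-Standard Content"
print(category_flags(row))
```

Each key of the returned dictionary can become its own 0/1 attribute, which is the same shape of data that the CAT1/CAT2/CAT3 attributes give you in RapidMiner.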


  • adamf Member Posts: 34 Contributor I

    After doing some reading/researching, I see that in order to be interpreted by most/all predictive models, I will need to convert/map my textual attributes into numeric values, possibly using a mapping function for my Categories attribute and some other function (word2vec?) for the Keywords column.  Please let me know if you have specific suggestions or recommendations.




  • adamf Member Posts: 34 Contributor I

    Thank you.  Your suggestions for the Categories field conversion/mapping are very helpful.


    I have one other textual attribute that is called Keywords.  It consists of a variable number of keywords (as calculated by an NLTK method).  Is there a function (word2vec?) that would be appropriate to convert each keyword list into a "numeric" value, or do I need to separate the list into individual words first and then think about converting each?


    - Adam


  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 363 RM Data Scientist

    Word2vec is available from Marketplace.

    But I do not think word2vec is necessary. TF-IDF may be enough for phrase recognition. Just define a list of strings for the target categories, then use it as the word list input of Process Documents to compute TF-IDF.

    The main value in the unstructured text data comes from the term frequencies of the keywords/phrases associated with each category.
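
To make the word-list idea concrete, here is a rough Python sketch of TF-IDF restricted to a fixed list of phrases. It is an assumption-laden illustration, not RapidMiner's implementation: it uses raw phrase counts as term frequency and `log(N / df)` as inverse document frequency, while the Process Documents operator may normalize or weight differently. The function name `tfidf_for_wordlist` and the sample rows are hypothetical.

```python
import math


def tfidf_for_wordlist(documents, wordlist):
    """Return one TF-IDF row per document, with one column per word-list phrase.

    Assumed weighting: raw count * log(N / document frequency).
    """
    docs = [doc.lower() for doc in documents]
    terms = [term.lower() for term in wordlist]
    n_docs = len(docs)

    # Term frequency: how often each phrase occurs in each document.
    counts = [[doc.count(term) for term in terms] for doc in docs]

    # Document frequency: in how many documents each phrase occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

    # Phrases that never occur get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]

    return [[count * weight for count, weight in zip(row, idf)] for row in counts]


# Hypothetical rows of a "Categories" column and the target phrases.
rows = [
    "Extreme Graphic/Explicit Language Non-Standard Content",
    "Non-Standard Content",
    "Extreme Graphic/Explicit Language",
]
phrases = ["Extreme Graphic/Explicit Language", "Non-Standard Content"]
weights = tfidf_for_wordlist(rows, phrases)
for row in weights:
    print(row)
```

Each phrase becomes a numeric column, so the result can sit next to the other attributes of the example set. One caveat of this particular formula: a phrase that appears in every row gets a weight of zero, so plain binary flags may be the better choice for very common categories.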

  • adamf Member Posts: 34 Contributor I

    Hi @yyhuang,


    Would you please provide a short RM process/example?  I'm still unclear about how TF-IDF helps in this scenario.  I've used TF-IDF primarily for identifying important terms across a corpus of documents.  I'm also uncertain how to combine/include the output of the TF-IDF operator with the other attributes that will be input into the model for training/predicting.





  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 363 RM Data Scientist

    Hi Adam @adamf,


    Please refer to the process here for predicting the category of on-sale items with text mining.


    My input data has text descriptions of the purchased items (attached is an example input), and also some meta-attributes for the channel and merchant names. Of course you can create a customized word list and use it as the input for text processing (word list input).





    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve 10 sample with other meta attributes" width="90" x="45" y="34">
            <parameter key="repository_entry" value="10 sample with other meta attributes"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Training Documents" width="90" x="246" y="34">
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
              <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="581" y="34"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">the 1st input can be linked to a wordlist for target phrases</description>
          </operator>
          <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="124" name="Model and Evaluate" width="90" x="447" y="34">
            <process expanded="true">
              <operator activated="true" class="nominal_to_numerical" compatibility="8.2.000" expanded="true" height="103" name="Nominal to Numerical" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="channel|merchantfamilydesc"/>
                <list key="comparison_groups"/>
              </operator>
              <operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply (2)" width="90" x="212" y="39"/>
              <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
                <parameter key="number_of_folds" value="3"/>
                <process expanded="true">
                  <operator activated="true" class="support_vector_machine_libsvm" compatibility="8.2.000" expanded="true" height="82" name="SVM" width="90" x="179" y="34">
                    <parameter key="kernel_type" value="linear"/>
                    <parameter key="C" value="3.0"/>
                    <parameter key="epsilon" value="0.0010"/>
                    <list key="class_weights"/>
                    <description align="center" color="transparent" colored="false" width="126">LibSVM do a great job on the bag of words</description>
                  </operator>
                  <operator activated="false" class="h2o:generalized_linear_model" compatibility="7.5.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="45" y="187">
                    <parameter key="family" value="multinomial"/>
                    <parameter key="solver" value="IRLSM"/>
                    <list key="beta_constraints"/>
                    <list key="expert_parameters"/>
                  </operator>
                  <connect from_port="training set" to_op="SVM" to_port="training set"/>
                  <connect from_op="SVM" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance_classification" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
                    <parameter key="classification_error" value="true"/>
                    <parameter key="spearman_rho" value="true"/>
                    <parameter key="kendall_tau" value="true"/>
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="relative_error" value="true"/>
                    <parameter key="correlation" value="true"/>
                    <list key="class_weights"/>
                  </operator>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_test set results" spacing="0"/>
                  <portSpacing port="sink_performance 1" spacing="0"/>
                  <portSpacing port="sink_performance 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="performance_to_data" compatibility="8.2.000" expanded="true" height="82" name="Performance to Data" width="90" x="514" y="187"/>
              <operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="85">
                <parameter key="macro" value="Accuracy"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="Value"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="generate_macro" compatibility="6.0.002" expanded="true" height="82" name="Generate Macro" width="90" x="782" y="85">
                <list key="function_descriptions">
                  <parameter key="Accuracy" value="round(%{Accuracy}*100)"/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="Nominal to Numerical" to_port="example set input"/>
              <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
              <connect from_op="Multiply (2)" from_port="output 1" to_op="Cross Validation" to_port="example set"/>
              <connect from_op="Multiply (2)" from_port="output 2" to_port="out 2"/>
              <connect from_op="Cross Validation" from_port="model" to_port="out 1"/>
              <connect from_op="Cross Validation" from_port="performance 1" to_op="Performance to Data" to_port="performance vector"/>
              <connect from_op="Performance to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Performance to Data" from_port="performance vector" to_port="out 3"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="126"/>
              <portSpacing port="sink_out 4" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Build SVM for classify text based on the word vectors</description>
          </operator>
          <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="782" y="34">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve 10 sample with other meta attributes" from_port="output" to_op="Process Training Documents" to_port="example set"/>
          <connect from_op="Process Training Documents" from_port="example set" to_op="Model and Evaluate" to_port="in 1"/>
          <connect from_op="Model and Evaluate" from_port="out 1" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Model and Evaluate" from_port="out 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Model and Evaluate" from_port="out 3" to_port="result 2"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="105"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
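
On the question of how the text-derived columns are combined with the other attributes: in the process above, the word vectors and the meta-attributes end up in the same example set, and the nominal meta-attributes are converted to numbers before the SVM sees them. The Python sketch below shows that idea in a simplified form. It is only an illustration under assumptions: the helper names `one_hot` and `combine_features` are hypothetical, the attribute name `channel` comes from the process, and the values `web` and `store` are made up.

```python
def one_hot(values):
    """Dummy-code a nominal column: one 0/1 column per distinct value."""
    levels = sorted(set(values))
    encoded = [[int(value == level) for level in levels] for value in values]
    return levels, encoded


def combine_features(text_vectors, nominal_columns):
    """Append dummy-coded nominal columns to each row of text-derived weights.

    text_vectors: list of numeric rows (for example TF-IDF weights per phrase).
    nominal_columns: dict mapping attribute name -> list of nominal values, one per row.
    Returns the names of the added columns and the combined rows.
    """
    added_names = []
    rows = [list(vector) for vector in text_vectors]  # copy so the input stays unchanged
    for name, values in nominal_columns.items():
        levels, encoded = one_hot(values)
        added_names.extend(f"{name} = {level}" for level in levels)
        for row, extra in zip(rows, encoded):
            row.extend(extra)
    return added_names, rows


# Hypothetical input: two rows of text weights plus one nominal meta-attribute.
text_vectors = [[0.4, 0.0], [0.0, 0.4]]
meta = {"channel": ["web", "store"]}
names, features = combine_features(text_vectors, meta)
print(names)
print(features)
```

The combined rows are purely numeric, which is the form a linear SVM expects. The sketch does not scale or weight the two groups of columns against each other; whether that is needed depends on the data.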

