convert document files to transaction dataset


I am new to text mining and RapidMiner. I want to prepare a dataset to build a model with my algorithm. The dataset should contain one row per text document, and each row should consist of the words contained in that document (separated by commas). The words should also go through the preprocessing steps: tokenization, stop word removal, stemming, and n-grams.

Please help me

Thank you

Re: convert document files to transaction dataset

Typically you do not use a data structure for text mining where the terms are stored as comma-separated strings. Instead, you create word vectors with one attribute for every word: every document becomes a row (vector), and the value of each attribute (word) depends on the vector creation method (usually you want to use TF-IDF).
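To make the word-vector idea concrete, here is a minimal pure-Python sketch (the document texts mirror the ones in the example process; the whitespace tokenizer and hand-picked stop list are simplifications for illustration, not RapidMiner's actual implementation):

```python
import math
from collections import Counter

# Two toy documents, matching the example process.
docs = {
    "text1": "This is a book on data mining",
    "text2": "This book describes data mining and text mining using RapidMiner",
}

# Minimal hand-picked stop list (a real stopword filter is much larger).
STOPWORDS = {"this", "is", "a", "on", "and", "using"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

tokens = {name: tokenize(text) for name, text in docs.items()}
vocab = sorted({t for ts in tokens.values() for t in ts})
n_docs = len(docs)

def tf_idf(term, doc_tokens):
    """Term frequency times inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for ts in tokens.values() if term in ts)
    idf = math.log(n_docs / df)
    return tf * idf

# One row (vector) per document, one column per word in the vocabulary.
vectors = {name: [tf_idf(w, ts) for w in vocab] for name, ts in tokens.items()}
for name, vec in vectors.items():
    print(name, [round(v, 3) for v in vec])
```

Note that with only two documents, every term occurring in both gets an IDF of zero; distinctive terms such as "rapidminer" get a positive weight. With a realistic corpus the vectors become much more informative.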

Here is an example process with two hard-coded documents (use "Process Documents from Files" to read from a set of files instead). Inside the "Process Documents" operator you will find a "Tokenize" and a "Filter Stopwords (English)" operator; for the stemming and n-gram steps you can additionally add the "Stem (Porter)" and "Generate n-Grams (Terms)" operators there.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
  <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="This is a book on data mining"/>
        <parameter key="label_value" value="text1"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="120">
        <parameter key="text" value="This book describes data mining and text mining using RapidMiner"/>
        <parameter key="label_value" value="text2"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="30">
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The resulting example set can be used to learn models like any other numerical data set. In text mining, for example, it is common to use an SVM for classification.