
"[SOLVED] Importing: text with fixed length attributes"

Ugo Member Posts: 20 Contributor II
edited June 2019 in Help
Hello,

I have looked through the import components and Googled but cannot find a way to
read in a simple text file with fixed length attributes. How can one do this?

Apologies if this is a "no brainer" but I could not find a simple direct way to do this.

TIA,
Hugo    

Answers

  • Skirzynski Member Posts: 164 Maven
    Hey,

    You probably need the "Read CSV" operator. This operator can read a structured data set from a text file. Use the import wizard to configure the operator correctly. For instance, it is important to specify the separator so the operator knows where each attribute value begins and ends.

    Best regards
      Marcin
  • Ugo Member Posts: 20 Contributor II
    Hi Marcin,

    I already had a look at the CSV reader but it requires the use of a delimiter.
    The file I have has no such delimiters. As an example, assume I have a line:

    AABBCCCC

    In this case I have 3 attributes with lengths 2, 2 and 4 respectively.
    The attribute values would be AA, BB and CCCC. Note that no separator
    exists.

    Right now I am preparing AWK scripts to deal with this, but I assumed
    RapidMiner could deal with this type of data easily.

    Thanks for the feedback.



  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    You could read the text file with the Read CSV operator, using the regular expression "\r\n" as the column separator so that each complete line is read into a single attribute.

    Then use the Generate Extract operator to split each line into the required components using regular expressions.

    Here's an example process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="679" width="841">
          <operator activated="true" class="read_csv" compatibility="5.3.000" expanded="true" height="60" name="Read CSV" width="90" x="112" y="75">
            <parameter key="csv_file" value="C:\logs\fixedwidth.txt"/>
            <parameter key="column_separators" value="&quot;\r\n&quot;"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="att1.true.polynominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="5.3.000" expanded="true" height="60" name="Generate Extract" width="90" x="246" y="75">
            <parameter key="source_attribute" value="att1"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="a1" value="(.{2})"/>
              <parameter key="a2" value="(?:.{2})(.{3})"/>
              <parameter key="a3" value="(?:.{5})(.{3})"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The text file contains this:
    AABBCCCC
    ABCDBBDS
    ABDBQBDD
    AASHHFGU
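    and the result looks like this (the example expressions extract fields of 2, 3 and 3 characters; adjust the quantifiers for other widths):
    a1  a2   a3
    AA  BBC  CCC
    AB  CDB  BDS
    AB  DBQ  BDD
    AA  SHH  FGU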
    regards

    Andrew
  • Ugo Member Posts: 20 Contributor II
    Hi Andrew,

    Exactly what I was looking for.

    Thank you.
    Hugo F.
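
For anyone who wants to check the field boundaries outside RapidMiner, here is a minimal Python sketch of the same idea. It is not part of the thread's process and does not use any RapidMiner API; the function names `split_fixed_width` and `split_with_regex` are made up for illustration. The first function slices each line by a list of widths, the second builds one regular expression with a capture group per field, similar in spirit to the Generate Extract queries above.

```python
import re


def split_fixed_width(line, widths):
    """Slice `line` into consecutive fields of the given widths (hypothetical helper)."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields


def split_with_regex(line, widths):
    """Same split via one regex with a capture group per field (hypothetical helper).

    Returns None when the line is shorter than the combined widths.
    """
    pattern = "".join(f"(.{{{w}}})" for w in widths)  # e.g. "(.{2})(.{3})(.{3})"
    match = re.match(pattern, line)
    return list(match.groups()) if match else None


if __name__ == "__main__":
    # Sample rows taken from the example text file in the thread.
    rows = ["AABBCCCC", "ABCDBBDS", "ABDBQBDD", "AASHHFGU"]
    for row in rows:
        # Widths 2, 3, 3 mirror the regular expressions in the example process.
        print(split_fixed_width(row, [2, 3, 3]))
    # Widths 2, 2, 4 match the layout described in the question.
    print(split_fixed_width("AABBCCCC", [2, 2, 4]))
```

With widths 2, 3 and 3 the rows split the same way as the result table in Andrew's answer; with widths 2, 2 and 4 the first row gives the AA, BB, CCCC layout from the question.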