[SOLVED] Importing data from a text file

CarlyWarly Member Posts: 4 Contributor I
edited November 2018 in Help
Hi all,
I wonder if someone could give me some advice? I want to import data from a text file based on pattern/text matching. For example, given a text file like the one below, I want to extract the path after "Directory of", plus the file count before "File(s)" and the size before "bytes".

So, based on the text file below, I would have three records:
Path                         Files  Size
C:\Windows\addins            1      802
C:\Windows\assembly          14     5,582,046
C:\Windows\AppPatch\en-US    1      292,352
Any help or hints would be greatly appreciated :)

Carl




Directory of C:\Windows\addins

14/07/2009  06:32    <DIR>          .
14/07/2009  06:32    <DIR>          ..
10/06/2009  22:20               802 FXSEXT.ecf
              1 File(s)            802 bytes

Directory of C:\Windows\assembly

12/05/2012  15:24    <DIR>          .
12/05/2012  15:24    <DIR>          ..
10/06/2009  21:39            66,728 big5.nlp
10/06/2009  21:39            82,172 bopomofo.nlp
10/06/2009  21:39           116,756 ksc.nlp
04/01/2012  04:34         4,567,040 mscorlib.dll
10/06/2009  21:40            59,342 normidna.nlp
10/06/2009  21:40            45,794 normnfc.nlp
10/06/2009  21:40            39,284 normnfd.nlp
10/06/2009  21:40            66,384 normnfkc.nlp
10/06/2009  21:40            60,294 normnfkd.nlp
10/06/2009  21:40            83,748 prc.nlp
10/06/2009  21:40            83,748 prcp.nlp
10/06/2009  21:40           262,148 sortkey.nlp
10/06/2009  21:40            20,320 sorttbls.nlp
10/06/2009  21:40            28,288 xjis.nlp
             14 File(s)      5,582,046 bytes

Directory of C:\Windows\AppPatch\en-US

16/04/2011  03:24    <DIR>          .
16/04/2011  03:24    <DIR>          ..
20/11/2010  13:02           292,352 AcRes.dll.mui
              1 File(s)        292,352 bytes

Answers

  • CarlyWarly Member Posts: 4 Contributor I
    Hi all,
    Almost there :)  I can get it to process individual files and extract the information, but not a single file containing multiple entries :(

    Below are the working code and sample files. Any help or hints would be greatly appreciated. Cheers,

    Carl


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <parameter key="logverbosity" value="all"/>
       <process expanded="true" height="100" width="145">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
           <list key="text_directories">
             <parameter key="Folder" value="D:\RapidMiner\New folder"/>          
           </list>
           <parameter key="file_pattern" value="*.txt"/>
           <parameter key="extract_text_only" value="false"/>
           <parameter key="create_word_vector" value="false"/>
           <parameter key="keep_text" value="true"/>
           <process expanded="true" height="719" width="1022">
             <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="447" y="30">
               <parameter key="query_type" value="Regular Expression"/>
               <list key="string_machting_queries"/>
               <list key="regular_expression_queries">
                 <parameter key="Path" value="Directory of ([A-Za-z0-9:\\]*)"/>
                 <parameter key="Files" value="([0-9]*) File\(s\)"/>
                 <parameter key="Size" value="([0-9]*) bytes"/>
               </list>
               <list key="regular_region_queries"/>
               <list key="xpath_queries"/>
               <list key="namespaces"/>
               <list key="index_queries"/>
             </operator>
             <connect from_port="document" to_op="Extract Information" to_port="document"/>
             <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
         <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>


    File1.txt



    Directory of C:\Windows\addins

    14/07/2009  06:32    <DIR>          .
    14/07/2009  06:32    <DIR>          ..
    10/06/2009  22:20               802 FXSEXT.ecf
                  1 File(s)            802 bytes



    File2.txt



    Directory of C:\Windows\assembly

    12/05/2012  15:24    <DIR>          .
    12/05/2012  15:24    <DIR>          ..
    10/06/2009  21:39            66,728 big5.nlp
    10/06/2009  21:39            82,172 bopomofo.nlp
    10/06/2009  21:39           116,756 ksc.nlp
    04/01/2012  04:34         4,567,040 mscorlib.dll
    10/06/2009  21:40            59,342 normidna.nlp
    10/06/2009  21:40            45,794 normnfc.nlp
    10/06/2009  21:40            39,284 normnfd.nlp
    10/06/2009  21:40            66,384 normnfkc.nlp
    10/06/2009  21:40            60,294 normnfkd.nlp
    10/06/2009  21:40            83,748 prc.nlp
    10/06/2009  21:40            83,748 prcp.nlp
    10/06/2009  21:40           262,148 sortkey.nlp
    10/06/2009  21:40            20,320 sorttbls.nlp
    10/06/2009  21:40            28,288 xjis.nlp
                 14 File(s)      5,582,046 bytes



    File3.txt



    Directory of C:\Windows\AppPatch\en-US

    16/04/2011  03:24    <DIR>          .
    16/04/2011  03:24    <DIR>          ..
    20/11/2010  13:02           292,352 AcRes.dll.mui
                  1 File(s)        292,352 bytes
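One plausible reading of the symptom above is that each query yields a single value per document, so a file with several entries only gives up its first match. The sketch below is plain Python rather than RapidMiner, and the names `LISTING` and `PATH_RE` are made up for illustration. It reuses the Path pattern from the process and contrasts a single search with a scan over the whole text.

```python
import re

# Header and summary lines copied from the thread; file lines omitted for brevity.
LISTING = r"""
 Directory of C:\Windows\addins
               1 File(s)            802 bytes

 Directory of C:\Windows\assembly
              14 File(s)      5,582,046 bytes
"""

# Same Path pattern as in the process XML.
PATH_RE = re.compile(r"Directory of ([A-Za-z0-9:\\]*)")

# A single search stops at the first hit.
first_path = PATH_RE.search(LISTING).group(1)

# Scanning the whole text finds every entry.
all_paths = PATH_RE.findall(LISTING)

print(first_path)
print(all_paths)
```

Whether the operator really behaves like a single search is an assumption here; the point is only that one value per document cannot represent several entries.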
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey Carly,

    The Cut Document operator can probably give you the final boost to accomplish your task.

    Best, Marius
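The idea behind cutting a document can be sketched outside RapidMiner: split the text into one segment per directory, where each segment runs from a "Directory of" header to the next "bytes" summary line. The Python below is only an approximation of that idea, not the operator's implementation, and `REGION_RE` and `cut_regions` are hypothetical names.

```python
import re

# Two entries copied from the thread's sample listing.
TEXT = r"""
 Directory of C:\Windows\addins

10/06/2009  22:20               802 FXSEXT.ecf
               1 File(s)            802 bytes

 Directory of C:\Windows\AppPatch\en-US

20/11/2010  13:02           292,352 AcRes.dll.mui
               1 File(s)        292,352 bytes
"""

# Start at a "Directory of X:\" header, then lazily consume text (newlines
# included) up to the first "<digits> bytes" that follows it.
REGION_RE = re.compile(r"Directory of [A-Z]:\\.*?[0-9,]+ bytes", re.DOTALL)


def cut_regions(text):
    """Return one string per directory block found in text."""
    return [match.group(0) for match in REGION_RE.finditer(text)]


segments = cut_regions(TEXT)
for segment in segments:
    print(segment)
    print("-----")
```

Each segment then holds exactly one path, one file count, and one size, so a one-value-per-document extraction is enough from there on.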
  • CarlyWarly Member Posts: 4 Contributor I
    Hi Marius,
    Thanks for the hint. I have managed to split the main file into chunks, and for each chunk I can get the three fields I need.  However, the output is an IOObjectCollection containing documents.

    Any advice on how to convert/extract the values path, files, and size into a nice table?

    regards,
    Carl

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logverbosity" value="status"/>
        <process expanded="true" height="386" width="882">
          <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="84" y="39">
            <parameter key="file" value="D:\RapidMiner\New folder\import.txt"/>
            <parameter key="extract_text_only" value="false"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="246" y="75">
            <parameter key="query_type" value="Regular Region"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Binominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries">
              <parameter key="Directory" value=" Directory of [A-Z]:\\\\.([0-9]*) bytes"/>
            </list>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="750" width="1022">
              <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information (3)" width="90" x="241" y="81">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="Path" value="Directory of ([A-Za-z0-9:\\]*)"/>
                  <parameter key="Files" value="([0-9]*) File\(s\)"/>
                  <parameter key="Size" value="([0-9,]*) bytes"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Carl,

    Try moving the Extract Information operator into a Process Documents operator of its own, as in the process below.

    Best,
      Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
        <parameter key="logverbosity" value="status"/>
        <process expanded="true" height="403" width="413">
          <operator activated="false" class="text:read_document" compatibility="5.2.005" expanded="true" height="60" name="Read Document" width="90" x="45" y="165">
            <parameter key="file" value="D:\RapidMiner\New folder\import.txt"/>
            <parameter key="extract_text_only" value="false"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="14" y="32">
            <parameter key="text" value=" Directory of C:\Windows\addins&#10;&#10;14/07/2009  06:32    &lt;DIR&gt;          .&#10;14/07/2009  06:32    &lt;DIR&gt;          ..&#10;10/06/2009  22:20              802 FXSEXT.ecf&#10;              1 File(s)            802 bytes&#10;&#10; Directory of C:\Windows\assembly&#10;&#10;12/05/2012  15:24    &lt;DIR&gt;          .&#10;12/05/2012  15:24    &lt;DIR&gt;          ..&#10;10/06/2009  21:39            66,728 big5.nlp&#10;10/06/2009  21:39            82,172 bopomofo.nlp&#10;10/06/2009  21:39          116,756 ksc.nlp&#10;04/01/2012  04:34        4,567,040 mscorlib.dll&#10;10/06/2009  21:40            59,342 normidna.nlp&#10;10/06/2009  21:40            45,794 normnfc.nlp&#10;10/06/2009  21:40            39,284 normnfd.nlp&#10;10/06/2009  21:40            66,384 normnfkc.nlp&#10;10/06/2009  21:40            60,294 normnfkd.nlp&#10;10/06/2009  21:40            83,748 prc.nlp&#10;10/06/2009  21:40            83,748 prcp.nlp&#10;10/06/2009  21:40          262,148 sortkey.nlp&#10;10/06/2009  21:40            20,320 sorttbls.nlp&#10;10/06/2009  21:40            28,288 xjis.nlp&#10;              14 File(s)      5,582,046 bytes&#10;&#10; Directory of C:\Windows\AppPatch\en-US&#10;&#10;16/04/2011  03:24    &lt;DIR&gt;          .&#10;16/04/2011  03:24    &lt;DIR&gt;          ..&#10;20/11/2010  13:02          292,352 AcRes.dll.mui&#10;              1 File(s)        292,352 bytes"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.2.005" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
            <parameter key="query_type" value="Regular Region"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Binominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries">
              <parameter key="Directory" value=" Directory of [A-Z]:\\\\.([0-9]*) bytes"/>
            </list>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="403" width="299">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
            <process expanded="true" height="421" width="778">
              <operator activated="true" class="text:extract_information" compatibility="5.2.005" expanded="true" height="60" name="Extract Information (3)" width="90" x="246" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="Path" value="Directory of ([A-Za-z0-9:\\]*)"/>
                  <parameter key="Files" value="([0-9]*) File\(s\)"/>
                  <parameter key="Size" value="([0-9,]*) bytes"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information (3)" to_port="document"/>
              <connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
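As a cross-check of the expected table, here is a minimal Python sketch (not part of the RapidMiner process; `RECORD_RE` and `rows` are hypothetical names) that turns the sample listing into one row per directory. Two assumptions differ from the patterns in the process: the path is captured as the rest of the header line, because the character class `[A-Za-z0-9:\\]` as written contains no hyphen and would stop at `en` in `en-US`, and Files and Size are converted to integers with the thousands separators removed. The 14 file lines of the assembly directory are left out to keep the sample short.

```python
import re

LISTING = r"""
 Directory of C:\Windows\addins

10/06/2009  22:20               802 FXSEXT.ecf
               1 File(s)            802 bytes

 Directory of C:\Windows\assembly

              14 File(s)      5,582,046 bytes

 Directory of C:\Windows\AppPatch\en-US

20/11/2010  13:02           292,352 AcRes.dll.mui
               1 File(s)        292,352 bytes
"""

# One match per directory block: header line, then the summary line.
RECORD_RE = re.compile(
    r"Directory of (?P<path>[^\r\n]+)"   # path: everything up to the line end
    r".*?"                               # skip the individual file lines
    r"(?P<files>[0-9]+) File\(s\)\s+"    # file count
    r"(?P<size>[0-9,]+) bytes",          # total size with thousands separators
    re.DOTALL,
)

rows = [
    {
        "Path": match.group("path").strip(),
        "Files": int(match.group("files")),
        "Size": int(match.group("size").replace(",", "")),
    }
    for match in RECORD_RE.finditer(LISTING)
]

# Print a simple fixed-width table.
print(f"{'Path':<28}{'Files':>6}{'Size':>12}")
for row in rows:
    print(f"{row['Path']:<28}{row['Files']:>6}{row['Size']:>12}")
```

If the example set produced in RapidMiner shows a truncated path for the `en-US` entry, adding `-` to the character class of the Path query would be the corresponding change there.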
  • CarlyWarly Member Posts: 4 Contributor I
    All I can say is thanks and solved :)

    Carl