RapidMiner

3 weeks ago

Suppose we have a number of files in different folders. When we bring them into RapidMiner for analysis, the folder name may matter as an input or simply as an identifying piece of information, so we want to read it.

 

I admit this is probably one of those things one never expects to have to do, and yet I had to do it for a customer. Learning to do it deepens one's skills and showcases the flexibility of RapidMiner.

 

If you follow this article carefully, you can use the attached process and repeat what we have done here. No data files are attached, because in every case they will be different and sit in different places in your file system. Let me explain what I am using and why:

 

- The Text Processing extension installed.

- Two empty text files called applezz.txt and orangezz.txt

- One folder in my Documents folder containing two folders named apple and orange which, in turn, contain their respective text files mentioned above. To make things clear:

        C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt

        C:\Users\KonstantinosBonikos\Documents\delete\orange\orangezz.txt

- The number 46. This number will be different for you; it is derived by counting the characters in the path that come before the name of the folder we want. In this case:

                                              Loop Folder Names Article 1.png
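If counting by hand feels error-prone, the number can also be checked outside RapidMiner. The following is a minimal Python sketch, assuming the example path from this article; the variable names are illustrative only and are not part of the attached process.

```python
# Example path from this article; replace it with one of your own file paths.
file_path = r"C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt"
folder = "apple"  # the folder name we want to recover

# Find the folder name between two backslashes, then step past the
# leading backslash. The result is the number of characters that come
# before the folder name in the path.
offset = file_path.index("\\" + folder + "\\") + 1
print(offset)  # 46
```

For a different path, swap in your own values for file_path and folder and use the printed number in place of 46.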

1. First, we place a Loop Files subprocess operator onto the Process area. Make sure to tick both the recursive and enable macros checkboxes, as shown below.

Loop Folder Names Article 2.png

Don't worry if enable parallel execution is not an option in the version of RapidMiner you are using; it is not important here.

Make sure to point the directory parameter to the place where your folders are saved.

 

2. Double-click the Loop Files subprocess operator and place a Read Document and a Process Documents operator inside it, both with default values, connecting them as normal (as in the screenshot below):

                                                     Loop Folder Names Article 3.png

The inside of the Process Documents operator is empty apart from a through connection:

 

 Loop Folder Names Article 4.png

Here we read the applezz.txt and orangezz.txt files as documents; by processing them, we import their path names as metadata.
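For readers who think more easily in code, the idea of steps 1 and 2 (walk the folders recursively and keep each file's path alongside it) can be sketched in Python. This is only an analogy under stated assumptions, not what RapidMiner runs internally: the function name collect_paths and the record keys file_name and parent_path are hypothetical, and metadata_path simply mirrors the attribute name used in this article.

```python
from pathlib import Path


def collect_paths(root):
    """Walk `root` recursively and return one record per file found.

    Hypothetical helper: it mimics a recursive file loop that records
    where each file lives, similar in spirit to the metadata_path
    attribute produced in this article.
    """
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            records.append({
                "file_name": path.name,            # e.g. applezz.txt
                "metadata_path": str(path),        # full path to the file
                "parent_path": str(path.parent),   # full path to its folder
            })
    return records
```

Calling collect_paths on the folder that contains apple and orange would return two records, one per text file, each carrying its full path.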

 

3. We now take the data that is produced, which looks like this:

Loop Folder Names Article 5.png

This is where the counting becomes important. We are going to create a couple of attributes next based on the metadata_path.

 

4. Connect the data output to a Generate Attributes operator and create the following attributes using formulas.

Loop Folder Names Article 6.png

                            Loop Folder Names Article 7.png

- The ClassName attribute is set to whatever the folder_name value is, using the expression %{folder_name}

Remember, folder_name was set as a macro by the Loop Files operator when we selected enable macros in step 1.

- The FolderName attribute is set by using cut(Nominal text, Numeric start, Numeric length).

   - Nominal text is the text to cut from, here the value of the %{folder_name} macro.

   - Numeric start is the position at which the folder name starts in the path name; in my case, this was position 46.

   - Numeric length is the number of characters to take. Because this varies with the folder name and has to be a number, we take the total length of the value and subtract the position at which the name we want starts: length(%{folder_name})-46.
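To see why the arithmetic works, here is a small worked example in Python. It is a sketch under two assumptions that are not shown in the text above: that %{folder_name} holds the full path to the folder (which is what makes subtracting 46 from its length meaningful), and that cut counts positions from zero. The cut function below is a stand-in written for illustration, not RapidMiner's implementation.

```python
from pathlib import PureWindowsPath


def cut(text, start, length):
    """Return `length` characters of `text`, starting at zero-based `start`.

    Illustrative stand-in for the cut(text, start, length) expression.
    """
    return text[start:start + length]


# Assumed value of %{folder_name} for the first example file.
folder_path = r"C:\Users\KonstantinosBonikos\Documents\delete\apple"
start = 46  # characters before the folder name, counted earlier

# The path is 51 characters long, so 51 - 46 = 5 characters are kept.
folder_name = cut(folder_path, start, len(folder_path) - start)
print(folder_name)  # apple

# Outside RapidMiner, the last path component can be read directly,
# which avoids the hard-coded 46 altogether.
print(PureWindowsPath(folder_path).name)  # apple
```

The same calculation with the orange folder gives a length of 52 - 46 = 6, so the fixed start position works for both folders as long as they share the same parent directory.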

 

5. Run the process and we get the following results:

                                                Loop Folder Names Article 8.png

Loop Folder Names Article 9.png

Loop Folder Names Article 10.png

As you can see, this gives us the folder names as data.

 

Feel free to download the attached process as an .rmp file. It can be imported via File > Import Process.