Filter: 1) extract numeric information from text column 2) select attributes subset based on a table
I'm fairly new to RapidMiner and hope someone in the community can help me with my problem. I already have experience with other ETL and data management tools, but I have not found a way to tackle this correctly within RapidMiner.
I have two questions. The first is more important; the second is nice to know.
1) extract numeric information with a pattern in text column
I am trying to solve a data preparation task in which I have an attribute of type text containing sentences and descriptions in German, along with numeric values that are also of interest. I have already worked through some tutorials, searched the community for possible solutions, and experimented with operators and their parameters, including RegEx logic.
The use case requires duplicating the dataset: one copy is used to extract the text information, the other to extract the values contained in the text column, and the two are matched together at the end. Processing the text is no problem with the Text Processing Extension (Transform Cases, Tokenize, Filter Stopwords, Stem), but extracting the numerical data contained in the same column in a separate process is what frustrates me.
Expressed as a RegEx, the pattern I am looking for is ([€0-9.;,\- ]+[-€]). In the numeric extraction stream, all other information can be removed from the text in the column.
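A quick way to see what this pattern actually matches is to run it against a few sample strings. The sketch below uses Python's `re` module purely for illustration (RapidMiner itself uses Java regular expressions, which should behave the same way for a pattern this simple, but that is an assumption). The helper name `first_match` and the sample strings are made up for this example.

```python
import re

# The pattern exactly as written in the post.
PATTERN = re.compile(r"([€0-9.;,\- ]+[-€])")


def first_match(text):
    """Return the first substring matched by PATTERN, or None if nothing matches."""
    found = PATTERN.search(text)
    return found.group(1) if found else None


# Illustrative samples: three amount formats plus a plain sentence dash.
samples = ["50.000,-- €", "500€", "€500", "Haus - Garten"]
for sample in samples:
    print(repr(sample), "->", repr(first_match(sample)))
```

Because the pattern has to end in `-` or `€`, the amounts with a trailing € match in full, while `€500` is not matched at all. The ordinary dash in `Haus - Garten` is matched as ` -`, so the pattern also picks up punctuation that is not part of an amount.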
What I have tried so far:
- Process Documents from Data: I had a problem with the "€" character, so I tried replacing it with "E" after transforming the cases [A-Z] to [a-z], so that the only "E" left in the data is the currency marker. I tried this both inside and outside Process Documents, but I was not able to get the expected results; via Process Documents the output was empty in most cases (no results).
- replaceAll function with RegEx (Replace, Generate Attributes, Tokenize): I tried to exclude many character combinations, but in the end this became too complex, so I focused on the inverse, which can be described by ([€0-9.;,\- ]+[-€]).
Is it possible to get the inverse of a RegEx replacement, so that instead of replacing the matched pattern, the matched values are kept as the result set? Preferably in a new column, so that I can compare and check it against the text. Or do you know a way to easily extract numeric values following a pattern (€ sign before or after, with some special characters, e.g. 50.000,-- € , €500, 500€)? I want to create some new metrics based on this information, so I am currently stuck.
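To make the desired behaviour concrete, here is a minimal sketch in Python of "keep the match instead of replacing it". It is not a RapidMiner process. The regular expression, the function names `to_float` and `extract_amounts`, and the assumption that amounts use German formatting (`.` as thousands separator, `,` as decimal separator, `,--` or `,-` for "no cents") are all illustrative choices, not something taken from a RapidMiner operator.

```python
import re

# Assumed German number format: digits, optional ".ddd" thousands groups,
# optional decimal part ",dd" or the ",-" / ",--" placeholder.
NUMBER = r"\d+(?:\.\d{3})*(?:,(?:\d{1,2}|-{1,2}))?"

# An amount counts only if a € sign stands directly before or after the number.
AMOUNT = re.compile(r"€\s*(" + NUMBER + r")|(" + NUMBER + r")\s*€")


def to_float(raw):
    """Convert a German-formatted amount such as '50.000,--' to a float."""
    cleaned = raw.replace(".", "").replace(",", ".")
    cleaned = cleaned.rstrip("-").rstrip(".")
    return float(cleaned)


def extract_amounts(text):
    """Return all euro amounts found in text, in order of appearance."""
    amounts = []
    for match in AMOUNT.finditer(text):
        raw = match.group(1) or match.group(2)  # whichever alternative matched
        amounts.append(to_float(raw))
    return amounts


# Hypothetical description text containing all three formats from the post.
description = "Preis 50.000,-- € oder €500 bzw. 500€, Baujahr 1998"
print(extract_amounts(description))
```

The extracted list could then be stored in a new column next to the original text. The year `1998` is ignored because no € sign is attached to it, which is the main difference from a character-class pattern that accepts any run of digits and punctuation.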
2) Select attributes based on another table (e.g. by extending join condition)
I have a data table with 82 attributes that I want to reduce. I do not want to change the imported table itself, because whenever I reload a different sample I would have to manipulate every sample set again.
Instead of selecting a subset manually (Retrieve -> Select Attributes), it would be great to be able to script the relevant features for future cases. I therefore thought about creating a second table that contains only the feature names (no observations, just master data) together with a flag, where only the attributes with flag value 1 are relevant.
Is it possible to define a subset based on a table input? I tried it with the weighted operator but was not really successful.
I hope you can understand my challenges and give me some hints based on your RapidMiner experience.
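The logic I have in mind can be sketched in a few lines of plain Python. This only shows the intended behaviour and is not a RapidMiner solution. The column names `attribute` and `flag`, the function name `select_flagged`, and the sample data are hypothetical.

```python
def select_flagged(rows, flag_table):
    """Keep only the attributes whose flag is 1 in the master-data table.

    rows:       list of dicts, one dict per observation
    flag_table: list of dicts with the (assumed) keys 'attribute' and 'flag'
    """
    keep = [entry["attribute"] for entry in flag_table if entry["flag"] == 1]
    # Attributes listed in the flag table but missing from a row are skipped.
    return [{name: row[name] for name in keep if name in row} for row in rows]


# Hypothetical master-data table: one row per attribute, no observations.
flags = [
    {"attribute": "id", "flag": 1},
    {"attribute": "price", "flag": 0},
    {"attribute": "description", "flag": 1},
]

# Hypothetical sample; any reloaded sample could be passed in the same way.
sample = [
    {"id": 1, "price": 500.0, "description": "Haus mit Garten"},
    {"id": 2, "price": 750.0, "description": "Wohnung"},
]

print(select_flagged(sample, flags))
```

The point of the design is that the flag table is maintained once and reused for every sample, so the imported data never has to be edited by hand.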