Character encoding problem with AttributeFilter

I have the following problem (bug?). I want to do the following:

1. Load data with an ExcelExampleSource operator (the data is labeled, i.e. the first row contains the labels of the Excel columns).
2. Apply an AttributeFilter to the loaded data, filtering by attribute name.

The Excel input file is in German, so the column labels can contain German umlauts such as ä, ö, ü.
In the AttributeFilter operator I set the parameter "condition_class" to the value "attribute_name_filter". As the parameter string I use a regular expression containing umlauts, such as "Häuser|Bäume".
I therefore set the encoding to UTF-16 in the root operator:
<parameter key="encoding" value="UTF-16"/>

Since I work with the GUI version of RapidMiner, I then want to switch from the XML editor tab to the parameter editor tab. When I do, I receive the following error message:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. Cancel to ignore changes, Ok to go on editing.

As soon as I remove the Umlaute, everything works fine. It somehow seems to expect the regular expression to be UTF-8 whereas it really should be treated as UTF-16, but that's only a guess.
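To illustrate the guess above, here is a minimal Java sketch showing that the same pattern string is only valid UTF-8 when it was actually encoded as UTF-8. It does not reproduce RapidMiner's XML handling; the class name `UmlautDecodeDemo` and the helper `isValidUtf8` are made up for this example, and which bytes RapidMiner really hands to the parser is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UmlautDecodeDemo {

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A strict UTF-8 reader rejects these bytes, much like the XML parser does.
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the post, written with escapes so the source file stays ASCII.
        String pattern = "H\u00e4user|B\u00e4ume";

        System.out.println(isValidUtf8(pattern.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(pattern.getBytes(StandardCharsets.ISO_8859_1))); // false
        System.out.println(isValidUtf8(pattern.getBytes(StandardCharsets.UTF_16)));     // false
    }
}
```

If the text containing the umlauts is written in one encoding and read back as UTF-8, a failure of this kind is what one would expect.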

I can temporarily change the column labels in the input data file so that they do not use German umlauts, but in the long run that is not a real option. Any suggestions?

Re: Character encoding problem with AttributeFilter


We are already aware of that problem. Unfortunately, we have not yet found a solution for it. Sorry, but for now you will have to stick with the dirty workaround of renaming the attributes before loading the data into RapidMiner. We will keep trying to solve the problem, but I doubt we will be able to accomplish this in the short term.
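As a rough sketch of the renaming workaround, the labels could be transliterated to plain ASCII before the file is loaded. The class `UmlautLabels` and the method `toAsciiLabel` are hypothetical names for this example and not part of RapidMiner; reading and writing the Excel file itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UmlautLabels {

    // Common German transliterations; escapes keep this source file ASCII-only.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u00e4', "ae"); // a-umlaut
        REPLACEMENTS.put('\u00f6', "oe"); // o-umlaut
        REPLACEMENTS.put('\u00fc', "ue"); // u-umlaut
        REPLACEMENTS.put('\u00c4', "Ae"); // capital A-umlaut
        REPLACEMENTS.put('\u00d6', "Oe"); // capital O-umlaut
        REPLACEMENTS.put('\u00dc', "Ue"); // capital U-umlaut
        REPLACEMENTS.put('\u00df', "ss"); // sharp s
    }

    // Hypothetical helper: replaces each umlaut in a column label with its ASCII form.
    public static String toAsciiLabel(String label) {
        StringBuilder out = new StringBuilder(label.length());
        for (char c : label.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiLabel("H\u00e4user")); // Haeuser
        System.out.println(toAsciiLabel("B\u00e4ume"));  // Baeume
    }
}
```

The filter expression would then become "Haeuser|Baeume", which contains only ASCII characters.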

Re: Character encoding problem with AttributeFilter

Okay, thanks for the answer. As you said, there are several possible workarounds.
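One further workaround that might be worth trying is to keep the umlauts in the data but write the regular expression itself in pure ASCII, using Unicode escapes. The sketch below only shows that `java.util.regex` accepts such a pattern; whether the AttributeFilter passes its parameter string unchanged to `java.util.regex` is an assumption that has not been tested here, and the class `AsciiOnlyPattern` is a made-up name.

```java
import java.util.regex.Pattern;

public class AsciiOnlyPattern {

    // The regex text contains only ASCII characters: a backslash, the letter u
    // and four hex digits stand in for the umlaut and are resolved by the regex engine.
    public static final String ASCII_PATTERN = "H\\u00e4user|B\\u00e4ume";

    // Hypothetical helper: true if the whole label matches the pattern.
    public static boolean matchesLabel(String label) {
        return Pattern.matches(ASCII_PATTERN, label);
    }

    public static void main(String[] args) {
        System.out.println(matchesLabel("H\u00e4user")); // true
        System.out.println(matchesLabel("B\u00e4ume"));  // true
        System.out.println(matchesLabel("Hauser"));      // false
    }
}
```

Because the pattern string then contains no non-ASCII bytes, it reads the same in UTF-8 and in most single-byte encodings.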