Repository size question
Assume the following process:
Around 10K articles in different languages: 80% are Latin-based, 10% Greek-based and 10% Cyrillic.
My process uses Unicode range filtering to determine the language of the content, and then stores the matching articles in a repository for further handling.
So all articles identified as Cyrillic go to a Cyrillic repository,
all Greek articles go to a Greek repository,
and everything else goes to the Latin repository.
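For readers who want to see the routing idea outside of RapidMiner, here is a minimal Python sketch of Unicode range filtering. It is not the actual process from this thread: the function names, the 50% threshold, and the exact block ranges chosen are assumptions made for illustration.

```python
# Unicode blocks used for detection (assumed selection, not from the original process).
GREEK_RANGES = [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]     # Greek and Coptic, Greek Extended
CYRILLIC_RANGES = [(0x0400, 0x04FF), (0x0500, 0x052F)]  # Cyrillic, Cyrillic Supplement


def count_in_ranges(chars, ranges):
    """Count the characters whose code point falls inside any of the given ranges."""
    return sum(1 for ch in chars if any(lo <= ord(ch) <= hi for lo, hi in ranges))


def classify_script(text, threshold=0.5):
    """Return 'greek', 'cyrillic' or 'latin' (the catch-all) for one article."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "latin"
    if count_in_ranges(letters, GREEK_RANGES) / len(letters) >= threshold:
        return "greek"
    if count_in_ranges(letters, CYRILLIC_RANGES) / len(letters) >= threshold:
        return "cyrillic"
    return "latin"


def route(articles):
    """Split articles into one bucket per detected script."""
    buckets = {"greek": [], "cyrillic": [], "latin": []}
    for article in articles:
        buckets[classify_script(article)].append(article)
    return buckets
```

Calling `route(["Hello world", "Привет мир", "Καλημέρα κόσμε"])` would place one article in each bucket, which mirrors the three-repository setup described above.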
So far so good, this was rather easy to accomplish, but I noticed that all my repositories are (almost) equal in size, even though the Cyrillic and Greek content each account for only about 10% of the total and should therefore be much smaller than the Latin one.
Loading the Greek repo indeed shows only Greek content, and the Cyrillic one only Cyrillic, so it seems OK.
However, when viewing the repository files directly with a text editor, all of them contain basically all of the content: both the filtered content (which is shown when opening the repo) and the inverted results (which remain hidden).
Why is this redundant data stored, and how can I get rid of it to reduce the size?
Thanks!
Answers
Was your text attribute set to polynominal or text? If you use text, this should not happen. With polynominal, RapidMiner uses a mapping table in the background, and this mapping table is not cleaned up when you filter. You need to use Remove Unused Values to force the cleanup.
~Martin
Dortmund, Germany
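To illustrate the mapping-table behaviour described in the answer, here is a toy Python model. It is only a sketch of the general idea (rows stored as indices into a shared value table) and does not reflect RapidMiner's actual internals; the class `NominalColumn` and its methods are hypothetical names.

```python
class NominalColumn:
    """Toy nominal column: rows are integer indices into a value mapping table."""

    def __init__(self, values):
        self.mapping = []   # distinct values, in order of first appearance
        self.indices = []   # one index into self.mapping per row
        lookup = {}
        for value in values:
            if value not in lookup:
                lookup[value] = len(self.mapping)
                self.mapping.append(value)
            self.indices.append(lookup[value])

    def values(self):
        """Return the visible row values."""
        return [self.mapping[i] for i in self.indices]

    def filter(self, keep):
        """Keep only rows whose value satisfies keep(); the mapping is NOT pruned."""
        result = NominalColumn([])
        result.mapping = self.mapping
        result.indices = [i for i in self.indices if keep(self.mapping[i])]
        return result

    def remove_unused(self):
        """Rebuild the mapping so it holds only values still referenced by a row."""
        return NominalColumn(self.values())


column = NominalColumn(["Hello world", "Привет мир", "Καλημέρα κόσμε"])
greek_only = column.filter(lambda v: v == "Καλημέρα κόσμε")
cleaned = greek_only.remove_unused()
```

In this model `greek_only.values()` shows just the Greek article while `greek_only.mapping` still holds all three texts, which matches the symptom in the question; `cleaned.mapping` holds only the Greek one. Storing the column as text instead would avoid the shared table altogether.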