Repository size question

kayman · April 2016

Assume the following process:

Around 10K articles in different languages: 80% are Latin-based, 10% Greek-based and 10% Cyrillic-based.

My process uses Unicode range filtering to determine the language of the content, and then stores the matching articles in a repository for further handling.

So all articles identified as Cyrillic go to a Cyrillic repository,
all Greek articles go to a Greek repository,
and everything else goes to the Latin repository.
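
For readers outside RapidMiner, the routing described above can be sketched in plain Python. This is only an illustration, not the actual process: the names `detect_script` and `route` are hypothetical, and the sketch assumes an article belongs to the script that most of its letters fall into, with everything that is not clearly Greek or Cyrillic going to the Latin bucket.

```python
# Unicode blocks used for the filter (from the Unicode code charts).
GREEK_RANGES = [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]  # Greek and Coptic, Greek Extended
CYRILLIC_RANGES = [(0x0400, 0x052F)]                 # Cyrillic, Cyrillic Supplement


def _in_ranges(ch, ranges):
    """Return True if the code point of ch lies in any of the given ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def detect_script(text):
    """Hypothetical helper: classify text as 'greek', 'cyrillic' or 'latin'.

    Only letters are counted. Anything that is not predominantly Greek or
    Cyrillic falls through to 'latin', mirroring the "everything else" rule.
    """
    greek = cyrillic = other = 0
    for ch in text:
        if not ch.isalpha():
            continue
        if _in_ranges(ch, GREEK_RANGES):
            greek += 1
        elif _in_ranges(ch, CYRILLIC_RANGES):
            cyrillic += 1
        else:
            other += 1
    if cyrillic > max(greek, other):
        return "cyrillic"
    if greek > max(cyrillic, other):
        return "greek"
    return "latin"


def route(articles):
    """Hypothetical helper: split articles into one list per target repository."""
    repos = {"latin": [], "greek": [], "cyrillic": []}
    for article in articles:
        repos[detect_script(article)].append(article)
    return repos
```

Calling `route(["Hello world", "Привет мир", "Καλημέρα κόσμε"])` would place one article in each of the three lists.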

So far so good, this was rather easy to accomplish. But I noticed that all my repositories are (almost) equal in size, even though the Cyrillic and Greek content each account for only about 10% of the total, so those repositories should be much smaller than the Latin one.

Loading the Greek repo indeed shows only Greek content, and the Cyrillic repo only Cyrillic, so it seems OK.

However, when viewing the directory directly with a text editor, all of them show basically all of the content: both the filtered content (which is shown when opening the repo) and the inverted results (which remain hidden).

Why is this redundant data stored, and how can I get rid of it to reduce the size?

Thanks!

MartinLiebig · April 2016

Hi,

Was your text set to polynominal or text? If you use text, this should not happen. With polynominal, RM uses a mapping table in the background. This mapping table is not cleaned up when you filter. You need to use Remove Unused to force the cleanup.
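
The effect can be pictured with a small Python sketch of a dictionary-encoded column. This is an assumption-laden illustration, not RapidMiner's actual internals: the class `NominalColumn` and its methods are hypothetical names, and the only point is that filtering rows leaves the shared mapping table untouched until it is rebuilt explicitly.

```python
class NominalColumn:
    """Hypothetical dictionary-encoded column: rows store integer codes,
    and a mapping table translates each code back to its string value."""

    def __init__(self, values):
        self.mapping = []   # code -> string value
        self._index = {}    # string value -> code
        self.codes = []     # one code per row
        for value in values:
            if value not in self._index:
                self._index[value] = len(self.mapping)
                self.mapping.append(value)
            self.codes.append(self._index[value])

    def values(self):
        """Return the visible row values."""
        return [self.mapping[code] for code in self.codes]

    def filter(self, predicate):
        """Keep only matching rows; the mapping table is reused as-is."""
        result = NominalColumn([])
        result.mapping = self.mapping
        result._index = self._index
        result.codes = [c for c in self.codes if predicate(self.mapping[c])]
        return result

    def remove_unused(self):
        """Rebuild the column so the mapping holds only values still in use."""
        return NominalColumn(self.values())


column = NominalColumn(["hello", "привет", "γεια"])
greek_only = column.filter(lambda s: s == "γεια")
cleaned = greek_only.remove_unused()
```

In this sketch `greek_only.values()` shows only the Greek row, yet `greek_only.mapping` still holds all three strings, which matches the symptom of a filtered repository that is as large as the unfiltered one. After `remove_unused()` the mapping holds a single entry.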

~Martin

kayman · April 2016

Thanks Martin, using Remove Unused did the trick.


Altair RapidMiner Community
