Repository size question

kayman Member Posts: 368   Unicorn
edited November 2018 in Help
Assume the following process:

Around 10K articles in different languages: 80% are Latin based, 10% Greek based and 10% Cyrillic.

My process uses Unicode range filtering to determine the language of the content, and then stores the matching articles in a repository for further handling.

So all articles identified as Cyrillic go to a Cyrillic repository,
all Greek articles go to a Greek repository,
and everything else goes to the Latin repository.
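
For readers who want to reproduce the routing outside RapidMiner, here is a minimal Python sketch of Unicode range filtering. It is not the process from this thread: the function names `detect_script` and `route` are hypothetical, and picking the script with the most letters is an assumption about how "identified as" is decided. The ranges are the Unicode "Greek and Coptic" (U+0370 to U+03FF) and "Cyrillic" (U+0400 to U+04FF) blocks.

```python
def detect_script(text):
    """Return 'cyrillic', 'greek' or 'latin' based on which script has the most letters."""
    counts = {"latin": 0, "greek": 0, "cyrillic": 0}
    for ch in text:
        if not ch.isalpha():
            continue                      # ignore digits, spaces, punctuation
        cp = ord(ch)
        if 0x0400 <= cp <= 0x04FF:        # Cyrillic block
            counts["cyrillic"] += 1
        elif 0x0370 <= cp <= 0x03FF:      # Greek and Coptic block
            counts["greek"] += 1
        else:                             # everything else is treated as Latin
            counts["latin"] += 1
    # On a tie (or empty text) max() returns the first key, i.e. 'latin'
    return max(counts, key=counts.get)


def route(articles):
    """Split a list of article texts into one list per detected script."""
    repos = {"latin": [], "greek": [], "cyrillic": []}
    for article in articles:
        repos[detect_script(article)].append(article)
    return repos


repos = route(["Hello world", "Привет мир", "Καλημέρα κόσμε"])
```

Each list in `repos` then holds only the articles of one script, which is the point where they would be written to separate repositories.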

So far so good, this was rather easy to accomplish. But I noticed that all my repositories are (almost) equal in size, even though the Cyrillic and Greek content each account for only about 10% of the total and should therefore be much smaller than the Latin one.

Loading the Greek repo indeed shows only Greek content, and the Cyrillic repo only Cyrillic, so that seems OK.

However, when I open the repository files directly in a text editor, each of them contains basically all of the content: both the filtered content (which is shown when opening the repo) and the inverted results (which remain hidden).

Why is this redundant data stored, and how can I get rid of it to reduce the size?

Thanks!

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,132  RM Data Scientist
    Hi,

    Was your text attribute set to polynominal or text? If you use text, this should not happen. With polynominal, RM uses a mapping table in the background, and this mapping table is not cleaned up when you filter. You need to use Remove Unused Values to force this.

    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • kayman Member Posts: 368   Unicorn
    Thanks Martin, using Remove Unused Values did the trick.
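
The mapping-table behaviour described in the answer above can be illustrated with a toy model in Python. This is only a sketch of the general idea (rows store integer indices into a shared table of strings), not RapidMiner's actual implementation; the class `NominalColumn` and its methods are hypothetical names.

```python
class NominalColumn:
    """Toy nominal column: each row is an integer index into a mapping table."""

    def __init__(self, values):
        self.mapping = []     # index -> string value
        self.indices = []     # one integer per row
        lookup = {}
        for v in values:
            if v not in lookup:
                lookup[v] = len(self.mapping)
                self.mapping.append(v)
            self.indices.append(lookup[v])

    def values(self):
        """Strings visible when the column is loaded."""
        return [self.mapping[i] for i in self.indices]

    def filter(self, keep):
        """Drop rows that fail `keep`; the mapping table is carried over unpruned."""
        out = NominalColumn([])
        out.mapping = self.mapping
        out.indices = [i for i in self.indices if keep(self.mapping[i])]
        return out

    def remove_unused(self):
        """Rebuild the mapping so it holds only values still referenced by a row."""
        return NominalColumn(self.values())


articles = NominalColumn(["Hello", "Привет", "Γειά"])
greek = articles.filter(lambda s: s == "Γειά")
compact = greek.remove_unused()
```

In this model `greek.values()` shows only the Greek row, while `greek.mapping` still holds all three strings, which matches the symptom of a filtered repository that is as large as the original. After `remove_unused()` the mapping shrinks to the single value that is still in use.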