Repository size question
I have around 10K articles in different languages: about 80% are Latin-based, 10% Greek and 10% Cyrillic.
My process uses Unicode range filtering to determine the language of the content, and then stores the matching articles in a repository for further handling.
So all articles identified as Cyrillic go to a Cyrillic repository,
all Greek articles go to a Greek repository,
and everything else goes to the Latin repository.
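
For illustration only, this kind of range-based routing could look roughly like the Python sketch below. It is not the actual process: the function names, the 50% threshold and the repository labels are assumptions, and only the basic Greek and Cyrillic Unicode blocks are checked.

```python
# Basic Unicode blocks only; extended blocks (e.g. Greek Extended,
# Cyrillic Supplement) are deliberately left out of this sketch.
GREEK = (0x0370, 0x03FF)     # Greek and Coptic
CYRILLIC = (0x0400, 0x04FF)  # Cyrillic


def count_in_range(chars, lo, hi):
    """Count characters whose code point lies in [lo, hi]."""
    return sum(1 for ch in chars if lo <= ord(ch) <= hi)


def pick_repository(text, threshold=0.5):
    """Return a hypothetical repository label for one article.

    An article is routed to 'cyrillic' or 'greek' when at least
    `threshold` of its letters fall in that block; everything else
    goes to 'latin'. The threshold value is an assumption.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "latin"
    total = len(letters)
    if count_in_range(letters, *CYRILLIC) / total >= threshold:
        return "cyrillic"
    if count_in_range(letters, *GREEK) / total >= threshold:
        return "greek"
    return "latin"
```

With this sketch, each article ends up under exactly one label, so the three targets together should hold each article once.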
So far so good. This was rather easy to accomplish, but I noticed that all my repositories are (almost) equal in size, even though the Cyrillic and Greek content each account for about 10% of the content and should therefore be much smaller than the Latin one.
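
To put numbers on the size comparison, a sketch like the following could total the bytes on disk for each repository directory. The directory names are hypothetical placeholders, and it assumes each repository is an ordinary directory of files.

```python
import os


def directory_size(path):
    """Return the total size in bytes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    # Hypothetical repository locations; replace with the real paths.
    for repo in ("repo_latin", "repo_greek", "repo_cyrillic"):
        print(repo, directory_size(repo), "bytes")
```

If the filtering reduced what is stored, the Greek and Cyrillic totals would be expected to come out at a fraction of the Latin one.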
Loading the Greek repo indeed shows only Greek content, and the Cyrillic one only Cyrillic, so the filtering itself seems OK.
However, when I view the repository directories directly with a text editor, all of them contain basically all of the content: both the filtered content (which is shown when opening the repo) and the inverted results (which remain hidden).
Why is this redundant data stored, and how can I get rid of it to reduce the size?