"Filtering by term frequency"

samtpfote Member Posts: 1 Contributor I
edited June 2019 in Help
Hello everybody,

I would like to get all terms of an HTML collection that appear in more than 99% of the documents.

But how can I:
  -  get the number of documents in my collection, and
  -  calculate the value #Term (in documents) / #documents?

It would be really great if someone could help me!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hello samtpfote,

    You can use Wordlist to Data to convert the word list output of Process Documents into a data set. Then you can be creative with Generate Attributes and Filter Examples to compute and extract all the information that you need.

    The total number of documents corresponds to the number of examples in the exa (example set) output of Process Documents. You can extract that number into a macro with the Extract Macro operator.
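
For readers who want to check the arithmetic outside RapidMiner, here is a minimal Python sketch of the same calculation: count in how many documents each term occurs, divide by the number of documents, and keep the terms above a threshold. It is not RapidMiner code; the function name `frequent_terms`, the sample documents, and the assumption that each document is already tokenized into a list of terms are all hypothetical and only serve as illustration.

```python
from collections import Counter


def frequent_terms(documents, threshold=0.99):
    """Return {term: document fraction} for terms above the threshold.

    `documents` is assumed to be a list of token lists, one per document.
    """
    n_docs = len(documents)  # total number of documents in the collection
    if n_docs == 0:
        return {}

    doc_freq = Counter()
    for tokens in documents:
        # set() makes each term count at most once per document
        doc_freq.update(set(tokens))

    # ratio = (#documents containing the term) / (#documents)
    return {
        term: count / n_docs
        for term, count in doc_freq.items()
        if count / n_docs > threshold
    }


# Hypothetical toy collection of four tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "sang"],
]

# A lower threshold is used here only because the toy collection is tiny
print(frequent_terms(docs, threshold=0.7))
```

In the RapidMiner process, the per-term document count would come from the data set produced by Wordlist to Data, the document total from the macro, and the comparison against 0.99 from Filter Examples.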

    If you have further questions, please come back!

    All the best,
    Marius