I need help building taxonomies from a large number of documents

boatanchorguy Member Posts: 3 Contributor I
edited November 2018 in Help
Thank you to Marius for the Read Before Posting instructions. Following his suggestions:

1. Describe what you are doing.
I need to build many taxonomies from a large number of documents.

2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
Over months of extensive searching I have collected several thousand documents that I need to process: mostly PDF, with some MS Word, Excel, and PowerPoint files.

3. Describe which results or actions you are expecting.
I need good, clean taxonomies for many different topics. I am hoping to set up a proper method using RapidMiner, but there does not seem to be an obvious pathway to do this.

Ideally, for each topic, the method would a) pre-process the documents, filtering for items such as the proper word or key phrase in the title or the abstract; b) assemble the filtered documents; c) (optionally) extract tables of contents, indices, glossaries, etc.; d) extract and amalgamate the sub-topics appropriate to the particular topic; and e) generate the taxonomy.
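
As a rough illustration only, the flow in steps a), b), d) and e) might look something like the following in plain Python (not RapidMiner). Everything here is an assumption made up for the example: the sample records, the field names (`title`, `abstract`, `headings`), and the helper functions `filter_documents` and `build_taxonomy`. It also assumes the text has already been extracted from the PDF and Office files, and it only produces a flat, one-level list of sub-topics rather than a full taxonomy.

```python
from collections import Counter

# Hypothetical records: in practice these fields would come from text
# already extracted from the PDF/Word/Excel/PowerPoint files.
documents = [
    {"title": "Solar Panel Basics",
     "abstract": "Overview of photovoltaic cells.",
     "headings": ["Photovoltaic Cells", "Inverters"]},
    {"title": "Wind Turbine Design",
     "abstract": "Blade and tower engineering.",
     "headings": ["Blades", "Towers"]},
    {"title": "Home Energy Guide",
     "abstract": "Covers solar panel installation.",
     "headings": ["Inverters", "Mounting"]},
]


def filter_documents(docs, phrase):
    """Steps a) and b): keep documents whose title or abstract contains the phrase."""
    phrase = phrase.lower()
    return [d for d in docs
            if phrase in d["title"].lower() or phrase in d["abstract"].lower()]


def build_taxonomy(docs, topic):
    """Steps d) and e): merge headings into a ranked list of sub-topics."""
    counts = Counter()
    for d in docs:
        # Normalise case and count each heading once per document.
        counts.update({h.strip().lower() for h in d["headings"]})
    # Sub-topics shared by the most documents first, ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {topic: [name for name, _ in ordered]}


selected = filter_documents(documents, "solar panel")
taxonomy = build_taxonomy(selected, "solar panel")
print(taxonomy)
```

Ranking sub-topics by how many filtered documents share them is just one simple choice for the amalgamation step; a real process would likely need stemming, synonym handling, and some way to nest sub-topics.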

I am new to RapidMiner, and relatively new to data mining in general, so please keep it simple for me.

Please help me understand any and all methods I could use to accomplish this.

And please let me know if I am following the proper procedures for this forum, or how I can improve this post.

Thank you very much.
