Text mining PDF articles while omitting references

mlubicz Member, University Professor Posts: 17
edited June 2019 in Help
A previous post, https://community.rapidminer.com/discussion/53107/text-mining-of-multiple-pdf-files-with-separate-key-word-counts, described an approach for mining multiple PDF files.
If the PDFs are articles, is there a way to exclude the References section from being mined? The section often starts with the same term (e.g. 'References'), so I tried to define a Split or a specific Tokenize option, but without success.
I would be grateful for any suggestions.

Best Answers

  • mlubicz Member, University Professor Posts: 17
    Solution Accepted
    Thank you for the inspiration. In fact, the task was to split each PDF document into main text and references, run text mining on the main text only, and save the references as an example set (e.g. xlsx), which is a desirable by-product.
    I experimented with Split File by Content and Split File by Point, which does the same job; however, it is more convenient to have one file rather than multiple segments.
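
For readers who want to try the same split outside RapidMiner, here is a minimal Python sketch. It is not the process used above. It assumes the PDF text has already been extracted to a plain string, that the reference list opens with a heading on a line of its own (the heading words in `HEADING` are guesses to adjust per corpus), and that each reference sits on one line. The names `split_references` and `reference_entries` are hypothetical helpers made up for this illustration.

```python
import re

# Assumed headings that open the reference list; adjust for your articles.
# The heading must stand alone on its line, so a sentence that merely
# mentions "references" in the body text does not trigger the split.
HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography|works cited)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_references(text):
    """Return (main_text, references), split at the last heading line found."""
    matches = list(HEADING.finditer(text))
    if not matches:
        # No heading found: treat the whole document as main text.
        return text, ""
    last = matches[-1]
    return text[:last.start()].rstrip(), text[last.end():].strip()


def reference_entries(references):
    """Naive assumption: one reference per non-empty line."""
    return [line.strip() for line in references.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = (
        "Intro text.\n"
        "See references in section 2.\n"
        "References\n"
        "Smith J. (2010) A.\n"
        "\n"
        "Doe A. (2012) B.\n"
    )
    main_text, refs = split_references(sample)
    print(main_text)
    print(reference_entries(refs))
```

Splitting at the last matching heading rather than the first is a deliberate choice here: a table of contents or an earlier section may also contain a line reading "References". The list returned by `reference_entries` can then be written out one entry per row, for example to a CSV or xlsx file, while only `main_text` goes on to text mining.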