Text Mining: analyse PDFs with a dictionary which has categories

nsmith Member Posts: 5 Learner I
edited August 28 in Help

Hello,

I want to analyse 35 PDFs against a kind of dictionary. The output of the analysis should be an Excel file that shows how often each word of the dictionary appears in the PDFs. It may be important to know that the dictionary is not just a flat list of words; the words are classified into five categories. So the analysis should tell me how much is reported on the dictionary words and which category is reported on the most.
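
In Python-like pseudo-code, the analysis I have in mind looks roughly like this sketch (the folder name and the example terms and categories are just placeholders, and pypdf/pandas only stand in for whatever RapidMiner does internally):

    import re
    from pathlib import Path

    import pandas as pd
    from pypdf import PdfReader

    # placeholder dictionary: term -> category (mine has five categories)
    dictionary = {"digital": "Technology", "digital products": "Technology"}

    rows = []
    for pdf_path in Path("reports").glob("*.pdf"):      # placeholder folder with the 35 PDFs
        pages = PdfReader(pdf_path).pages
        text = " ".join((page.extract_text() or "") for page in pages)
        tokens = re.findall(r"[a-zäöüß]+", text.lower())
        for term, category in dictionary.items():
            parts = term.lower().split()
            # multi-word terms are matched as runs of consecutive tokens
            n = sum(1 for i in range(len(tokens) - len(parts) + 1)
                    if tokens[i:i + len(parts)] == parts)
            rows.append({"file": pdf_path.name, "term": term,
                         "category": category, "count": n})

    result = pd.DataFrame(rows)
    result.to_excel("term_counts.xlsx", index=False)    # per-term counts per PDF (needs openpyxl)
    print(result.groupby("category")["count"].sum())    # which category appears most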

I have already read lots of questions here and watched tutorials, but I could not find exactly what I need. Trial and error hasn't worked so far either. I hope someone can help me.

Many thanks in advance,

Nina

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,636 RM Data Scientist
    Hi,
    This really depends on the format of your PDFs. Did you try to just read one of them using the Read Document operator?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • nsmith Member Posts: 5 Learner I
    edited August 28
    Yes, I read the PDFs with the Read Document operator, and that works. The problem is the dictionary. I'm not able to filter the PDFs with my dictionary (which consists of words in an Excel file) so that I can see how often each word appears in the PDFs. Furthermore, I don't know how I can take account of the categories in my dictionary, i.e. whether RapidMiner can recognise categories in a dictionary (for example if each category is written in its own tab of my Excel file) or whether I need an additional operator for that.
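
    For illustration, this is how I imagine the dictionary being read in if each category sits in its own tab of the Excel file (a Python sketch; the file name and the assumption that the words are in the first column of each tab are placeholders):

    import pandas as pd

    # read every tab of the dictionary workbook: {tab name: DataFrame}
    sheets = pd.read_excel("dictionary.xlsx", sheet_name=None)

    # flatten into one term -> category mapping, using the tab names as categories
    term_to_category = {
        str(term).lower(): tab
        for tab, frame in sheets.items()
        for term in frame.iloc[:, 0].dropna()   # words assumed in the first column
    }
    print(term_to_category)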

    This is how I tried to get my desired result:


    <?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="179" y="85">
            <parameter key="file" 
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_absolute" value="1"/>
            <parameter key="prune_above_absolute" value="999999"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="187">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="112" y="289"/>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="313" y="85">
                <parameter key="max_length" value="2"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="313" y="187">
                <parameter key="min_chars" value="2"/>
                <parameter key="max_chars" value="25"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="581" y="187">
                <parameter key="file" 
                <parameter key="case_sensitive" value="false"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    Unfortunately it doesn't work.

    Thanks for your help,
    Nina

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    It sounds like you want to use a specific wordlist and then count the words based on that wordlist (whose words are further grouped into five categories). You should be able to feed your desired wordlist into the word list input port of the Process Documents operator. You can then use the WordList to Data operator on the resulting wordlist to turn it into a normal dataset, which you can summarize and combine with your grouping to do the category analysis.
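
    In plain Python terms, that post-processing step would look roughly like this (a sketch; the counts and column names are made up for illustration):

    import pandas as pd

    # stand-in for the output of WordList to Data (made-up numbers)
    wordlist = pd.DataFrame({"word": ["digital", "sustainability"],
                             "occurrences": [42, 17]})
    # stand-in for the five-category grouping
    categories = pd.DataFrame({"word": ["digital", "sustainability"],
                               "category": ["Technology", "Environment"]})

    merged = wordlist.merge(categories, on="word", how="left")
    print(merged.groupby("category")["occurrences"].sum())   # totals per category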

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I
    Thanks for your answer @Telcontar120!
    Yes, you're right. I have a word list with key words (which are categorized) and want to scan all my PDFs for these words, so I only want to see these words and their occurrences in the result view.
    I tried your proposal, but I couldn't put the wordlist into the input port and connect it with the Process Documents operator; an error occurred. Furthermore, I'm not sure where to add all the PDFs that should be analysed. Are both the wordlist and the PDFs set as inputs for the Process Documents operator?

    I hope my problem is not too confusing. Maybe it helps to have a look at the XML I posted before. 

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    @mschmitz is there a way to import a wordlist from an external file to be used as input for Process Documents? Or is there a relevant converter? Looking at the operator more closely, it seems to require a wordlist already in RapidMiner format, which normally can only be generated by another Process Documents operator. Of course it would be possible to work around this by putting the desired wordlist as text into one Process Documents operator merely to generate the wordlist that feeds another Process Documents operator, but this seems somewhat inefficient, and I am wondering if there is a more direct path.
    @nsmith see my comments above regarding the wordlist input. It may be that you need to generate your wordlist first. Regarding the PDFs, you can use Process Documents from Files and set its parameters to read your PDF files from your hard drive.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,636 RM Data Scientist
    I think there is no way to generate a word list directly. Keep in mind that the wordlist also contains normalization factors for TF-IDF etc. But I think we can just build the full occurrence matrix here and filter the attributes later for the ones we are interested in. Alternatively, you can just use the Filter Tokens Using ExampleSet operator inside Process Documents.
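
    As a sketch of that idea outside RapidMiner (scikit-learn here; the documents and the word list are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["we accelerate digital products", "our digital strategy"]  # placeholders
    dictionary = {"digital", "products"}                               # placeholder word list

    vectorizer = CountVectorizer()                 # full occurrence matrix first ...
    matrix = vectorizer.fit_transform(docs)
    vocab = vectorizer.get_feature_names_out()

    keep = [i for i, term in enumerate(vocab) if term in dictionary]
    print([vocab[i] for i in keep])                # ... then filter to the word list
    print(matrix[:, keep].toarray())               # counts per document and kept term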

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    @mschmitz thanks, yes, Filter Tokens Using ExampleSet should have the equivalent effect.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I

    @mschmitz @Telcontar120 thank you very much for your answers, it's nearly working now! :)

    Unfortunately there is still one problem with the "Filter Tokens Using ExampleSet" operator. I want to filter with my word list, which contains two kinds of entries:

    1. Single words (like "digital")
    2. Terms with two or more words (like "digital products")

    In general it works because I use the "Generate n-Grams" operator beforehand, so all the stand-alone words and terms I specified appear in the result list. The problem is that the operator also generates terms which I did not explicitly put in the word list. An example is "accelerating_digital". Even though this term is not in my word list, I want to have it in my result list because it contains the word "digital" (which is in my word list).

    Is there a way to solve this problem?
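
    What I'm effectively looking for is a containment filter, roughly like this (Python sketch; the terms are just examples):

    wordlist = {"digital", "digital_products"}   # example word-list entries

    def keep(token: str) -> bool:
        # keep a token if it, or any part of the n-gram, is in the word list
        return token in wordlist or any(p in wordlist for p in token.split("_"))

    tokens = ["accelerating_digital", "digital", "annual_report"]
    print([t for t in tokens if keep(t)])        # -> ['accelerating_digital', 'digital']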

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    If you change the order of your operators, you should be able to resolve this. You may need to redo some work: filter the text using your word list first, then generate the resulting word vector, and use the Generate n-Grams operator to build the combinations after that.
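
    The reordered flow would behave roughly like this (Python sketch; the tokens and word list are examples):

    wordlist = {"digital", "products", "sustainability"}        # example word list

    tokens = ["accelerating", "digital", "products", "report"]  # after tokenizing
    kept = [t for t in tokens if t in wordlist]                 # filter first
    bigrams = ["_".join(p) for p in zip(kept, kept[1:])]        # then build n-grams
    print(kept + bigrams)   # -> ['digital', 'products', 'digital_products']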

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I
    edited September 16
    Thank you so much for your fast answer @Telcontar120! I tried a few possibilities and rearranged the operators, but it doesn't really work. Either I get no results in the result list, or I get results but on checking them I see that not every word that is in both the word list and the text shows up in the result list.
    This is an example of a process I tried:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="loop_files" compatibility="9.7.002" expanded="true" height="103" name="Loop Files" width="90" x="112" y="34">
            <parameter key="directory" value="C:/Users/Nina Schmidt/Documents/Master/Masterarbeit/Geschäftsberichte/Konsumgüter und Handel/2019"/>
            <parameter key="filtered_string" value="file name (last part of the path)"/>
            <parameter key="file_name_macro" value="file_name_TEST"/>
            <parameter key="file_path_macro" value="file_path"/>
            <parameter key="parent_path_macro" value="parent_path"/>
            <parameter key="recursive" value="false"/>
            <parameter key="iterate_over_files" value="true"/>
            <parameter key="iterate_over_subdirs" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
                <parameter key="extract_text_only" value="true"/>
                <parameter key="use_file_extension_as_type" value="true"/>
                <parameter key="content_type" value="pdf"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="absolute"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_absolute" value="1"/>
                <parameter key="prune_above_absolute" value="9999"/>
                <parameter key="prune_below_rank" value="5.0"/>
                <parameter key="prune_above_rank" value="5.0"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="45" y="85">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="85">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="85"/>
                  <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="85">
                    <parameter key="min_chars" value="2"/>
                    <parameter key="max_chars" value="30"/>
                  </operator>
                  <operator activated="true" class="retrieve" compatibility="9.7.002" expanded="true" height="68" name="Retrieve wortliste final_final" width="90" x="313" y="238">
                    <parameter key="repository_entry" value="data/wortliste final_final"/>
                  </operator>
                  <operator activated="true" class="operator_toolbox:filter_tokens_using_exampleset" compatibility="2.6.000" expanded="true" height="82" name="Filter Tokens Using ExampleSet" width="90" x="648" y="187">
                    <parameter key="attribute" value="att1"/>
                    <parameter key="case_sensitive" value="false"/>
                    <parameter key="invert_filter" value="true"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
                  <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
                  <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
                  <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens Using ExampleSet" to_port="document"/>
                  <connect from_op="Retrieve wortliste final_final" from_port="output" to_op="Filter Tokens Using ExampleSet" to_port="example set"/>
                  <connect from_op="Filter Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="82" name="Process Documents (2)" width="90" x="380" y="85">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <process expanded="true">
                  <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="313" y="136">
                    <parameter key="max_length" value="2"/>
                  </operator>
                  <connect from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
                  <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:wordlist_to_data" compatibility="9.3.001" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="85"/>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
              <connect from_op="Process Documents (2)" from_port="word list" to_op="WordList to Data" to_port="word list"/>
              <connect from_op="WordList to Data" from_port="word list" to_port="out 1"/>
              <connect from_op="WordList to Data" from_port="example set" to_port="out 2"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="9.7.002" expanded="true" height="82" name="Append" width="90" x="313" y="85">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <operator activated="true" class="write_excel" compatibility="9.7.002" expanded="true" height="103" name="Write Excel (2)" width="90" x="447" y="136">
            <parameter key="file_format" value="xlsx"/>
            <enumeration key="sheet_names"/>
            <parameter key="sheet_name" value="RapidMiner Data"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="number_format" value="#.0"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="write_file" compatibility="9.7.002" expanded="true" height="68" name="Write File" width="90" x="648" y="187">
            <parameter key="resource_type" value="file"/>
            <parameter key="filename" value="C:/Users/Nina Schmidt/Documents/Master/Masterarbeit/Analyse Geschäftsberichte/Konsumgüter und Handel/Konsumgüter und Handel 2019.xlsx"/>
            <parameter key="mime_type" value="application/octet-stream"/>
          </operator>
          <connect from_op="Loop Files" from_port="out 2" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Write Excel (2)" to_port="input"/>
          <connect from_op="Write Excel (2)" from_port="file" to_op="Write File" to_port="file"/>
          <connect from_op="Write Excel (2)" from_port="through" to_port="result 1"/>
          <connect from_op="Write File" from_port="file" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>



    I also tried placing the "Generate n-Grams" operator at the end of the same "Process Documents" operator that contains the "Filter Tokens" operator. Nothing has really worked so far.