Content analysis of annual corporate reports with text processing
I have a question about a text mining project I want to do for my master's thesis.
I want to do a content analysis of corporate disclosure. To do this, I want to train a model on an example set (an Excel list of representative sentences, each classified into one of 6 topic categories). After that, I want to apply the model to several unseen annual reports (PDF format) to measure how much each company discloses about those 6 categories.
Now I am a little lost in choosing the right transformation steps for the annual reports. I could tokenize the documents into sentences so that I get a full list of sentences. But I don't actually need every sentence to be assigned to a category. I only want the model to measure how much of the content of each annual report refers to each of the 6 topics.
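To make the idea more concrete, here is a rough sketch of what I have in mind, not a finished solution. It assumes the report text has already been extracted from the PDF into a plain string (that step is not shown), uses a naive regex sentence splitter, and trains a small Naive Bayes classifier written from scratch. The two category names, the example sentences, the `"other"` bucket, and the 0.6 confidence threshold are all made up for illustration; the real project would use the 6 categories from my Excel list.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def train_naive_bayes(labeled_sentences):
    """Count words per category from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for sentence, label in labeled_sentences:
        tokens = tokenize(sentence)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_proba(model, sentence):
    """Return a {label: probability} dict using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(sentence) if t in vocab]
    log_scores = {}
    for label in class_counts:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    # Normalise the log scores into probabilities.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}


def topic_shares(model, report_text, threshold=0.6):
    """Share of sentences per topic; low-confidence ones go to 'other'."""
    sentences = split_sentences(report_text)
    counts = Counter()
    for s in sentences:
        probs = predict_proba(model, s)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        counts[label if p >= threshold else "other"] += 1
    n = len(sentences)
    return {label: c / n for label, c in counts.items()} if n else {}


# Hypothetical training examples standing in for the Excel list.
training = [
    ("We reduced carbon emissions and energy use.", "environment"),
    ("Waste and water consumption decreased.", "environment"),
    ("Employee training and safety improved.", "employees"),
    ("We hired new staff and increased diversity.", "employees"),
]
model = train_naive_bayes(training)

# Hypothetical report text, assumed to be already extracted from a PDF.
report = (
    "Carbon emissions decreased this year. "
    "Staff training was expanded. "
    "The board met in March."
)
shares = topic_shares(model, report)
print(shares)
```

The threshold is how I would avoid forcing every sentence into a category: a sentence only counts towards a topic if the model is reasonably confident, and everything else lands in `"other"`. The per-topic shares of one report could then be compared across companies.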
Do you have any ideas, or has anybody done a similar project?
Thanks and best regards,