
Text mining

Tuba Member Posts: 9 Learner I

Hello. I am pursuing a master's degree in Business Information Management at Mersin University, and I will share my problem in detail at the link you sent. I use RapidMiner in my thesis, but I have run into two problems. I have 4,500 Turkish theses (about 150 pages each) and 1,500 articles (about 20 pages each), and I want to classify them with RapidMiner. But because the number of theses is so high, RapidMiner cannot handle the classification and constantly gives errors. How can I solve this problem? My PC has an i5 processor and 5 GB of RAM.
My second problem is that I want to extract the most frequently used words in my theses, but when I apply Stem (Snowball) to Turkish text, unrelated words come out, and words are not separated from their suffixes. So I cannot use the stemmer, and I end up with many words that have the same meaning. In short, I cannot make progress on my thesis. Can you help me?

Answers

  • MPB_ Member Posts: 45 Guru
    Hey @Tuba ,

    For the first problem:

    What error messages do you get?

    For the second problem:

    Could you please share your process?


  • Tuba Member Posts: 9 Learner I
    For the first problem: [screenshot]

  • Tuba Member Posts: 9 Learner I
    For the second problem: [screenshot]
  • Tuba Member Posts: 9 Learner I
    Let me briefly explain the subject of my thesis. I have articles and theses, which I divided into three main themes, and each main theme into sub-themes. My goal here is to find out to what extent I made these separations correctly. For this, I classified them with RapidMiner.
  • MPB_ Member Posts: 45 Guru
    Hey @Tuba,

    Thank you for sharing.

    I have two ideas for solving the first problem:

    1. You could expand your RAM using disk space; see here: https://www.youtube.com/watch?v=z_c60Osd__c

    2. You could divide the input data into smaller parts, or build a few subprocesses for preprocessing the data; see the sketch below.
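
    If you ever script the preprocessing outside RapidMiner, idea 2 could look roughly like this in Python (a minimal sketch; the folder name, file pattern, batch size, and the preprocess() step are placeholders, not anything from your actual process):

        from pathlib import Path

        BATCH_SIZE = 100  # tune this to what 5 GB of RAM can comfortably hold

        def iter_batches(corpus_dir, batch_size=BATCH_SIZE):
            """Yield the corpus as lists of document texts, batch_size files at a time."""
            paths = sorted(Path(corpus_dir).glob("*.txt"))
            for start in range(0, len(paths), batch_size):
                yield [p.read_text(encoding="utf-8") for p in paths[start:start + batch_size]]

        def preprocess(docs):
            """Placeholder for the real tokenizing/stemming step."""
            print(f"processing {len(docs)} documents")

        # Only one batch of documents is held in memory at a time.
        for batch in iter_batches("theses/"):
            preprocess(batch)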



    For solving the second problem, it would be nice to see some of your data and results.
    Have you tried it without stemming?

    Maybe you will find some examples of Turkish RapidMiner projects on the web?




  • Tuba Member Posts: 9 Learner I
    Thank you. I will try your suggestions for my first question.
    Sorry, I couldn't find a Turkish stemmer. When I run the process on a small amount of data, I can see the table, but words with different suffixes are listed separately; the stemmer did not treat them as one word.


  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi @Tuba, in your image I can see you have created over 7,000 attributes; I guess that is what makes your computer go nuts. Are you using any configuration for the pruning method?
    One tip for sharing your configuration and process images: try File > Print/Export Image..., then choose the design view and export the image.
    You can also use Loop Batches to reduce the amount of memory used while processing all your files.
  • Tuba Member Posts: 9 Learner I
    Thank you. I only know RapidMiner from watching videos, so I don't know how to configure the pruning. 😔

    I will try your second suggestion.
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Just split your input files up into batches and process each batch separately. Then make sure you use one of the pruning options for generating the wordlist in the Process Documents from Data operator. This will keep everything within your available memory and solve your first problem. You can then combine all the resulting wordlists at the end, as sketched below.
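
    In Python terms, the same "prune per batch, then merge the wordlists" idea would be roughly (a minimal sketch assuming scikit-learn; the two toy batches stand in for your real chunks of documents):

        from sklearn.feature_extraction.text import CountVectorizer

        # Each inner list is one batch of document texts (made-up placeholders).
        batches = [
            ["first thesis text", "second thesis text"],
            ["third thesis text", "fourth thesis text"],
        ]

        combined_wordlist = set()
        for batch in batches:
            vec = CountVectorizer(min_df=1)  # on real data, raise min_df (e.g. 5) to prune rare terms
            vec.fit(batch)
            combined_wordlist |= set(vec.get_feature_names_out())

        print(sorted(combined_wordlist))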
    I don't know enough about Turkish and stemming to have suggestions for the second issue, but perhaps google some questions related to stemming in Turkish and you may find some helpful resources.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Tuba Member Posts: 9 Learner I
    Thank you. I divided my data into 3 main themes, and each of them into 8 sub-themes. The articles can be categorized because there are few of them, but RapidMiner cannot classify the theses into even the 8 sub-themes of one theme. I still can't find a solution.
  • dedeer Member Posts: 9 Contributor II
    edited March 2020
    @Tuba
    I am not very good at text mining with RM; I feel more comfortable in Python. Stem (Snowball) actually works properly for English words, but it may not be the best choice for Turkish. If your theses are in Turkish, you may need to use Stem (Dictionary), which requires a document of stemming patterns in Turkish. Normally there are good dictionaries for Turkish words that can be used in R/Python; you can search for one.
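
    The idea behind a stem dictionary is just a lookup from inflected forms to their stems. A toy Python fragment (the entries are made up for illustration; this is not the exact file format the Stem (Dictionary) operator expects):

        # Minimal dictionary-based stemming: map each known inflected form to its stem.
        stem_dict = {
            "kitaplar": "kitap",   # "books" -> "book"
            "kitapları": "kitap",  # suffixed form of "books" -> "book"
            "evler": "ev",         # "houses" -> "house"
        }

        def dict_stem(tokens):
            """Replace each token with its dictionary stem, if one is known."""
            return [stem_dict.get(t, t) for t in tokens]

        print(dict_stem(["kitaplar", "ve", "evler"]))  # -> ['kitap', 've', 'ev']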

    In your post on 25 March, you are showing the exa (example set) port result from your Process Documents operator. If you connect the wor (word list) port to a res port, you can see the TF-IDF counts.
    This will also give you an idea about your documents, so you can further transform your dataset.
    I would recommend the Filter Tokens (by length) operator, so you can cut many words at once after you examine the word table; see the small sketch below.
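
    In Python, filtering by token length is a one-liner (the thresholds here are arbitrary examples, not recommended values):

        def filter_by_length(tokens, min_len=4, max_len=25):
            """Keep only tokens whose length falls within [min_len, max_len]."""
            return [t for t in tokens if min_len <= len(t) <= max_len]

        print(filter_by_length(["ve", "üniversite", "tez", "sınıflandırma"]))
        # -> ['üniversite', 'sınıflandırma']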


    I just made an example set from 5 academic papers about neural networks in flood forecasting.

    Is this something that can help you? You can further filter and select the data for modeling.

    Second, about classifying the documents: I didn't quite catch how you are planning to do it. Are you using metadata-like information to classify them, or just processing the documents and feeding a k-NN model? If you can tell me more, I may be able to help; the simplest version of that pipeline is sketched below.
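
    For instance (a minimal sketch assuming scikit-learn; the four-document corpus and its theme labels are made up):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline

        # Tiny made-up corpus: each document already carries the theme you assigned to it.
        texts = [
            "neural networks for flood forecasting",
            "deep learning models predict river floods",
            "customer segmentation in retail marketing",
            "marketing strategy and brand management",
        ]
        labels = ["hydrology", "hydrology", "marketing", "marketing"]

        model = make_pipeline(
            TfidfVectorizer(),                    # on real data, add min_df/max_df for pruning
            KNeighborsClassifier(n_neighbors=1),  # k=1 only because the toy corpus is tiny
        )
        model.fit(texts, labels)
        print(model.predict(["flood prediction with neural models"]))  # -> ['hydrology']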

    bests, 
    Deniz
  • Tuba Member Posts: 9 Learner I
    First of all, thank you very much. I could explain better if we could talk in Turkish.

    I will try the model you suggested.

    I divided the individual articles and theses into three main themes according to their topics, and the main themes into sub-themes. After that, I applied k-NN.

    I don't know how to code, and I don't know Python. I couldn't find a ready-made Turkish stemmer.

    I'm doing a master's degree at Mersin University, and the stemming and the sheer amount of data have really challenged me.



  • dedeer Member Posts: 9 Contributor II
    @Tuba
    Sure, we can discuss in Turkish; I'll write you a message.

    Bests