K-means clustering over 8000 text files
eman_alahmadi
Hi, I'm new to this platform. I want to use k-means to cluster 8000 text files that contain the tags of 8000 images. Is it possible to do this with RapidMiner? If so, what values of k and max runs would be suitable?
Regards
Answers
Yes, you can do that with RapidMiner. Just to be sure: the texts don't contain actual images, like JPGs or PNGs? If you want to do image mining, you have to install the Image Mining extension.
With respect to the optimal number of clusters, I usually use X-Means to figure that out automatically.
Here's a sample process that will get you started. You will need to install the Text Mining extension to do this.
Hello, thank you for the reply.
I started by installing the Text Processing extension from Extensions and Updates. Then I dragged in the "Process Documents from Files" operator and clicked "Edit List" beside the "text directories" label to choose the files I want to cluster. Next I opened the "Process Documents from Files" operator (by double-clicking it) and inserted the "Extract Content" operator into its process, followed by the "Tokenize" operator. Then I left the "Process Documents from Files" subprocess and dragged the standard k-Means operator into the Main Process frame, after the "Process Documents from Files" operator. I set k = 89 and max runs = 8000. Finally, I pressed the "Play" button. It has been running for 5 hours 16 minutes so far and has not finished. I don't know whether this is OK. Why has the run not finished yet?
Regarding your suggestion to use X-Means to find the number of clusters automatically: can you explain how I can do that?
Best regards.
This is the screenshot. Could you help me, please?
It's quite possible that it could take 10 hours. It's hard to say without knowing how wide your data set became after the Text Processing. I would consider pruning and getting your data set fully text processed before you do the clustering; this way you can speed up the process. Why do you need 89 clusters anyway?
You mean it's better to do the text processing first and then the clustering?
I chose k = 89 because it is the closest integer to the square root of 8000. Would you mind telling me how I can choose it automatically? Also, if the laptop restarts, does the run start over or continue from where it stopped?
Many thanks.
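As a quick check of the arithmetic behind k = 89, here is a small Python sketch. The second heuristic, sqrt(n / 2), is another commonly quoted rule of thumb and is an assumption added for comparison, not something proposed in this thread. Both give rough starting points only.

```python
import math

n_documents = 8000  # number of text files mentioned in this thread

# Heuristic used in the post above: k is about the square root of n.
k_sqrt = round(math.sqrt(n_documents))

# Assumption for comparison (not from the thread): k is about sqrt(n / 2).
k_half = round(math.sqrt(n_documents / 2))

print(k_sqrt, k_half)
```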
Just use the X-Means operator and set the k limits. The default has a min of 2 and a max of 60.
What I would do is put a Store operator right after the EXA port of the Process Documents from Files operator. This way you can save the processed text and inspect it. You could also try a Sample operator to take a random sample of maybe 500 rows to see how long it would take to process then.
In cases like this we usually suggest you use a RapidMiner Server on a dedicated box with lots of memory and cores. Of course that pre-supposes that you have a license that will unlock the cores and memory on the Server.
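Outside RapidMiner, the idea behind the Sample operator suggested above can be sketched in a few lines of Python: draw a reproducible random subset of documents and time the pipeline on that subset first. The function name `sample_documents`, the seed, and the placeholder file names are all hypothetical.

```python
import random

def sample_documents(file_names, size=500, seed=42):
    """Return a reproducible random subset of at most `size` file names."""
    rng = random.Random(seed)  # fixed seed so repeated test runs use the same subset
    if size >= len(file_names):
        return list(file_names)
    return rng.sample(file_names, size)

# Placeholder names standing in for the 8000 tag files.
all_files = [f"{i}.txt" for i in range(8000)]
subset = sample_documents(all_files, size=500)
print(len(subset))
```

If the subset clusters in a reasonable time, the sample size can be increased step by step before running on all files.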
The output still has not appeared. Is that possible?
Based on it being 2% done, you'll have to wait about 98 days for it to finish.
You must have a very, very wide data set. Did you try the sampling as I proposed? You might have to do some heavy pruning of your text files too.
Not really, I'm a beginner at this. Would you mind explaining the steps for using the samples?
Thanks in advance.
OK, here are a few things you should try to make this more manageable. For testing purposes, put a Sample operator right after the EXA port of the Process Documents from Files operator. The default sample size is 100 rows. Use that for the time being.
Next, make sure you toggle on pruning in the Process Documents from Files operator. I typically use the percentual method with the default values of 3% and 30%. This should take a lot of junk out of the text documents. I would even go further and use a Filter Tokens operator inside the Process Documents operator.
Start small and work up from there.
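To illustrate what percentual pruning does to the width of the data set, here is a minimal Python sketch that drops tokens occurring in too few or too many documents. It is not RapidMiner's implementation: the function name `prune_vocabulary` is hypothetical, treating the bounds as inclusive is an assumption, and the toy example uses 30% and 70% instead of 3% and 30% because it only has five documents.

```python
from collections import Counter

def prune_vocabulary(token_lists, below_pct=3.0, above_pct=30.0):
    """Keep tokens whose document frequency (in percent) lies within the bounds."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    kept = set()
    for token, df in doc_freq.items():
        pct = 100.0 * df / n_docs
        if below_pct <= pct <= above_pct:  # inclusive bounds are an assumption
            kept.add(token)
    return kept

# Toy tag lists standing in for tokenized documents.
docs = [
    ["beach", "sunset", "photo"],
    ["beach", "sea", "photo"],
    ["city", "night", "photo"],
    ["city", "street", "photo"],
    ["beach", "city", "photo"],
]
vocab = {token for doc in docs for token in doc}
kept = prune_vocabulary(docs, below_pct=30.0, above_pct=70.0)
print(len(vocab), "columns before pruning,", len(kept), "after")
```

Each remaining token becomes one column of the word vector, so the fewer tokens survive pruning, the less work k-means has to do per document.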
For the first step, should it be like this?
Pruning is toggled on in the Process Documents from Files operator. There is a parameter called "prune method"; set it to "percentual".
You should confirm how wide your data set gets after text processing. This is likely the problem.
Hello, is there an operator I can use first to remove every word that is not English? Many of the tags in my text files are in different languages, so this could help reduce the size of the files.
Regards.
Are the files you load into the Process Documents from Files operator a mix of English and non-English? If so, just separate out the non-English ones and run again. Unless there is some metadata in your texts that can be extracted to give you something like "lang = en", there is no easy way I know of to do it.
Some possible workarounds are maybe using the NameSor extension or even the Rosette extension, there might be some auto-language support in them.
Yes, my files contain mixed languages, but I have now used a script to remove the non-English words. I am now concerned about something else: after I run k-means in RapidMiner and get the output, can I use the source code of the output to transform it into a specific format in a text file, like this:
# 0
@ 192 100886.txt
@ 814 1034.txt
@ 988 1042.txt
@ 1854 107663.txt
@ 1961 1081.txt
@ 2011 1084.txt
@ 2082 1086.txt
@ 2188 1090.txt
# 0
@ 192 100886.txt
@ 814 1034.txt
@ 988 1042.txt
.........
and so on, where # refers to the cluster number and @ refers to the text file.
Regards.
Hello, after running the script that removes all non-English words, numbers, and punctuation, the folder is much smaller, maybe half its previous size. The folder contains 8000 text files. Which operators are enough to run k-means clustering over these files? I think I have to use the "Process Documents from Files" operator (with the "Tokenize" and "Transform Cases" operators inside it) and the k-Means operator.
Waiting for your response. Regards.
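The cleaning script itself was not posted in this thread, so the following is only a guess at the kind of preprocessing described: lowercase the text, split it into runs of letters (which discards numbers and punctuation), and keep only tokens made of ASCII letters. The function name `keep_ascii_words` is hypothetical. Note the limitation: this removes words in non-Latin scripts, but a non-English word written in plain ASCII letters still passes.

```python
import re

# Runs of Unicode letters: word characters excluding digits and underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")

def keep_ascii_words(text):
    """Lowercase, tokenize on non-letters, and keep tokens made only of ASCII letters."""
    tokens = LETTER_RUN.findall(text.lower())
    return [token for token in tokens if token.isascii()]

cleaned = keep_ascii_words("Sunset 2015 playa, BEACH пляж sea!")
print(cleaned)  # the Spanish word "playa" survives, which shows the limitation
```

The lowercasing and tokenizing steps play roughly the same role as the Transform Cases and Tokenize operators mentioned above.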
You can use the Extract Cluster Prototypes operator to convert the results and save them as an ExampleSet.
I don't understand your question. It sounds like you have a process that will text process your data and then cluster it afterwards.
Regards.
You can use Write CSV (you can configure it to write a .txt file) or Write File.
When I double-click an object displayed in the Folder View of the cluster model, I get: "No visualization available for an object with id 1,020,793!". The result is the same for every item in the Folder View. How can I solve this, please?
Regards
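If the clustered ExampleSet is exported with Write CSV, a short script can turn it into the "# cluster / @ id file" listing requested earlier in the thread. This is a minimal sketch under several assumptions: the column names `id`, `metadata_file`, and `cluster`, the `cluster_0` label style, and the `;` separator may differ in your export, so check the header of your own file. The sample rows reuse file names from the thread, but their cluster assignments are invented for illustration.

```python
import csv
import io
from collections import defaultdict

def to_cluster_listing(csv_text, cluster_col="cluster", id_col="id",
                       file_col="metadata_file", delimiter=";"):
    """Group rows by cluster and render '# <cluster>' / '@ <id> <file>' lines."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        # Assumed label style "cluster_0": keep the part after the last underscore.
        cluster = row[cluster_col].rsplit("_", 1)[-1]
        groups[cluster].append((int(row[id_col]), row[file_col]))

    lines = []
    for cluster in sorted(groups, key=int):
        lines.append(f"# {cluster}")
        for row_id, name in sorted(groups[cluster]):
            lines.append(f"@ {row_id} {name}")
    return "\n".join(lines)

# Illustrative input only; column names and cluster assignments are assumptions.
sample = """id;metadata_file;cluster
814;1034.txt;cluster_0
192;100886.txt;cluster_0
988;1042.txt;cluster_1
"""
listing = to_cluster_listing(sample)
print(listing)
```

For a real export, read the file contents with `open(path, newline="").read()` and pass them in, then write the returned string to the target text file.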
That just means the data it's trying to visualize doesn't lend itself to visualization. Do you need this for a reason?