RapidMiner

RM Certified Expert

Re: K-means clustering over 8000 text file

OK, here are a few things to try to make this more manageable. For testing purposes, put a Sample operator right after the exa (example set) output port of the Process Documents from Files operator. Its default sample size is 100 rows; use that for the time being.
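Outside RapidMiner, the same "test on a small sample first" idea can be sketched in Python. This is only an illustration and not part of the RapidMiner process; the function name `sample_files` and the fixed seed are assumptions made for the example.

```python
import random

def sample_files(paths, n=100, seed=0):
    """Return up to n randomly chosen paths, reproducibly for a given seed."""
    paths = list(paths)
    if len(paths) <= n:
        # Nothing to sample: the collection is already small enough.
        return paths
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return rng.sample(paths, n)
```

Running the rest of a pipeline on `sample_files(all_paths)` first gives quick feedback before committing to all 8000 files.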

 

Next, make sure you toggle on pruning in the Process Documents from Files operator. I typically use the percentual method with the default values of 3% and 30%, which drops tokens that occur in fewer than 3% or more than 30% of the documents. This should take a lot of junk out of the text documents. I would even go further and use a Filter Tokens operator inside the Process Documents operator.

 

Start small and work up from there. 
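To make the effect of percentual pruning concrete, here is a minimal Python sketch of the idea: a token is kept only if the share of documents containing it lies between the two thresholds. It is not RapidMiner's implementation; the function name `prune_percentual` and the inclusive treatment of the boundaries are assumptions.

```python
from collections import Counter

def prune_percentual(docs, below_pct=3.0, above_pct=30.0):
    """Keep tokens whose document frequency lies within [below_pct, above_pct] percent.

    docs is a list of token lists, one per document.
    Returns the pruned documents and the sorted surviving vocabulary.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {
        token
        for token, df in doc_freq.items()
        # Assumption: both boundaries are inclusive.
        if below_pct <= 100.0 * df / n_docs <= above_pct
    }
    pruned = [[t for t in doc if t in keep] for doc in docs]
    return pruned, sorted(keep)
```

The length of the returned vocabulary is the number of attribute columns the clustering step would have to handle, so comparing it before and after pruning shows how much narrower the data set becomes.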

Contributor II eman_alahmadi

Re: K-means clustering over 8000 text file

For the first step, should it look like this?

rapid.png



Thomas_Ott wrote:

 

 

Next, make sure you toggle on pruning in the Process Documents from Files operator. I typically use the percentual one with the default values of 3% and 30%. This should take a lot of junk out of the text documents. I would even go further and use a Filter Tokens operator inside the Process Documents operator.

 

Start small and work up from there. 


What do you mean by the line marked in red? How can I toggle on pruning?
Thank you for your help. Regards.
RM Certified Expert

Re: K-means clustering over 8000 text file

Pruning is toggled on in the Process Documents from Files operator. There is a parameter called "prune method". Enable it and select "percentual".

 

You should confirm how wide your data set gets (that is, how many attribute columns it has) after text processing. This is likely the problem.

Contributor II eman_alahmadi

Re: K-means clustering over 8000 text file

Hello, can I use an operator at the start to remove any non-English words? My text files contain a lot of tags in different languages, so removing them could help reduce the size of the files.

 

Regards.

RM Certified Expert

Re: K-means clustering over 8000 text file

Are the files you load into the Process Documents from Files operator a mix of English and non-English? If so, just separate out the non-English ones and run again. Unless there is some metadata in your texts that can be extracted and that gives you something like "lang = en", there is no easy way I know of to do it.

 

Some possible workarounds are the NameSor extension or even the Rosette extension; there might be some automatic language support in them.

Contributor II eman_alahmadi

Re: K-means clustering over 8000 text file

Yes, my files contain mixed languages, but I have now used a script to remove the non-English words. I have another question: after running k-means in RapidMiner and getting the output, can I use the source code of the output to transform it into a specific format in a text file, like this:

# 0
@ 192 100886.txt
@ 814 1034.txt
@ 988 1042.txt
@ 1854 107663.txt
@ 1961 1081.txt
@ 2011 1084.txt
@ 2082 1086.txt
@ 2188 1090.txt
# 0
@ 192 100886.txt
@ 814 1034.txt
@ 988 1042.txt
.........

and so on, where # marks the cluster number and @ marks a text file.

 

 

Regards.
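One way to produce a file in that shape is to export the cluster assignments from RapidMiner and reformat them with a short script. The sketch below assumes the assignments are already available as `(doc_id, file_name, cluster)` tuples. The thread does not say what the number after `@` means, so treating it as a document ID is an assumption, and the function name `format_clusters` is hypothetical.

```python
from collections import defaultdict

def format_clusters(rows):
    """Render (doc_id, file_name, cluster) tuples in the '# cluster' / '@ id file' layout."""
    by_cluster = defaultdict(list)
    for doc_id, file_name, cluster in rows:
        by_cluster[cluster].append((doc_id, file_name))

    lines = []
    for cluster in sorted(by_cluster):
        lines.append(f"# {cluster}")  # cluster header line
        # Assumption: members are listed in ascending order of document ID.
        for doc_id, file_name in sorted(by_cluster[cluster]):
            lines.append(f"@ {doc_id} {file_name}")
    return "\n".join(lines) + "\n"
```

The returned string can then be written to disk, for example with `pathlib.Path("clusters.txt").write_text(format_clusters(rows))`, where `clusters.txt` is just a placeholder name.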

Contributor II eman_alahmadi

Re: K-means clustering over 8000 text file

Hello, after running the script that removes all non-English words, numbers, and punctuation, I have a folder that is much smaller, maybe half the previous size. The folder contains 8000 text files. Which operators are enough to run k-means clustering over these files? I think I have to use the "Process Documents from Files" operator (with the "Tokenize" and "Transform Cases" operators dragged inside it) and the k-Means operator.

 

I look forward to your response. Regards.
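For intuition about what the tokenizing and case-transforming steps hand to k-means, here is a rough Python approximation that turns raw texts into a table of term counts. It is not how RapidMiner works internally: splitting on anything other than ASCII letters and using raw counts as the weighting are both assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize_lowercase(text):
    """Split on runs of non-letters, drop empty pieces, and lowercase each token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def term_frequencies(docs):
    """docs maps file name -> raw text.

    Returns (vocab, rows), where rows maps each file name to a list of
    counts aligned with the sorted vocabulary.
    """
    tokens = {name: tokenize_lowercase(text) for name, text in docs.items()}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    rows = {}
    for name, toks in tokens.items():
        counts = Counter(toks)  # a missing token counts as 0
        rows[name] = [counts[t] for t in vocab]
    return vocab, rows
```

Each row is one document and each vocabulary entry is one column, which is why a large vocabulary makes the clustering step slow.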

RM Certified Expert

Re: K-means clustering over 8000 text file

You can use the Extract Cluster Prototypes operator to convert the results and save them as an example set.

RM Certified Expert

Re: K-means clustering over 8000 text file

I don't understand your question. It sounds like you already have a process that text-processes your data and then clusters it.

Contributor II eman_alahmadi

Re: K-means clustering over 8000 text file

I mean, can I use the source code of the output and save the result to text files in the format I want? Regards.