"Using Text Processing components from custom Java code"

jbartot · September 2010

Hi,

I have been successfully experimenting with the Text Processing components in the RapidMiner 5.0 IDE. I would like to use these components from my custom Java code (because I need to add custom features to the ExampleSet). I downloaded the Text Processing javadocs and jar file. Is there any example code I can reference to get started?

Thanks

Jay

land · September 2010

Hi,
Unfortunately not. What are you trying to do? If you describe it in more detail, I might be able to give you some hints.

Greetings,
Sebastian

jbartot · September 2010

Below is the XML from a pure RM text-mining process I've had good success with. Now I'd like to add my own computationally derived features to the base set computed by RM and export the feature sets so they can be used in other machine learning packages.

I tracked down the XML file that maps operator keys like "process_document_from_file" to classes (com.rapidminer.operator.text.io.FileDocumentInputOperator in this case) and then looked at the javadocs to see if I could stitch together the same process in Java code. Unfortunately, it is not clear to me how to construct a FileDocumentInputOperator (or an OperatorDescription). Further, what is the best strategy for reproducing processes in code (e.g. is there a way to leverage the XML)?

I bought the "How to extend Rapid Miner 5.0" paper, but it wasn't much help.

Thanks

Jay

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
<description>Reads collections of text from a set of directories, assigning each directory to a class (as specified by parameter text_directories), and transforms them into a TF-IDF or other word vector. Finally, an SVM is applied to model the input texts.</description>
<process expanded="true" height="287" width="413">
<operator activated="true" class="text:process_document_from_file" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="AIM" value="/Users/jb/Documents/workspace/MedlineDB/training_dir_subset/AIM"/>
<parameter key="CONCLUSION" value="/Users/jb/Documents/workspace/MedlineDB/training_dir_subset/CONCLUSION"/>
<parameter key="METHOD" value="/Users/jb/Documents/workspace/MedlineDB/training_dir_subset/METHOD"/>
<parameter key="RESULTS" value="/Users/jb/Documents/workspace/MedlineDB/training_dir_subset/RESULTS"/>
</list>
<process expanded="true" height="502" width="681">
<operator activated="true" class="text:transform_cases" compatibility="5.0.0" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
<operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.0.0" expanded="true" height="60" name="Stem (Snowball)" width="90" x="313" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.0.7" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="581" y="30"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.0" expanded="true" height="76" name="SVM" width="90" x="179" y="30">
<parameter key="kernel_type" value="linear"/>
<list key="class_weights"/>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="SVM" to_port="training set"/>
<connect from_op="SVM" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

land · September 2010

Hi,
why don't you just execute RapidMiner from the command line? This is probably the cheapest way to handle this, provided the process exports the data as CSV at the end.
If you are working at a company with a more advanced and complex IT infrastructure, you could use RapidAnalytics to expose the results of your process as a web service. This makes integration into existing frameworks very easy.
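
As a rough sketch of the CSV route: once the process has written its example set to a CSV file, plain Java can append extra feature columns without touching the RapidMiner API at all. Everything named below is an assumption made for illustration: the class CsvFeatureAppender, the ";" separator, the column layout (label first, term weights after), and the nonzero_terms feature are hypothetical and should be adapted to whatever the process actually exports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical helper: appends one derived column to CSV rows exported by a process.
final class CsvFeatureAppender {

    // The first line is treated as the header; every other line is a data row.
    static List<String> appendFeature(List<String> lines, String separator,
                                      String featureName,
                                      Function<String[], String> feature) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        out.add(lines.get(0) + separator + featureName);
        for (String row : lines.subList(1, lines.size())) {
            // Naive split: does not handle quoted fields that contain the separator.
            String[] cells = row.split(Pattern.quote(separator), -1);
            out.add(row + separator + feature.apply(cells));
        }
        return out;
    }

    // Example feature (an assumption): number of non-zero weights after the label column.
    static String nonZeroCount(String[] cells) {
        int count = 0;
        for (int i = 1; i < cells.length; i++) {
            if (Double.parseDouble(cells[i]) != 0.0) {
                count++;
            }
        }
        return Integer.toString(count);
    }

    public static void main(String[] args) {
        // Stand-in for the lines of an exported file; real data would be read from disk.
        List<String> exported = Arrays.asList(
                "label;aim;method",
                "AIM;0.7;0.0",
                "METHOD;0.1;0.9");
        for (String line : appendFeature(exported, ";", "nonzero_terms",
                                         CsvFeatureAppender::nonZeroCount)) {
            System.out.println(line);
        }
    }
}
```

For a real export, the lines could be read with java.nio.file.Files.readAllLines and written back with Files.write. If the exported values are quoted, a proper CSV parser would be a safer choice than the naive split used here.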

Greetings,
Sebastian
