Text Classification using Text Plugin - StringTextInput

This post refers to http://rapid-i.com/rapidforum/index.php/topic,368.0.html. It addresses the problems I ran into with the StringTextInput operator.

First, I load the texts and their labels from a MySQL database using DatabaseExampleSource. I'd like to save the resulting example set to disk before continuing.

Problem 1: The examples have a string attribute (the text to classify) that usually contains newlines. When I write the example set to disk using ExampleSetWriter, a single example is split across several lines, so ExampleSource cannot read it back (it expects one example per line). What can I do?
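
One workaround I am considering for Problem 1 is to escape the line breaks in the text before it is written, and to reverse the escaping after reading. The following is only a sketch of that idea in plain Python, not a RapidMiner operator; the function names escape_field and unescape_field are my own and do not exist in RapidMiner or the Text Plugin.

```python
def escape_field(text: str) -> str:
    """Return text as a single physical line.

    Backslashes are escaped first so that a literal backslash followed
    by 'n' in the original text is not confused with an escaped newline.
    """
    return (
        text.replace("\\", "\\\\")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


def unescape_field(text: str) -> str:
    """Reverse escape_field by scanning left to right."""
    replacements = {"n": "\n", "r": "\r", "\\": "\\"}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Unknown escape sequences are kept unchanged.
            out.append(replacements.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With something like this applied to the string attribute, every example would occupy exactly one line in the written file, at the cost of an extra preprocessing step before and after storage.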

After loading the data from the database I use StringTextInput.

Problem 2: StringTextInput logs a warning for every example in the example set: it prints the content of the string attribute (i.e. the text to classify) followed by "not found. Assuming the text is directly encoded as document source..." I think this means the string attribute is first interpreted as a filename and only used directly as text when no such file exists. Because the log was flooded with these warnings, I had to suppress warning output completely. Is there a better solution?
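
To make clear what I suspect is happening in Problem 2, here is a small Python sketch of a "try as file, otherwise treat as text" fallback. This is purely my assumption about the behaviour, based on the warning message; it is not taken from the Text Plugin source, and resolve_document_source is a hypothetical name.

```python
import os
from typing import Tuple


def resolve_document_source(value: str) -> Tuple[str, str]:
    """Assumed fallback: treat value as a file path if such a file exists,
    otherwise treat value itself as the document text.

    Returns a (kind, text) pair where kind is "file" or "inline".
    """
    if os.path.isfile(value):
        with open(value, encoding="utf-8") as handle:
            return "file", handle.read()
    # No such file: use the attribute value directly as the document.
    return "inline", value
```

If the operator really works like this, each article text fails the file lookup first, which would explain one warning per example.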

Question 3: What does the parameter "prune above" of the operator StringTextInput do when I enter a percentage value? I didn't understand the explanation in the operator description.

Next I need to create a wordlist. Since the database contains a lot of articles I do not want to load them all at once into memory.

Problem 4: How can I modify StringTextInput so that I can load an existing wordlist and update it with new words? I looked for the relevant part of the source code and noticed that wordlist creation is handled by the WVTool Java library, not by the Text Plugin itself. However, the Text Plugin seems to ship a newer version of WVTool (as a .jar) than the one I can get from SourceForge. Where can I get the source code of the newest version of WVTool?
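
To illustrate what I mean by updating a wordlist in Problem 4, here is a minimal Python sketch that processes the articles in batches and keeps only the wordlist in memory. It does not use WVTool or the Text Plugin; the tokenization rule, the JSON file format, and all function names are my own assumptions for illustration.

```python
import json
import re
from collections import Counter
from typing import Iterable

# Simplistic tokenizer (an assumption): lowercase runs of letters and digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def update_wordlist(wordlist: Counter, batch: Iterable[str]) -> Counter:
    """Add the document frequencies of one batch of texts to wordlist."""
    for text in batch:
        # A set counts each word at most once per document.
        wordlist.update(set(TOKEN_PATTERN.findall(text.lower())))
    return wordlist


def save_wordlist(wordlist: Counter, path: str) -> None:
    """Store the wordlist as a JSON object mapping word to count."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(wordlist, handle)


def load_wordlist(path: str) -> Counter:
    """Load a wordlist previously written by save_wordlist."""
    with open(path, encoding="utf-8") as handle:
        return Counter(json.load(handle))
```

The idea is that each batch fetched from the database is passed to update_wordlist once, and the wordlist is saved between runs, so the full set of articles never has to be in memory at the same time.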