Search Keywords from a file

7amritaarora7 · September 2016

Hi

I'm working on a project in which I have to search for a predetermined set of keywords. The list of keywords is updated regularly and is saved as a column in a database. I can search for each keyword individually using regex, but is there a way to search for all the keywords in the file at once?

Thanks in advance

Amrita

MartinLiebig · September 2016

Quick way, misusing operators:

Take your list of regexes, aggregate them into one using concat, and apply them all at once?

Or use the Loop Values operator.

~Martin
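
For readers who want to see the "concat" idea outside RapidMiner: it amounts to joining all keywords into a single alternation pattern. The following is only a minimal Java sketch under that assumption. The class and method names are made up for this example and are not RapidMiner operators; the whole-word matching via `\b` is a choice made here, not something stated in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not a RapidMiner operator or API.
final class CombinedKeywordPattern {

    // Joins the keywords into one pattern of the form \b(?:k1|k2|...)\b.
    // Pattern.quote keeps each keyword literal even if it contains regex characters.
    static Pattern build(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Returns each keyword hit, in the order it appears in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern pattern = build(Arrays.asList("RapidMiner", "Hadoop", "Spark"));
        System.out.println(findAll(pattern, "Spark and Hadoop jobs, then more Spark."));
    }
}
```

Because each hit is returned as its own string, the combined pattern still tells you which keyword matched, not just that something matched.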

sgenzer · September 2016

Use "Filter Stopwords (Dictionary)" in the Text Processing extension? You may need to invert it depending on what you want to do.

7amritaarora7 · September 2016

Hi @mschmitz @sgenzer

Thanks for your replies.

I tried your solutions; this is the status now:
1. Concat

It shows the result of all attributes together, but I need to know exactly which keyword was found and its frequency, perhaps in a separate column. Any help there?
+ This is only a temporary solution, because the keywords are updated dynamically. So, any idea how to search for all the keywords stored in a file?

2. Filter Stopwords (Dictionary)

This is the first solution I thought of as well, but this operator has no invert-selection option. So, is there any other solution?

3. I'm trying the Loop Values operator, but I need further help with it.

Thanks in advance

Amrita

sgenzer · September 2016

hmm...looking for an elegant solution. I know this sounds weird, but maybe try this:

- take your text and Split (by space or whatever will split up). This will create a ton of attributes.

- transpose this mess so that your text is listed word by word in one attribute and a ton of examples...sort of like this:

I

am

Scott

and

I

like

RapidMiner

- Do a Join (inner) with your keyword database list to see overlap

- Aggregate if desired to see frequencies

I do this more and more - create master "lookup" data sets and then join. It's quite versatile.

Scott
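
As a rough illustration of the split / join / aggregate steps above, here is a minimal Java sketch. It is not a RapidMiner process: the class name and the in-memory keyword set are assumptions standing in for the keyword table, and splitting on whitespace is just one possible choice of separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the Split -> Transpose -> Join -> Aggregate chain.
final class LookupJoinSketch {

    // Splits the text into words, keeps only words found in the lookup set,
    // and counts how often each of them occurs.
    static Map<String, Integer> keywordFrequencies(String text, Set<String> keywords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split("\\s+")) {       // one word per "example"
            if (keywords.contains(word)) {              // inner join with the keyword list
                counts.merge(word, 1, Integer::sum);    // aggregate: count per keyword
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("I", "RapidMiner"));
        System.out.println(keywordFrequencies("I am Scott and I like RapidMiner", keywords));
    }
}
```

Note that this only finds keywords that appear as whole words after splitting; punctuation attached to a word would prevent a match unless it is stripped first.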

7amritaarora7 · September 2016

Hi @sgenzer

Thanks for your help and sorry for my late reply.
When I received your reply, I was already working along the same lines, and with your help and some trial and error, part of my process is done.
Now, the last part I am stuck on is searching for the contents of one attribute (i.e. Keyword) in another attribute (i.e. Text). I tried the Generate Attributes and Filter Examples operators but didn't get the required results. Regex functions based on matches and contains are not working either, since they search for a particular word in one attribute, not for the value of one attribute inside another attribute.

Any help here would be greatly appreciated!

Thanks in advance
Amrita

Thomas_Ott · October 2016

My initial thought to attacking this problem is this:

1. Use Read Database for the keywords and then an Extract Macro set to Data Value. Give this macro the name "keyword_macro"; it will extract the keyword and associate it with the macro.

2. Use a Loop with a Filter Examples embedded inside. Loop over the keywords and drop the macro value into the Filter Examples. Use the "contains" filter and set it to %{keyword_macro}.

3. Then, outside the loop, use an Append operator to append all the matching results.
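
To make the three steps concrete, here is a minimal Java sketch of the same control flow, assuming the keywords and texts are already in memory. The class and method names are invented for this illustration; in RapidMiner the loop, the "contains" filter and the append would be the operators named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of Loop -> Filter Examples (contains) -> Append.
final class LoopFilterAppendSketch {

    // Produces one {keyword, text} row for every text that contains the keyword.
    static List<String[]> matchRows(List<String> keywords, List<String> texts) {
        List<String[]> appended = new ArrayList<>();
        for (String keyword : keywords) {                        // loop over the keywords
            for (String text : texts) {
                if (text.contains(keyword)) {                    // "contains" filter
                    appended.add(new String[] {keyword, text});  // append the match
                }
            }
        }
        return appended;
    }

    public static void main(String[] args) {
        List<String[]> rows = matchRows(
                Arrays.asList("Hadoop", "Spark"),
                Arrays.asList("Spark on Hadoop", "Just Spark", "Nothing here"));
        for (String[] row : rows) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

Keeping the keyword next to each matching text is what later makes it possible to report which keyword was found, rather than only the matching rows.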

Thomas_Ott · October 2016

Ok, try this. Just create a text file with a single column (see below), then import it and save it in your repository.

Keywords

RapidMiner

Hadoop

Spark

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Keywords" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Personal/Community Answers/data/Keywords"/>
      </operator>
      <operator activated="true" class="rename" compatibility="7.2.002" expanded="true" height="82" name="Rename" width="90" x="179" y="34">
        <parameter key="old_name" value="att1"/>
        <parameter key="new_name" value="Keywords"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="34">
        <parameter key="macro" value="count_examples"/>
        <parameter key="attribute_name" value="Keywords"/>
        <parameter key="example_index" value="%{iteration}"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="loop" compatibility="7.2.002" expanded="true" height="82" name="Loop" width="90" x="447" y="34">
        <parameter key="set_iteration_macro" value="true"/>
        <parameter key="iterations" value="%{count_examples}"/>
        <process expanded="true">
          <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro (2)" width="90" x="112" y="34">
            <parameter key="macro" value="keywords"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="Keywords"/>
            <parameter key="example_index" value="%{iteration}"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="246" y="34">
            <parameter key="connection" value="Twitter Connection"/>
            <parameter key="query" value="RapidMiner"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.2.002" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Text.contains.%{keywords}"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="7.2.002" expanded="true" height="82" name="Append" width="90" x="581" y="34"/>
      <connect from_op="Retrieve Keywords" from_port="output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
      <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Update: Just swap out the Search Twitter operator for the data store of the strings you want to search.

7amritaarora7 · October 2016

Hi @Thomas_Ott

I tried this with my database, but I'm not getting the required results: I'm getting parts of the text as the result. What I need is which keyword appears in the text and how many times.

The other way that I think will work is to use the Loop Values operator to extract all values of the keyword attribute from the database, and then, inside Loop Values, add a Generate Attributes operator with a regex that searches for the keyword macro within the text attribute. Is this the correct way? If yes, I need some help with the regex.

Or is there any other way?

Thanks in advance

Regards

Amrita

Thomas_Ott · October 2016

Hi @7amritaarora7,

I did something like this for a customer once but I can't seem to find the process now. I think with the attached process it's getting close, just have to figure out how to properly select the columns.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Keywords" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Personal/Community Answers/data/Keywords"/>
      </operator>
      <operator activated="true" class="rename" compatibility="7.2.002" expanded="true" height="82" name="Rename" width="90" x="179" y="34">
        <parameter key="old_name" value="att1"/>
        <parameter key="new_name" value="Keywords"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.2.002" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
        <list key="function_descriptions">
          <parameter key="Keywords" value="lower(Keywords)"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro" width="90" x="447" y="34">
        <parameter key="macro" value="count_examples"/>
        <parameter key="attribute_name" value="Keywords"/>
        <parameter key="example_index" value="%{iteration}"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="loop" compatibility="7.2.002" expanded="true" height="82" name="Loop" width="90" x="581" y="34">
        <parameter key="set_iteration_macro" value="true"/>
        <parameter key="iterations" value="%{count_examples}"/>
        <process expanded="true">
          <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro (2)" width="90" x="112" y="34">
            <parameter key="macro" value="keywords"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="Keywords"/>
            <parameter key="example_index" value="%{iteration}"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="246" y="34">
            <parameter key="connection" value="Twitter Connection"/>
            <parameter key="query" value="RapidMiner"/>
            <parameter key="limit" value="1000"/>
            <parameter key="language" value="en"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.2.002" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Text.contains.%{keywords}"/>
            </list>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.2.002" expanded="true" height="82" name="Select Attributes" width="90" x="514" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Text"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.2.002" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="34">
            <list key="function_descriptions">
              <parameter key="Keyword" value="%{keywords}"/>
            </list>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="7.2.002" expanded="true" height="82" name="Nominal to Text" width="90" x="782" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="916" y="34">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
              <operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" compatibility="7.2.002" expanded="true" height="82" name="Set Role" width="90" x="1050" y="34">
            <parameter key="attribute_name" value="Keyword"/>
            <parameter key="target_role" value="id_1"/>
            <list key="set_additional_roles"/>
          </operator>
          <connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="false" class="append" compatibility="7.2.002" expanded="true" height="68" name="Append" width="90" x="715" y="238"/>
      <connect from_op="Retrieve Keywords" from_port="output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
      <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

In this particular process I added the Process Documents from Data operator and set it to Term Occurrences. It definitely gets all the occurrences for RapidMiner/Hadoop/Spark, but a problem arises when I have the term "stratahadoop." Maybe you can take it from here and experiment; I'm tied up for the rest of the week.
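
One plausible reading of the "stratahadoop" problem is the difference between substring matching (what a "contains" filter does) and whole-word counting (what tokenized term occurrences give you). The Java sketch below only illustrates that difference; it is an assumption about the cause, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of substring hits versus whole-word hits.
final class WholeWordVsSubstring {

    // Counts non-overlapping occurrences of the keyword anywhere in the text,
    // including inside longer words such as "stratahadoop".
    static int countSubstring(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // avoid an endless loop on an empty keyword
        }
        int count = 0;
        int from = text.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = text.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    // Counts occurrences of the keyword only where it stands as a separate word.
    static int countWholeWord(String text, String keyword) {
        Matcher matcher = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b").matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "hadoop at stratahadoop with hadoop";
        System.out.println("substring:  " + countSubstring(text, "hadoop"));
        System.out.println("whole word: " + countWholeWord(text, "hadoop"));
    }
}
```

Which of the two counts is "right" depends on whether a hit inside a longer token should count as the keyword.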

Update: Do a search on the forum for this. http://community.rapidminer.com/t5/RapidMiner-Studio/SOLVED-Simple-word-count-of-wordlist-from-document/m-p/25054#M18530

7amritaarora7 · October 2016

Hi @Thomas_Ott

Thanks a lot for this process. This seems really close to the results that I want. I'll work on it from here and let you know, when the perfect solution is found.

Regards

Amrita

7amritaarora7 · October 2016

@Thomas_Ott, @mschmitz & @sgenzer

Hi

Good news! I was able to get the keyword search process working by regularly updating the database using the Execute SQL operator and then going through all keywords via the Loop Values operator.

Some sad news! As I am using an older version of RapidMiner, the process suggested by @Thomas_Ott for the frequency count of keywords does not work. So I'm trying to write a script for finding the keyword frequency and run it with the Execute Script operator. The issue is that there's an error regarding the output transfer: Execute Script is unable to show output when connected to the results.

Is this the right way, or is there any other way?

P.S.: I'm using RapidMiner version 5.3.

Thanks in advance

Regards

Amrita

MartinLiebig · October 2016

Dear Amrita,

If you can provide an example process, I would be happy to have a look at your Execute Script.

~Martin

7amritaarora7 · October 2016

Hi Martin

I'm attaching here the process that I'm testing using execute script:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="execute_script" compatibility="5.3.015" expanded="true" height="76" name="Execute Script" width="90" x="45" y="75">
        <parameter key="script" value="    import java.io.*;&#10;     &#10;      BufferedReader br=new BufferedReader(new InputStreamReader(System.in));&#10;     //   System.out.println(&quot;Enter the String: &quot;);&#10;        String s= &quot;where are we going. who are you. we we we &quot;;&#10;       // System.out.println(&quot;Enter substring: &quot;);&#10;        String KeyWords=&quot;here, are, we&quot;;&#10;        &#10;        String[] splitString=KeyWords.split(&quot;,&quot;);&#10;        int ind,count=0;&#10;        for(String subString :splitString)&#10;        {&#10;        &#9;   for(int i=0; i+subString.length()&lt;=s.length(); i++) &#10;               {&#10;        &#9;&#9;   &#10;                   ind=s.indexOf(subString,i);&#10;                   if(ind&gt;=0)&#10;                   {&#10;                       count++;&#10;                       i=ind;&#10;                       ind=-1;&#10;                   }&#10;               }&#10;               &#10;        &#9;//  System.out.println(&quot;Occurence of '&quot;+subString+&quot;' in String is &quot;+count);&#10;        &#9; return &quot;Occurence of '&quot;+subString+&quot;' in String is &quot;+count&#10;        }&#10;     return"/>
      </operator>
      <connect from_port="input 1" to_op="Execute Script" to_port="input 1"/>
      <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Please have a look at it. When I run this process, this is the error I'm getting in log:

WARNING: Unknown result: class java.lang.String: Occurence of 'here' in String is 1

Regards

Amrita

Thomas_Ott · October 2016

Hi Amrita,

Couldn't you just upgrade to version 7.2? Version 5.3 is really, really old and you miss out on A LOT of great new performance/operator/extension enhancements.

7amritaarora7 · October 2016

Hi @Thomas_Ott

Yes, 7.2 is a much better version and I use that too. But for the server, I have complete access to version 5.3. The newest server has some limitations unless it's purchased. So, until my company purchases it, I have to use the older version.

If there's a way to connect RapidMiner 7.2 to the RapidAnalytics server, do let me know.

Regards

Amrita

Thomas_Ott · October 2016

Hi Amrita,

Unfortunately, you can't connect Studio 5.3 to Server 7.2. We offer a FREE version of RapidMiner Server now, but it has limitations of 1000 API calls, 2 GB of memory, and 1 logical core. If your employer is using RapidAnalytics and gets value ($$$) from it, it would be great if you guys upgraded so we can continue to innovate!

7amritaarora7 · October 2016

Hi @Thomas_Ott

We have already planned to buy that within the next few months, but for now I'm looking for an interim solution.

Regards

Amrita

MartinLiebig · October 2016

Amrita,

The attached code works for me. I do not know what I really did; I added the LogService. Those messages are visible in the Log panel of RM. If this is what you would like to have, I would like to build it with two example set inputs (1. texts, 2. list of keywords) and one example set output (input + count_XXX).

Is our sales team already in contact with you? If not, please reach out to me at mschmitz at rapidminer dot com. We might simply help you get your use case implemented to get the business convinced.

~Martin

import java.io.*;
import com.rapidminer.tools.LogService;
import java.util.logging.Level

BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
// System.out.println("Enter the String: ");
String s = "where are we going. who are you. we we we ";
// System.out.println("Enter substring: ");
String KeyWords = "here, are, we";

// Note: splitting on "," keeps the space after each comma (" are", " we").
String[] splitString = KeyWords.split(",");

int ind, count = 0;
for (String subString : splitString) {
    count = 0; // reset per keyword, otherwise the counts accumulate across keywords
    LogService.root.log(Level.INFO, subString)
    for (int i = 0; i + subString.length() <= s.length(); i++) {
        // find the next occurrence at or after position i
        ind = s.indexOf(subString, i);
        //LogService.root.log(Level.INFO,ind)
        if (ind >= 0) {
            count++;
            i = ind; // the loop increment resumes the search at ind + 1
            ind = -1;
        }
    }
    // write the result to the Log panel instead of returning a String
    LogService.root.log(Level.INFO, "Occurence of " + subString + " in String is " + String.valueOf(count))
    //return "Occurence of '"+subString+"' in String is "+count
}
return
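
One detail of the script above that is easy to miss: splitting "here, are, we" on "," alone leaves a leading space on " are" and " we", so the search is for the keyword including that space. The standalone Java sketch below shows the effect on a different sample text chosen to make it visible. The class and method names are hypothetical, and `splitTrimmed` is a suggested alternative, not something used in the thread.

```java
// Hypothetical sketch: effect of whitespace left over after splitting on ",".
final class KeywordSplitSketch {

    // Same as the script: a plain split on "," keeps the space after each comma.
    static String[] splitRaw(String keywordList) {
        return keywordList.split(",");
    }

    // Alternative: treat whitespace around each comma as part of the separator.
    static String[] splitTrimmed(String keywordList) {
        return keywordList.trim().split("\\s*,\\s*");
    }

    // Substring count in the style of the script's loop: after a hit at
    // index ind, the search resumes at ind + 1.
    static int count(String text, String keyword) {
        int count = 0;
        for (int i = 0; i + keyword.length() <= text.length(); i++) {
            int ind = text.indexOf(keyword, i);
            if (ind < 0) {
                break; // no further hits
            }
            count++;
            i = ind;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "we are where we are";
        for (String keyword : splitRaw("here, are, we")) {
            System.out.println("raw     '" + keyword + "' -> " + count(text, keyword));
        }
        for (String keyword : splitTrimmed("here, are, we")) {
            System.out.println("trimmed '" + keyword + "' -> " + count(text, keyword));
        }
    }
}
```

With the untrimmed keyword, a hit at the very start of the text is missed because there is no space in front of it.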

7amritaarora7 · October 2016

Hi Martin

Thanks a lot for this. This works great. Now, I'll take it from here and join it to the main process. A small clarification: the results that are shown in logs, can be transferred to database or any other form of output, right?
And yes, I'm in contact with your sales team.

Thanks again

Regards

Amrita

MartinLiebig · October 2016

Amrita,

Sure. I will have a look tomorrow. It takes a few minutes to get it into an RM example set.

~Martin

MartinLiebig · October 2016

Amrita,

Attached is a process with this script, which has proper input and output ports. It's not yet commented. If you need any help understanding it, just post here.

~Martin

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess" width="90" x="112" y="289">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
            <list key="attribute_values">
              <parameter key="Keywords" value="&quot;here, are, we&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="split" compatibility="7.2.003" expanded="true" height="82" name="Split" width="90" x="179" y="34"/>
          <operator activated="true" class="transpose" compatibility="7.2.003" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
          <operator activated="true" class="rename" compatibility="7.2.003" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
            <parameter key="old_name" value="att_1"/>
            <parameter key="new_name" value="Keywords"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Get the keyword list</description>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess (2)" width="90" x="112" y="34">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="34">
            <list key="attribute_values">
              <parameter key="Text" value="&quot;where are we going. who are you. we we we&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="136">
            <list key="attribute_values">
              <parameter key="Text" value="&quot;where are we going.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="103" name="Append" width="90" x="246" y="34"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Get Test Data</description>
      </operator>
      <operator activated="true" class="execute_script" compatibility="7.2.003" expanded="true" height="103" name="Execute Script" width="90" x="246" y="85">
        <parameter key="script" value="import java.io.*;&#10;import com.rapidminer.tools.LogService;&#10;import com.rapidminer.tools.Ontology;&#10;import java.util.logging.Level&#10;&#10;// Configs for the user&#10;String keywordAttributeName = &quot;Keywords&quot;&#10;String testAttributeName = &quot;Text&quot;&#10;//---------------------------------&#10;&#10;BufferedReader br=new BufferedReader(new InputStreamReader(System.in));&#10;&#10;ExampleSet inputSet = input[0];&#10;ExampleSet keywordlist = input[1];&#10;&#10;Attribute keywordAttribute = keywordlist.getAttributes().get(keywordAttributeName);&#10;&#10;Attribute textAttribute = inputSet.getAttributes().get(testAttributeName);&#10;&#10;List&lt;String&gt; splitString = new ArrayList&lt;String&gt;();&#10;// mh, if there is more than one, concat?&#10;for (Example e : keywordlist){&#10;  LogService.root.log(Level.INFO,e.getNominalValue(keywordAttribute))&#10;  splitString.add(e.getNominalValue(keywordAttribute))&#10;}&#10;&#10;ExampleTable inputTable = inputSet.getExampleTable();&#10;&#10;//String[] splitString=KeyWords.split(&quot;,&quot;);&#10;int numberOfKeywords = splitString.size();&#10;&#10;Attribute[]  outputAttributes = new Attribute[numberOfKeywords];&#10;int k = 0;&#10;for (String subString :splitString){&#10; outputAttributes[k] = AttributeFactory.createAttribute(&quot;count_&quot;+subString, Ontology.INTEGER);&#10; inputTable.addAttribute(outputAttributes[k]);&#10;inputSet.getAttributes().addRegular(outputAttributes[k]);&#10;  k++;&#10;}&#10;&#10;int ind,count=0;&#10;String s;&#10;&#10;for(Example e : inputSet){&#10;  s = e.getNominalValue(textAttribute);&#10;  k = 0;&#10;  for(String subString :splitString)&#10;      {&#10;          count=0; // everything else makes no sense&#10;          LogService.root.log(Level.INFO,subString)&#10;           for(int i=0; i+subString.length()&lt;=s.length(); i++) &#10;             {&#10;             &#10;                 ind=s.indexOf(subString,i);&#10;             
    //LogService.root.log(Level.INFO,ind)&#10;                 if(ind&gt;=0)&#10;                 {&#10;                     count++;&#10;                     i=ind;&#10;                     ind=-1;&#10;                 }&#10;             }&#10;            e.setValue(outputAttributes[k], count);&#10;            ++k;&#10;            LogService.root.log(Level.INFO,&quot;Occurence of &quot;+subString+&quot; in String is &quot;+String.valueOf(count))&#10;&#10;    }&#10;  }&#10;&#10; return inputSet"/>
      </operator>
      <connect from_op="Subprocess" from_port="out 1" to_op="Execute Script" to_port="input 2"/>
      <connect from_op="Subprocess (2)" from_port="out 1" to_op="Execute Script" to_port="input 1"/>
      <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
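
For readers without RapidMiner at hand, the shape of the result (one count_XXX column per keyword, one row per text) can be sketched in plain Java as follows. This is a simplified stand-in, not the script from the process: the class name is invented, the keywords are assumed to be already trimmed, and hits are counted without overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the example set produced by the Execute Script operator.
final class KeywordCountTable {

    // Counts non-overlapping substring occurrences of the keyword in the text.
    static int count(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // avoid an endless loop on an empty keyword
        }
        int count = 0;
        int from = text.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = text.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    // Builds one row per text; each row maps "count_<keyword>" to its count,
    // in the order the keywords were given.
    static List<Map<String, Integer>> countColumns(List<String> texts, List<String> keywords) {
        List<Map<String, Integer>> rows = new ArrayList<>();
        for (String text : texts) {
            Map<String, Integer> row = new LinkedHashMap<>();
            for (String keyword : keywords) {
                row.put("count_" + keyword, count(text, keyword));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "where are we going. who are you. we we we",
                "where are we going.");
        System.out.println(countColumns(texts, Arrays.asList("here", "are", "we")));
    }
}
```

Because this is substring counting, "here" is found inside "where"; switch to whole-word matching if that is not wanted.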

7amritaarora7 · October 2016

Hi Martin

Thanks a ton. This is exactly what I was trying to do.

Thanks again

Regards

Amrita

simon_kuehne · January 2018

Hi,

I had the same problem and your solution worked well for me! Thanks!

In my case, I don't want to find only standalone keywords, as some keywords are inside hashtags, e.g. #ILikeThatKeyword.

Is there a way to find those matches as well?

Thanks!

Simon
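
The thread has no answer to this question. One possible approach, offered only as an untested-in-RapidMiner sketch, is to compare lower-cased text and keywords so that a keyword embedded in a CamelCase hashtag is still found by substring search. The class and method names below are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: case-insensitive substring counting, so that
// "keyword" is also found inside "#ILikeThatKeyword".
final class HashtagKeywordMatch {

    // Counts non-overlapping occurrences of the keyword, ignoring case.
    static int countIgnoreCase(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // avoid an endless loop on an empty keyword
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String tweet = "Great talk #ILikeThatKeyword, one more keyword";
        System.out.println(countIgnoreCase(tweet, "Keyword"));
    }
}
```

The trade-off is the same as with any substring search: the keyword will also be counted inside unrelated longer words.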
