Search Keywords from a file

7amritaarora7 Member Posts: 25 Contributor I
edited November 2018 in Help

Hi

 

I'm working on a project in which I have to search text for a predetermined set of keywords. The list of keywords is updated regularly and is stored as a column in a database. I can search for each keyword individually using a regex, but is there a way to search for all the keywords in the file at once?

 

Thanks in advance

Amrita

 

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,126  RM Data Scientist

    Quick way, misusing operators:

    Take your list of regexes, aggregate them into one using concat, and apply all of them at once? :)

    Or use a Loop Values operator.
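
For readers who want to see the concat idea outside RapidMiner, here is a minimal Python sketch. It joins a keyword list into one alternation regex and reports which keywords matched and how often. The function name `count_keywords` and the sample text are made up for illustration, and it assumes the keywords should match literally and case-insensitively.

```python
import re
from collections import Counter

def count_keywords(text, keywords):
    """Count literal, case-insensitive matches of each keyword in text."""
    # Longest keywords first, so a longer keyword wins over its own prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps regex metacharacters in a keyword from being interpreted.
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    # Map the lowercased match back to the keyword as it appears in the list.
    lookup = {k.lower(): k for k in keywords}
    return Counter(lookup[m.group(0).lower()] for m in pattern.finditer(text))

counts = count_keywords(
    "I like RapidMiner and Spark. Spark is fast.",
    ["RapidMiner", "Hadoop", "Spark"],
)
print(counts)
```

Because the pattern is rebuilt from the list on every call, an updated keyword list is picked up without editing any regex by hand.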


    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,440  Community Manager

    Use "Filter Stopwords (Dictionary)" in the Text Processing extension?  You may need to invert it depending on what you want to do.

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi @mschmitz @sgenzer

     

    Thanks for your replies.

    I tried your solutions, this is the status now:
    1. Concat

    It shows the result of all attributes together, but I need to know exactly, maybe in a separate column, which keyword was found and how often. Any help there?
    + This is only a temporary solution, because the keyword list is updated dynamically. So, any idea how to search for all the keywords stored in a file?

    2. Filter Stopwords (dictionary)

    This is the first solution I thought of as well. But this operator has no option to invert the selection. So, any other solution?

    3. I'm trying the Loop Values operator, but need further help with that.

     

    Thanks in advance

    Amrita

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,440  Community Manager

    hmm...looking for an elegant solution.  I know this sounds weird, but maybe try this:

     

    - take your text and Split it (by space or whatever delimiter separates the words).  This will create a ton of attributes.

    - transpose this mess so that your text is listed word by word in one attribute and a ton of examples...sort of like this:

     

    I

    am

    Scott

    and

    I

    like

    RapidMiner

     

    - Do a Join (inner) with your keyword database list to see overlap

    - Aggregate if desired to see frequencies

     

    I do this more and more - create master "lookup" data sets and then join.  It's quite versatile.
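
A rough sketch of the split / transpose / join / aggregate idea in plain Python, for illustration only. The function name `keyword_frequencies` is made up, and lowercasing both sides is an assumption, not part of the steps above.

```python
import re
from collections import Counter

def keyword_frequencies(text, keywords):
    """Split text into words, keep the words in the keyword list, count them."""
    # "Split" + "Transpose": one word per entry.
    words = re.findall(r"\w+", text.lower())
    # The keyword list acts as the lookup data set.
    wanted = {k.lower() for k in keywords}
    # "Join (inner)": keep only the words that appear in the lookup.
    matched = [w for w in words if w in wanted]
    # "Aggregate": frequency per matched keyword.
    return Counter(matched)

freq = keyword_frequencies(
    "I am Scott and I like RapidMiner",
    ["RapidMiner", "Hadoop", "Spark"],
)
print(freq)
```

Note that this matches whole words only, so a keyword inside a longer token would not be counted.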

     

    Scott

     

  • 7amritaarora7 Member Posts: 25 Contributor I
    Hi @sgenzer

    Thanks for your help and sorry for my late reply.
    When I received your reply, I was already working along the same lines, and with your help and some trial and error, part of my process is now done.
    The last part I am stuck on is searching for the contents of one attribute (i.e. Keyword) in another attribute (i.e. Text). I tried the Generate Attributes and Filter Examples operators, but didn't get the required results. The regex-based "matches" and "contains" conditions don't work either, since they search for a fixed word in one attribute, not for the value of one attribute inside another.

    Any help here would be greatly appreciated!

    Thanks in advance
    Amrita
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    My initial thought to attacking this problem is this:

     

    1. Read Database for the keywords and then use an Extract Macro set to Data Value. Give this macro a name "keyword_macro" and then this will extract the keyword list and associate it 

    2. Use a Loop with a Filter Examples embedded inside. Loop over the keywords and drop the macro value into the Filter Examples. Use the "contains" filter and set it to %{keyword_macro}

    3. Then outside the loop use an Append operator to Append all the matching results.
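
The three steps can be sketched in plain Python as follows. This is an illustration of the logic only, not a RapidMiner process; `filter_by_keywords`, the column names and the sample data are hypothetical.

```python
def filter_by_keywords(texts, keywords):
    """For each keyword, keep the texts that contain it, then append all hits."""
    rows = []
    for keyword in keywords:          # the loop; keyword plays the role of the macro
        for text in texts:
            if keyword in text:       # Filter Examples with the "contains" condition
                rows.append({"Keyword": keyword, "Text": text})  # Append
    return rows

texts = ["I like RapidMiner", "Spark and Hadoop", "Nothing here"]
keywords = ["RapidMiner", "Hadoop", "Spark"]
rows = filter_by_keywords(texts, keywords)
for row in rows:
    print(row)
```

Keeping the keyword next to each matching text is what later makes a per-keyword count possible.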

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Ok, try this.  Just create a text file with a single column (see below), then import it and save it in your repository.

     

    Keywords

    RapidMiner

    Hadoop

    Spark

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Keywords" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Personal/Community Answers/data/Keywords"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.2.002" expanded="true" height="82" name="Rename" width="90" x="179" y="34">
    <parameter key="old_name" value="att1"/>
    <parameter key="new_name" value="Keywords"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="34">
    <parameter key="macro" value="count_examples"/>
    <parameter key="attribute_name" value="Keywords"/>
    <parameter key="example_index" value="%{iteration}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="loop" compatibility="7.2.002" expanded="true" height="82" name="Loop" width="90" x="447" y="34">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="iterations" value="%{count_examples}"/>
    <process expanded="true">
    <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro (2)" width="90" x="112" y="34">
    <parameter key="macro" value="keywords"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="Keywords"/>
    <parameter key="example_index" value="%{iteration}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="246" y="34">
    <parameter key="connection" value="Twitter Connection"/>
    <parameter key="query" value="RapidMiner"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.2.002" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Text.contains.%{keywords}"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.002" expanded="true" height="82" name="Append" width="90" x="581" y="34"/>
    <connect from_op="Retrieve Keywords" from_port="output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Update: Just swap out the Search Twitter operator for the data store of the strings you want to search. 

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi @Thomas_Ott

     

    I tried this with my database, but I'm not getting the required results. I'm getting parts of the text as the result. What I need is which keyword appears in the text and how many times.

    The other way that I think might work is to use the Loop Values operator to extract all values of the keyword attribute from the database. Then, inside Loop Values, add a Generate Attributes operator with a regex that searches for the keyword macro within the text attribute. Is this the correct way? If yes, I need some help with the regex.

    Or is there any other way?

    Thanks in advance

     

    Regards

    Amrita

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Hi @7amritaarora7,

     

    I did something like this for a customer once but I can't seem to find the process now. I think with the attached process it's getting close, just have to figure out how to properly select the columns.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Keywords" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Personal/Community Answers/data/Keywords"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.2.002" expanded="true" height="82" name="Rename" width="90" x="179" y="34">
    <parameter key="old_name" value="att1"/>
    <parameter key="new_name" value="Keywords"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.2.002" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="Keywords" value="lower(Keywords)"/>
    </list>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro" width="90" x="447" y="34">
    <parameter key="macro" value="count_examples"/>
    <parameter key="attribute_name" value="Keywords"/>
    <parameter key="example_index" value="%{iteration}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="loop" compatibility="7.2.002" expanded="true" height="82" name="Loop" width="90" x="581" y="34">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="iterations" value="%{count_examples}"/>
    <process expanded="true">
    <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro (2)" width="90" x="112" y="34">
    <parameter key="macro" value="keywords"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="Keywords"/>
    <parameter key="example_index" value="%{iteration}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="246" y="34">
    <parameter key="connection" value="Twitter Connection"/>
    <parameter key="query" value="RapidMiner"/>
    <parameter key="limit" value="1000"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.2.002" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Text.contains.%{keywords}"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.2.002" expanded="true" height="82" name="Select Attributes" width="90" x="514" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Text"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.2.002" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="34">
    <list key="function_descriptions">
    <parameter key="Keyword" value="%{keywords}"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.2.002" expanded="true" height="82" name="Nominal to Text" width="90" x="782" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="916" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.2.002" expanded="true" height="82" name="Set Role" width="90" x="1050" y="34">
    <parameter key="attribute_name" value="Keyword"/>
    <parameter key="target_role" value="id_1"/>
    <list key="set_additional_roles"/>
    </operator>
    <connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="append" compatibility="7.2.002" expanded="true" height="68" name="Append" width="90" x="715" y="238"/>
    <connect from_op="Retrieve Keywords" from_port="output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

    In this particular process I added the Process Documents from Data operator and set it to Term Occurrences. It definitely gets all the occurrences for RapidMiner/Hadoop/Spark, but a problem arises when I have the term "stratahadoop."  Maybe you can take it from here and experiment; I'm tied up for the rest of the week.
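
One possible reading of the "stratahadoop" issue is the difference between substring matching and whole-word matching. The Python sketch below is only an illustration of that difference under this assumption; the function names and the sample sentence are made up.

```python
import re

def count_substring(text, keyword):
    """Count every occurrence, including those inside longer tokens."""
    return text.lower().count(keyword.lower())

def count_whole_word(text, keyword):
    """Count only occurrences delimited by word boundaries."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

text = "Talks at #stratahadoop covered Hadoop and Spark"
substring_hits = count_substring(text, "hadoop")
whole_word_hits = count_whole_word(text, "hadoop")
print(substring_hits, whole_word_hits)
```

Which of the two counts is wanted depends on whether a token like "stratahadoop" should count as a mention of the keyword.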

     

    Update: Do a search on the forum for this. http://community.rapidminer.com/t5/RapidMiner-Studio/SOLVED-Simple-word-count-of-wordlist-from-document/m-p/25054#M18530

     

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi @Thomas_Ott

     

    Thanks a lot for this process. This seems really close to the results that I want. I'll work on it from here and let you know, when the perfect solution is found.

     

    Regards

    Amrita

  • 7amritaarora7 Member Posts: 25 Contributor I

    @Thomas_Ott, @mschmitz & @sgenzer

     

    Hi

     

    Good news! I got the keyword search process working by regularly updating the database with the Execute SQL operator and then going through all keywords with the Loop Values operator.

    Some sad news! Because I am using an older version of RapidMiner, the process suggested by @Thomas_Ott for counting keyword frequencies does not work. So I'm trying to write a script that finds the keyword frequency and run it with the Execute Script operator. The issue is an error regarding the output transfer: Execute Script is unable to show output when connected to the results port.

    Is this the right way, or is there any other way?

    P.S.: I'm using RapidMiner version 5.3

    Thanks in advance

    Regards

    Amrita

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,126  RM Data Scientist

    Dear Amrita,

     

    If you can provide an example process, I would be happy to have a look at your Execute Script.

     

    ~Martin

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi Martin

     

    I'm attaching here the process that I'm testing using execute script:

     

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="execute_script" compatibility="5.3.015" expanded="true" height="76" name="Execute Script" width="90" x="45" y="75">
    <parameter key="script" value=" import java.io.*;&#10; &#10; BufferedReader br=new BufferedReader(new InputStreamReader(System.in));&#10; // System.out.println(&quot;Enter the String: &quot;);&#10; String s= &quot;where are we going. who are you. we we we &quot;;&#10; // System.out.println(&quot;Enter substring: &quot;);&#10; String KeyWords=&quot;here, are, we&quot;;&#10; &#10; String[] splitString=KeyWords.split(&quot;,&quot;);&#10; int ind,count=0;&#10; for(String subString :splitString)&#10; {&#10; &#9; for(int i=0; i+subString.length()&lt;=s.length(); i++) &#10; {&#10; &#9;&#9; &#10; ind=s.indexOf(subString,i);&#10; if(ind&gt;=0)&#10; {&#10; count++;&#10; i=ind;&#10; ind=-1;&#10; }&#10; }&#10; &#10; &#9;// System.out.println(&quot;Occurence of '&quot;+subString+&quot;' in String is &quot;+count);&#10; &#9; return &quot;Occurence of '&quot;+subString+&quot;' in String is &quot;+count&#10; }&#10; return"/>
    </operator>
    <connect from_port="input 1" to_op="Execute Script" to_port="input 1"/>
    <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Please have a look at it. When I run this process, this is the error I'm getting in log:

    WARNING: Unknown result: class java.lang.String: Occurence of 'here' in String is 1

    Regards

    Amrita

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Hi Amrita,

     

    Couldn't you just upgrade to version 7.2? Version 5.3 is really really old and you miss A LOT of great new performance/operator/extension enhancements.

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi @Thomas_Ott

     

    Yes, 7.2 is a much better version and I use that too. But for the server, I only have complete access to version 5.3. The newest server has some limitations unless it's purchased. So, until my company purchases it, I have to use the older version.

    If there's a way to connect RapidMiner 7.2 to the RapidAnalytics server, do let me know :)

     

    Regards

    Amrita

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Hi Amrita,

     

    Unfortunately, you can't connect Studio 5.3 to Server 7.2. We offer a FREE version of RapidMiner Server now, but it is limited to 1000 API calls, 2 GB of memory, and 1 logical core. If your employer is using RapidAnalytics and gets value ($$$) from it, it would be great if you guys upgraded so we can continue to innovate!

  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi @Thomas_Ott

    We already plan to buy that within the next few months, but for now I'm looking for an interim solution.

    Regards

    Amrita

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,126  RM Data Scientist

    Amrita,

    The attached code works for me. I do not know what I really did :). I added the LogService. Those messages are visible in the Log panel of RM. If this is what you would like to have, I would like to build it with two example set inputs (1. texts, 2. list of keywords) and one example set output (input + count_XXX).

     

    Is our sales team already in contact with you? If not, please reach out to me at mschmitz at rapidminer dot com. We might simply help you get your use case implemented to convince the business.

     

    ~Martin

     

    import com.rapidminer.tools.LogService;
    import java.util.logging.Level;

    // Test values: the text to search and a comma-separated keyword list.
    String s = "where are we going. who are you. we we we ";
    String keyWords = "here, are, we";

    String[] splitString = keyWords.split(",");

    int ind, count = 0;
    for (String rawKeyword : splitString)
    {
        // Trim so that " are" is searched as "are".
        String subString = rawKeyword.trim();
        count = 0; // reset the counter for each keyword
        LogService.root.log(Level.INFO, subString);
        for (int i = 0; i + subString.length() <= s.length(); i++)
        {
            // Next occurrence at or after position i, or -1 if there is none.
            ind = s.indexOf(subString, i);
            if (ind >= 0)
            {
                count++;
                i = ind; // the loop's i++ then resumes the search one past this match
            }
        }
        // Write the result to the Log panel; nothing is delivered to the output port.
        LogService.root.log(Level.INFO, "Occurrence of " + subString + " in String is " + String.valueOf(count));
    }
    return;
  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi Martin

    Thanks a lot for this. This works great. Now I'll take it from here and join it to the main process. A small clarification: the results that are shown in the log can be transferred to a database or any other form of output, right?
    And yes, I'm in contact with your sales team.

    Thanks again :)

    Regards

    Amrita

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,126  RM Data Scientist

    Amrita,

     

    Sure. I will have a look tomorrow. It takes a few minutes to get it into an RM example set.

     

    ~Martin

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,126  RM Data Scientist

    Amrita,

     

    Attached is a process with this script, which has proper input and output ports. It's not yet commented. If you need any help understanding it, just post here.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess" width="90" x="112" y="289">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="Keywords" value="&quot;here, are, we&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="split" compatibility="7.2.003" expanded="true" height="82" name="Split" width="90" x="179" y="34"/>
    <operator activated="true" class="transpose" compatibility="7.2.003" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
    <operator activated="true" class="rename" compatibility="7.2.003" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
    <parameter key="old_name" value="att_1"/>
    <parameter key="new_name" value="Keywords"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Split" to_port="example set input"/>
    <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Get the keyword list</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess (2)" width="90" x="112" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;where are we going. who are you. we we we&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;where are we going.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="103" name="Append" width="90" x="246" y="34"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Get Test Data</description>
    </operator>
    <operator activated="true" class="execute_script" compatibility="7.2.003" expanded="true" height="103" name="Execute Script" width="90" x="246" y="85">
    <parameter key="script" value="import java.io.*;&#10;import com.rapidminer.tools.LogService;&#10;import com.rapidminer.tools.Ontology;&#10;import java.util.logging.Level&#10;&#10;// Configs for the user&#10;String keywordAttributeName = &quot;Keywords&quot;&#10;String testAttributeName = &quot;Text&quot;&#10;//---------------------------------&#10;&#10;BufferedReader br=new BufferedReader(new InputStreamReader(System.in));&#10;&#10;ExampleSet inputSet = input[0];&#10;ExampleSet keywordlist = input[1];&#10;&#10;Attribute keywordAttribute = keywordlist.getAttributes().get(keywordAttributeName);&#10;&#10;Attribute textAttribute = inputSet.getAttributes().get(testAttributeName);&#10;&#10;List&lt;String&gt; splitString = new ArrayList&lt;String&gt;();&#10;// mh, if there is more than one, concat?&#10;for (Example e : keywordlist){&#10; LogService.root.log(Level.INFO,e.getNominalValue(keywordAttribute))&#10; splitString.add(e.getNominalValue(keywordAttribute))&#10;}&#10;&#10;ExampleTable inputTable = inputSet.getExampleTable();&#10;&#10;//String[] splitString=KeyWords.split(&quot;,&quot;);&#10;int numberOfKeywords = splitString.size();&#10;&#10;Attribute[] outputAttributes = new Attribute[numberOfKeywords];&#10;int k = 0;&#10;for (String subString :splitString){&#10; outputAttributes[k] = AttributeFactory.createAttribute(&quot;count_&quot;+subString, Ontology.INTEGER);&#10; inputTable.addAttribute(outputAttributes[k]);&#10;inputSet.getAttributes().addRegular(outputAttributes[k]);&#10; k++;&#10;}&#10;&#10;int ind,count=0;&#10;String s;&#10;&#10;for(Example e : inputSet){&#10; s = e.getNominalValue(textAttribute);&#10; k = 0;&#10; for(String subString :splitString)&#10; {&#10; count=0; // everything else makes no sense&#10; LogService.root.log(Level.INFO,subString)&#10; for(int i=0; i+subString.length()&lt;=s.length(); i++) &#10; {&#10; &#10; ind=s.indexOf(subString,i);&#10; //LogService.root.log(Level.INFO,ind)&#10; if(ind&gt;=0)&#10; {&#10; count++;&#10; i=ind;&#10; 
ind=-1;&#10; }&#10; }&#10; e.setValue(outputAttributes[k], count);&#10; ++k;&#10; LogService.root.log(Level.INFO,&quot;Occurence of &quot;+subString+&quot; in String is &quot;+String.valueOf(count))&#10;&#10; }&#10; }&#10;&#10; return inputSet"/>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Execute Script" to_port="input 2"/>
    <connect from_op="Subprocess (2)" from_port="out 1" to_op="Execute Script" to_port="input 1"/>
    <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
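
For readers without RapidMiner at hand, the counting logic of the script can be sketched in plain Python. This is an illustration under assumptions, not the process itself: rows are modelled as dictionaries, the function names are made up, and only the `count_` prefix and the test data are taken from the process.

```python
def count_occurrences(text, keyword):
    """Count occurrences of keyword in text, resuming one character past each hit."""
    count = 0
    start = text.find(keyword)
    while start >= 0:
        count += 1
        start = text.find(keyword, start + 1)
    return count

def add_keyword_counts(rows, keywords, text_column="Text"):
    """Return copies of the rows with one count_<keyword> column per keyword."""
    result = []
    for row in rows:
        new_row = dict(row)
        for keyword in keywords:
            new_row["count_" + keyword] = count_occurrences(row[text_column], keyword)
        result.append(new_row)
    return result

rows = [
    {"Text": "where are we going. who are you. we we we"},
    {"Text": "where are we going."},
]
counted = add_keyword_counts(rows, ["here", "are", "we"])
for row in counted:
    print(row)
```

Like the script, this is a case-sensitive substring search, so "here" is counted inside "where".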
  • 7amritaarora7 Member Posts: 25 Contributor I

    Hi Martin

    Thanks a ton. This is exactly what I was trying to do.

    Thanks again :)

    Regards

    Amrita

  • simon_kuehne Member Posts: 6 Contributor I

    Hi,

    I had the same problem and your solution worked well for me! Thanks!

    In my case I don't want to find only standalone keywords, because some keywords are embedded within hashtags, e.g. #ILikeThatKeyword.

    Is there a way to find those matches as well?

     

    Thanks!

    Simon
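
One possible approach to the hashtag case, sketched in Python rather than as a RapidMiner process: pull out the hashtags, split CamelCase tags into words, and compare those words with the keyword list. The function names are hypothetical, and the sketch assumes hashtags are written in CamelCase as in the example above.

```python
import re

def split_hashtag(tag):
    """Split a CamelCase hashtag body such as 'ILikeThatKeyword' into words."""
    return re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])|\d+", tag)

def keywords_in_hashtags(text, keywords):
    """Return the keywords that occur as words inside any hashtag of text."""
    wanted = {k.lower(): k for k in keywords}
    found = []
    for tag in re.findall(r"#(\w+)", text):
        for word in split_hashtag(tag):
            if word.lower() in wanted:
                found.append(wanted[word.lower()])
    return found

parts = split_hashtag("ILikeThatKeyword")
hits = keywords_in_hashtags("Great talk #ILikeThatKeyword", ["Keyword", "Spark"])
print(parts, hits)
```

For all-lowercase hashtags there are no word boundaries to split on, so a case-insensitive substring search would be the fallback.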

     
