"Extract Words from Text based on predefined set of keywords"

evelyn_barani Member Posts: 1 Contributor I
edited May 23 in Help

Hi all,

I am very new to RapidMiner.

 

I have a set of news articles and I want to find out whether the articles include given words (from an Excel file). I also want to find out how often each particular word occurs.

 

I've been reading a lot in the forum, but haven't found a solution yet.

 

Can anybody help out?

 

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,317  Community Manager

    Hello @evelyn_barani, welcome to the community. It would be very helpful if you could post your data set so we can see exactly what you're working with.

     

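    There are many ways to do what you want with these articles. Most likely you'll want to install the Text Processing extension from the Marketplace and use its operators. The example process below takes a sample data set, keeps only the rows whose JobDescription contains "manager", then tokenizes the text and outputs a word list: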

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve REDUCED job post data set (5862 examples)" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Community Samples/Community Data Science/Text Mining Tutorials by Neil McGuigan/data/REDUCED job post data set (5862 examples)"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="JobDescription.contains.manager"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">job description contains &amp;quot;manager&amp;quot;</description>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve REDUCED job post data set (5862 examples)" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
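
For readers who want to see the underlying logic outside RapidMiner, here is a minimal Python sketch of the same idea: tokenize each article, then count how often each keyword from a predefined list occurs. It is an illustration under assumptions, not a translation of the process above. The function names `tokenize` and `count_keywords`, the sample articles, and the keyword list are all made up for the example, and the keywords are hard-coded here rather than read from an Excel file.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters.

    Anything that is not a letter (spaces, digits, punctuation) acts as a
    separator. This is a simplification chosen for the sketch.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def count_keywords(articles, keywords):
    """Return one dict per article mapping each keyword to its count."""
    wanted = {k.lower() for k in keywords}
    rows = []
    for text in articles:
        # Count only the tokens that appear in the keyword list.
        counts = Counter(t for t in tokenize(text) if t in wanted)
        # Keywords that never occur get an explicit count of 0.
        rows.append({k: counts[k.lower()] for k in keywords})
    return rows


# Hypothetical sample data, standing in for the news articles and the
# keyword list that would normally come from the Excel file.
articles = [
    "The manager met the new manager of the bank.",
    "Markets fell as the bank raised rates.",
]
keywords = ["manager", "bank", "inflation"]

result = count_keywords(articles, keywords)
for index, row in enumerate(result):
    print(index, row)
```

A count of 0 answers the "does the article include the word" question, and any positive count answers "how often". Matching is on whole tokens only, so "managers" would not be counted as "manager" unless a stemming step is added.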

    Scott

     
