How to re-use API query results after adding subsequent operators without making a second API call?

batstache611 Member Posts: 45 Guru
edited November 2018 in Help

Hello,

 

So I am using a third-party extension by AYLIEN for sentiment analysis. In the free version there are limits on how many calls I can make per minute, day, or month. I'm not sure of the exact number, but I think free users get around 100 hits per minute, possibly fewer. That is fine, I can understand that.

 

What I would like to know is: once I get some results back from the AYLIEN API, how do I reuse those same results after making changes to my process, such as adding an operator or two after the AYLIEN operator, without having to store the results from the initial run and clutter the repository, and without making another call to the API? I certainly do not want to exceed my daily quota, but I also cannot spend $49 USD per month, which would still leave me with a rate limit of 120 hits per minute.

 

The reason I ask is that at school I get to use SAS Enterprise Miner, where we can simply execute a newly added node in the process, and all the results from previous nodes (be it data manipulation or modeling) remain available to work on at any time, as long as the project file is open. If there is something similar in RapidMiner, I'd like to know. Please and thank you very much.

Best Answers

  • pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research
    Solution Accepted

    Hi Batstache,

    if I understand you correctly, you want to cache results from the API call and get a handle on which text is processed through the API. I would recommend controlling the input to the API-calling operator with the "Filter Example Range" operator, and placing a "Store" operator immediately after the API-calling operator.

    Here is a sample process that reads some tweets I've collected about a given Twitter user (they're loaded from my repository here) and filters out only a few of them to do some sentiment analysis on. Afterwards the result is stored in a "results" folder in my repository before being processed further.

    [Image: limit_api_calls.PNG – Limit data for API calls and store results immediately]

    To access previously stored results from API calls and merge them together, you can use the "Retrieve" operator to load the stored data from your repository and append new results to it, as displayed in this sample process:

    [Image: merging_results.PNG – Merging previous results with new ones]
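    The XML of the merging process was not included in the post; below is a minimal sketch of what it could look like. The operator names, the second repository path (`temp_result_2`), and the port names are assumptions, not taken from the original process:

    ```xml
    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
        <process expanded="true">
          <!-- load the previously stored sentiment results -->
          <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" name="Retrieve previous results">
            <parameter key="repository_entry" value="../results/temp_result"/>
          </operator>
          <!-- load a newly scored batch (stored by a later run of the first process) -->
          <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" name="Retrieve new results">
            <parameter key="repository_entry" value="../results/temp_result_2"/>
          </operator>
          <!-- stack both ExampleSets; their attribute sets must match -->
          <operator activated="true" class="append" compatibility="7.5.001" expanded="true" name="Append"/>
          <connect from_op="Retrieve previous results" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Retrieve new results" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
        </process>
      </operator>
    </process>
    ```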

     

    Keep in mind that you need to create an Aylien Text Analysis connection first. Mine is named "Home" here. Find the XML of the first process below:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Retrieve rapidminer tweets" width="90" x="380" y="34">
            <parameter key="repository_entry" value="../data/rapidminer tweets"/>
            <description align="center" color="transparent" colored="false" width="126">input data set</description>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="7.5.001" expanded="true" height="82" name="Filter Example Range" width="90" x="514" y="34">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="10"/>
            <description align="center" color="transparent" colored="false" width="126">control the amount of input you process</description>
          </operator>
          <operator activated="true" class="com.aylien.textapi.rapidminer:aylien_sentiment" compatibility="0.2.000" expanded="true" height="68" name="Analyze Sentiment" width="90" x="648" y="34">
            <parameter key="connection" value="Home"/>
            <parameter key="input_attribute" value="Text"/>
          </operator>
          <operator activated="true" class="store" compatibility="7.5.001" expanded="true" height="68" name="Store" width="90" x="782" y="34">
            <parameter key="repository_entry" value="../results/temp_result"/>
            <description align="center" color="transparent" colored="false" width="126">store results to avoid calling the API too often</description>
          </operator>
          <operator activated="true" class="subprocess" compatibility="7.5.001" expanded="true" height="82" name="Subprocess" width="90" x="916" y="34">
            <process expanded="true">
              <connect from_port="in 1" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">add awesome post processing in here</description>
          </operator>
          <connect from_op="Retrieve rapidminer tweets" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Analyze Sentiment" to_port="Example Set"/>
          <connect from_op="Analyze Sentiment" from_port="Example Set" to_op="Store" to_port="input"/>
          <connect from_op="Store" from_port="through" to_op="Subprocess" to_port="in 1"/>
          <connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

     

    Tip:

    If you encapsulate the shown process in a "Loop Parameters" operator to automatically iterate over example ranges, have a look at macros. You can use them, e.g., to create changing names for the intermediate results you are storing. For example, a result could be named "sentiment_result_%{execution_count}". This replaces %{execution_count} with the number of the current loop execution, resulting in repository entries named "sentiment_result_1", "sentiment_result_2", "sentiment_result_3", and so on. Consult this article for further information on macros.
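    As a sketch, the "Store" operator inside such a loop could use the macro directly in its repository path (the "results" folder name is carried over from the process above; the macro name is the assumption here):

    ```xml
    <operator activated="true" class="store" compatibility="7.5.001" expanded="true" name="Store">
      <!-- %{execution_count} is expanded per loop iteration, yielding sentiment_result_1, _2, ... -->
      <parameter key="repository_entry" value="../results/sentiment_result_%{execution_count}"/>
    </operator>
    ```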

     

     

    Cheers,

    Philipp

  • pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research
    Solution Accepted

    You're welcome =)

     

    I didn't know the intended timespan between API calls, so I provided a solution with an unlimited timespan in between. You can certainly use "Remember" and "Recall" to keep the ExampleSets in memory only for a short period of time. If you build a loop in which objects are remembered and recalled, you might run into the situation that nothing can be recalled in the first iteration. For that, you can use the "Handle Exception" operator to avoid such situations.

     

    In general it is a good idea to encapsulate API-calling operators in "Handle Exception" to provide a means of handling unforeseen problems.
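    A minimal sketch of such a wrapper, reusing the "Analyze Sentiment" operator from the process above (the port names and the pass-through catch branch are assumptions, not taken from an actual exported process):

    ```xml
    <operator activated="true" class="handle_exception" compatibility="7.5.001" expanded="true" name="Handle Exception">
      <!-- "try" subprocess: the API call that might fail or hit the rate limit -->
      <process expanded="true">
        <operator activated="true" class="com.aylien.textapi.rapidminer:aylien_sentiment" compatibility="0.2.000" expanded="true" name="Analyze Sentiment">
          <parameter key="connection" value="Home"/>
          <parameter key="input_attribute" value="Text"/>
        </operator>
        <connect from_port="in 1" to_op="Analyze Sentiment" to_port="Example Set"/>
        <connect from_op="Analyze Sentiment" from_port="Example Set" to_port="out 1"/>
      </process>
      <!-- "catch" subprocess: on failure, pass the input through unchanged -->
      <process expanded="true">
        <connect from_port="in 1" to_port="out 1"/>
      </process>
    </operator>
    ```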

     

    BTW: if you are trying to avoid repository entries because of disk limitations, never mind; but if it's a matter of organisation, you might want to use a "temp" subfolder in your results folder and empty it every now and then.

     

    Cheers,

    Philipp

Answers

  • batstache611 Member Posts: 45 Guru

    Thank you @pschlunder, I was just hoping to avoid having to touch the repository, so as to keep it organized and free of unnecessary ExampleSets, i.e. intermediate results. Your process is one possible solution, and yes, macros will help automate it a little. I recently discovered the Remember and Recall operators that work with macros. Will those help in avoiding the repository?
