
Extracting Emoji from Tweets in Twitter

c1579481 Member Posts: 2 Contributor I
edited November 2018 in Help


Hello everyone,

I need help: is it possible to extract just the emoji from tweets on Twitter, which I have selected from popular hashtags? If it is, I would appreciate some tips, please.

Thanks



Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Cross posting everywhere will not get you the answer sooner. 

     

    I will delete the other topics. 

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You would need to set your encoding to the appropriate type under Preferences. For example, UTF-8 will extract a lot of emoticon short codes, e.g. ": )" for a happy smiley.

  • c1579481 Member Posts: 2 Contributor I

    But I don't need a specific code. I am trying to check the use of emoji in tweets, so I expect all kinds of emoji. Does that mean I should add the Unicode of every emoji?

    Thanks

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    If you want to do text processing and extract the emojis and hashtags, you'll have to transform them into something that won't be destroyed during tokenization. For example, the smiley emoji is typically represented as ": )" (space and quotes added for clarity). If you use the default tokenization settings, that will be wiped out and you won't be able to extract information from it.

    What I typically do is use a few Replace operators to replace the ": )" with "smiley_face" and "#myawesomehashtag" with "hashtag_myawesomehashtag". Then when you tokenize the text, the placeholders will still remain in the text processing.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
    <parameter key="connection" value="ThomasOtt"/>
    <parameter key="query" value="love"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    <parameter key="replace_what" value="\:\)"/>
    <parameter key="replace_by" value="smiley_face"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace (2)" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    <parameter key="replace_what" value="#(\w+)"/>
    <parameter key="replace_by" value="hashtag_$1"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
    <connect from_op="Replace (2)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
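
To illustrate the replace-before-tokenize idea outside RapidMiner, here is a minimal Python sketch. It is not the process above: the function names `protect_tokens` and `tokenize_simple` are hypothetical, and the tokenizer is only a rough stand-in that splits on non-word characters. RapidMiner's Tokenize modes may treat characters such as the underscore differently, so the placeholder format should be checked against the tokenizer settings actually used.

```python
import re

def protect_tokens(text):
    """Swap the ':)' emoticon and hashtags for plain-word placeholders."""
    text = re.sub(r":\)", " smiley_face ", text)
    # \w+ stops at the end of the hashtag, so every hashtag in the text is handled
    text = re.sub(r"#(\w+)", r"hashtag_\1", text)
    return text

def tokenize_simple(text):
    """Stand-in tokenizer (an assumption): split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text) if tok]

tweet = "I love this :) #myawesomehashtag"
print(tokenize_simple(tweet))                  # emoticon and '#' are lost
print(tokenize_simple(protect_tokens(tweet)))  # placeholders survive
```

Without the replacement step the smiley disappears and the hashtag becomes indistinguishable from an ordinary word; with it, both survive as single tokens.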
  • gjagiello Member Posts: 2 Contributor I

    Hello! Let's say I have a large set of examples that includes a 'comment' attribute, and the original data for that attribute (.xlsx) looks like this:

     

    life's great ✨
    Girls Night Out ❤️❤️❤️❤️
    ????
    We baddies ???
    Friend in trouble so I'm babysitting her son that loves me?
    Goodnight ❤️

     

    What I'd like as a result is a set where the examples are unique emoji and a count of the appearances of that emoji, as found in the 'comment' attribute for all examples in the set, something like:

     

    ✨ - 1
    ❤️ - 5
    ? - 5
    ? - 1
    ? - 1
    ? - 1

     

    This is a data prep step for some other processing that I am pretty sure I know how to perform in RapidMiner. Note that I need to see the actual emoji as entered by the user for my use case.

     

    I've tried a lot of Google-fu and RapidMiner trial-and-error (and more error) but have come up stumped. Any thoughts here to guide a relative newcomer? Thank you for your consideration.

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @gjagiello - welcome to the community.  I love this kind of ETL ju-jitsu.  :)  The trick that I always use in situations like this is to convert the text to UTF-8 hex (percent-encoded), replace each emoji code with something recognizable like @Thomas_Ott suggested, and convert back.  So for example, your heart emoji gets converted by "Encode URLs" into "%E2%9D%A4%EF%B8%8F" (set a breakpoint after Encode URLs and look at the data).  Then I use Replace to convert it to something normal, and then find word occurrences.  If you have a lot of emojis, you can use a replace dictionary.
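
As a quick cross-check of that encoded string outside RapidMiner, the following Python sketch percent-encodes the red heart with the standard library. It only illustrates where the code comes from and is not part of the process; the variable names are hypothetical.

```python
from urllib.parse import quote, unquote

# The red heart as typed by most keyboards is two code points:
# U+2764 (heavy black heart) followed by U+FE0F (variation selector-16).
heart = "\u2764\ufe0f"

encoded = quote(heart)   # percent-encoded UTF-8 bytes
print(encoded)           # %E2%9D%A4%EF%B8%8F
print([hex(ord(ch)) for ch in heart])

# Decoding restores the original characters, so the round trip is lossless.
print(unquote(encoded) == heart)
```

The variation selector accounts for the trailing `%EF%B8%8F`; a heart typed without it would encode to only `%E2%9D%A4`, so a replace pattern should match whichever form actually appears in the data.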

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.0.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="/Users/GenzerConsulting/Desktop/emojis.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations"/>
    <parameter key="encoding" value="UTF-8"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="att1.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.0.000" expanded="true" height="82" name="Subprocess" width="90" x="179" y="34">
    <process expanded="true">
    <operator activated="true" class="web:encode_urls" compatibility="7.3.000" expanded="true" height="82" name="Encode URLs" width="90" x="45" y="34">
    <parameter key="url_attribute" value="att1"/>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%E2%9D%A4%EF%B8%8F"/>
    <parameter key="replace_by" value=" [RH] "/>
    <description align="center" color="transparent" colored="false" width="126">Red Heart</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace (2)" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%F0%9F%92%99"/>
    <parameter key="replace_by" value=" [BH] "/>
    <description align="center" color="transparent" colored="false" width="126">Blue Heart</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace (3)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%F0%9F%92%AF"/>
    <parameter key="replace_by" value=" [100] "/>
    <description align="center" color="transparent" colored="false" width="126">100</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace (4)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%F0%9F%98%8D"/>
    <parameter key="replace_by" value=" [HF] "/>
    <description align="center" color="transparent" colored="false" width="126">Heart Face</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace (5)" width="90" x="715" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%F0%9F%94%A5"/>
    <parameter key="replace_by" value=" [Flame] "/>
    <description align="center" color="transparent" colored="false" width="126">Flame</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.0.000" expanded="true" height="82" name="Replace (6)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="replace_what" value="%E2%9C%A8"/>
    <parameter key="replace_by" value=" [Stars] "/>
    <description align="center" color="transparent" colored="false" width="126">Stars</description>
    </operator>
    <operator activated="true" class="web:decode_urls" compatibility="7.3.000" expanded="true" height="82" name="Decode URLs" width="90" x="983" y="34">
    <parameter key="url_attribute" value="att1"/>
    </operator>
    <connect from_port="in 1" to_op="Encode URLs" to_port="example set input"/>
    <connect from_op="Encode URLs" from_port="example set output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
    <connect from_op="Replace (2)" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
    <connect from_op="Replace (3)" from_port="example set output" to_op="Replace (4)" to_port="example set input"/>
    <connect from_op="Replace (4)" from_port="example set output" to_op="Replace (5)" to_port="example set input"/>
    <connect from_op="Replace (5)" from_port="example set output" to_op="Replace (6)" to_port="example set input"/>
    <connect from_op="Replace (6)" from_port="example set output" to_op="Decode URLs" to_port="example set input"/>
    <connect from_op="Decode URLs" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">percent-encode emojis as UTF-8 and then convert to [xx] notation</description>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="mode" value="regular expression"/>
    <parameter key="expression" value="\s"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Subprocess" to_port="in 1"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
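
For readers who want to follow the steps without opening RapidMiner, here is a rough Python sketch of the same sequence: percent-encode, replace known codes with labels, decode, split on whitespace, and count. The codes and labels are copied from the Replace operators above and the three sample comments come from the question; the names `EMOJI_LABELS` and `label_emojis` are hypothetical, and the sketch approximates the process rather than reproducing its operators.

```python
from collections import Counter
from urllib.parse import quote, unquote

# Percent-encoded UTF-8 code -> label, as in the Replace operators above
EMOJI_LABELS = {
    "%E2%9D%A4%EF%B8%8F": " [RH] ",
    "%F0%9F%92%99": " [BH] ",
    "%F0%9F%92%AF": " [100] ",
    "%F0%9F%98%8D": " [HF] ",
    "%F0%9F%94%A5": " [Flame] ",
    "%E2%9C%A8": " [Stars] ",
}

def label_emojis(text):
    """Encode, swap known emoji codes for labels, then decode again."""
    encoded = quote(text, safe="")
    for code, label in EMOJI_LABELS.items():
        encoded = encoded.replace(code, label)
    return unquote(encoded)

comments = [
    "life's great \u2728",                                          # sparkles
    "Girls Night Out \u2764\ufe0f\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f",  # four red hearts
    "Goodnight \u2764\ufe0f",                                       # one red heart
]

# Whitespace tokenization, then term occurrences over all comments
counts = Counter(tok for c in comments for tok in label_emojis(c).split())
print(counts["[RH]"], counts["[Stars]"])
```

Padding each label with spaces is what lets consecutive emojis, such as the four hearts, come out as separate tokens after the whitespace split.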

    Thank you for the entertainment.  I love this stuff.


    Scott

     

    [EDIT: oh sorry - if you only want a list of occurrences of the emojis instead of all the tokens, you could simply filter for them only.]
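
One way to picture that filtering step, sketched in Python under assumptions: the input strings are text that has already been converted to the bracketed labels, and the `LABEL_TO_EMOJI` mapping (a hypothetical name) is used to show the actual emoji again in the result, as the question asked. In RapidMiner the equivalent would be filtering the word list for the bracketed tokens.

```python
import re
from collections import Counter

# Matches only bracketed labels such as [RH] or [Stars]
LABEL = re.compile(r"\[\w+\]")

# Hypothetical reverse mapping so the result shows the emoji itself
LABEL_TO_EMOJI = {"[RH]": "\u2764\ufe0f", "[Stars]": "\u2728"}

def emoji_counts(labelled_texts):
    """Count bracketed labels only, ignoring every ordinary word."""
    counts = Counter()
    for text in labelled_texts:
        counts.update(LABEL.findall(text))
    return counts

labelled = [
    "life's great  [Stars] ",
    "Girls Night Out  [RH]  [RH]  [RH]  [RH] ",
    "Goodnight  [RH] ",
]

for label, n in emoji_counts(labelled).most_common():
    print(LABEL_TO_EMOJI.get(label, label), "-", n)
```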

     

  • gjagiello Member Posts: 2 Contributor I

    Scott, thanks for the reply and the great suggestion! I'm going to try this out and report back...you gave me an idea I'll share if I can get it to work. Glad you enjoy this data sparring! :D
