Counting Emojis in Text Mining

sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959
edited December 2018 in Knowledge Base

Hello - there was a good question about how to do text mining on unusual characters such as emojis.  I like to do a little "ETL jujitsu" when I work with text data like this: temporarily convert the text to Unicode/UTF-8 hex so that each emoji becomes a unique, easily parsed token, and then convert back.  Here's the idea:

 

1. Import your example set of text data:

 

Screen Shot 2017-12-05 at 2.43.52 PM.png

 

2. Get your master set of emojis (I got them from here) and put them into an Excel doc or whatever.  I like putting the Unicode code point in brackets so I can find it easily and tokenize on it if desired (see the "Unicode RM" column):

 

Screen Shot 2017-12-05 at 2.41.11 PM.png
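
For readers who cannot see the screenshot, here is a minimal Python sketch of what one row of such a dictionary might contain. It is an illustration only, not the attached process: the column names `utf8_hex` and `unicode_rm`, the helper `dictionary_row`, and the exact `[U+XXXX]` bracket format are assumptions.

```python
from urllib.parse import quote


def dictionary_row(emoji: str) -> dict:
    """Build one lookup row for a single-code-point emoji (hypothetical layout)."""
    return {
        "emoji": emoji,
        # Percent-encoded UTF-8 bytes, the form the text takes after URL encoding
        "utf8_hex": quote(emoji, safe=""),
        # Bracketed code point, in the spirit of the "Unicode RM" column (format assumed)
        "unicode_rm": "[U+{:04X}]".format(ord(emoji)),
    }


rows = [dictionary_row(e) for e in ["\U0001F601", "\U0001F602"]]
for row in rows:
    print(row["utf8_hex"], "->", row["unicode_rm"])
```

Multi-code-point emojis (skin tones, flags, ZWJ sequences) would need a row per sequence rather than per code point; this sketch does not handle them.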

 

3. Use the Encode URL operator to convert your text to UTF-8 hex, replace each UTF-8 hex sequence with its Unicode token (or whatever you prefer) from your Excel dictionary, and then convert back:

 

Screen Shot 2017-12-05 at 2.40.49 PM.png

Voilà - perfect conversion (well not bad anyway!)

 

Screen Shot 2017-12-05 at 2.47.07 PM.png
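
The encode, replace, decode round trip in step 3 can be sketched in Python roughly as follows. This is an approximation of the idea, not the RapidMiner process itself: `EMOJI_DICT` is a hypothetical two-row stand-in for the Excel dictionary, `emojis_to_tokens` is an illustrative name, and the `[U+XXXX]` token format is assumed.

```python
from urllib.parse import quote, unquote

# Hypothetical stand-in for the Excel dictionary:
# percent-encoded UTF-8 hex -> bracketed Unicode token
EMOJI_DICT = {
    "%F0%9F%98%81": "[U+1F601]",
    "%F0%9F%98%82": "[U+1F602]",
}


def emojis_to_tokens(text: str, mapping: dict) -> str:
    """Percent-encode the text, swap emoji hex for tokens, then decode again."""
    # Everything outside the unreserved ASCII set becomes %XX, so each emoji
    # turns into a fixed, easily matched hex sequence.
    encoded = quote(text, safe="")
    for hex_form, token in mapping.items():
        # Pad with spaces so adjacent emojis end up as separate tokens
        encoded = encoded.replace(hex_form, " " + token + " ")
    decoded = unquote(encoded)
    # Collapse the extra whitespace introduced by the padding
    return " ".join(decoded.split())


converted = emojis_to_tokens("great day \U0001F601\U0001F601 lol \U0001F602", EMOJI_DICT)
print(converted)
```

Emojis missing from the dictionary simply pass through the decode step unchanged, which may be part of why the conversion is "not bad" rather than perfect.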

 

 

If you want to put that in a process that counts emojis, just add on some text mining using Process Documents From Data and join back with the original data set:

 

Screen Shot 2017-12-05 at 2.38.37 PM.png
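
As a rough picture of what the counting step produces, here is a small Python sketch that counts bracketed tokens per row and attaches the counts to the original rows. It assumes the text has already been converted to `[U+XXXX]` tokens; the field names `id`, `text`, `emoji_total`, and `emoji_counts` are made up for illustration and do not come from the attached process.

```python
import re
from collections import Counter

# Matches bracketed code-point tokens such as [U+1F601] (token format assumed)
TOKEN = re.compile(r"\[U\+[0-9A-F]{4,6}\]")


def add_emoji_counts(rows):
    """Return copies of the rows with per-token counts and a total attached."""
    out = []
    for row in rows:
        counts = Counter(TOKEN.findall(row["text"]))
        out.append({
            **row,  # keep the original columns, like joining back on the row id
            "emoji_total": sum(counts.values()),
            "emoji_counts": dict(counts),
        })
    return out


rows = [
    {"id": 1, "text": "great day [U+1F601] [U+1F601] lol [U+1F602]"},
    {"id": 2, "text": "no emojis here"},
]
result = add_emoji_counts(rows)
for row in result:
    print(row["id"], row["emoji_total"], row["emoji_counts"])
```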

 

Thanks to user @gjagiello for the data and the inspiration!

 

Scott

 

[process attached for those that want to take a look]

 

 

Comments

  • kayman Member Posts: 662 Unicorn
    edited November 2018
    If you only need to count them and do not care about keeping the emoticons themselves, you could also replace them using a Unicode range. This saves you the Excel file and its maintenance.

    If in the given example the code points range from 1F601 to, say, 1F64F, then you can handle them in one go using the Replace operator as follows. Because these code points lie above FFFF, a Java-style regular expression needs the \x{...} form; \u accepts exactly four hex digits.

    [\x{1F601}-\x{1F64F}]  ->  somespecialthingy
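
For comparison outside RapidMiner, the same range-based idea can be sketched in Python. This is only an illustration under stated assumptions: the helper names `count_emojis` and `replace_emojis` are hypothetical, and the range 1F601 to 1F64F is the example range from the comment, which covers only part of the emoji repertoire. Python writes code points above FFFF with the eight-digit `\U` escape.

```python
import re

# Character class covering the example range U+1F601 to U+1F64F
EMOJI_RANGE = re.compile("[\U0001F601-\U0001F64F]")


def count_emojis(text: str) -> int:
    """Count the characters that fall inside the example range."""
    return len(EMOJI_RANGE.findall(text))


def replace_emojis(text: str, placeholder: str = " somespecialthingy ") -> str:
    """Replace every in-range character with a placeholder token."""
    return EMOJI_RANGE.sub(placeholder, text)


sample = "hi \U0001F601\U0001F602 there \U0001F680"
print(count_emojis(sample))            # U+1F680 lies outside the range
print(replace_emojis("a\U0001F601b"))
```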



