RapidMiner

a week ago

Hello - there was a good question about how to do text mining on text that contains unusual characters such as emojis. I like to do a little "ETL jujitsu" when I work with text data like this: temporarily convert the text to UTF-8 hex (via URL encoding) so that every emoji becomes a unique, easily parsed token, swap those tokens for readable labels, and then convert back. Here's the idea:

 

1. Import your example set of text data:

 

Screen Shot 2017-12-05 at 2.43.52 PM.png

 

2. Get your master set of emojis (I got them from here) and put them into an Excel sheet or any other lookup table. I like putting the Unicode code in brackets so I can find it easily and tokenize on it if desired (see the "Unicode RM" column):

 

Screen Shot 2017-12-05 at 2.41.11 PM.png
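If you would rather generate the dictionary rows than type them by hand, here is a minimal Python sketch of the idea. It is not part of the RapidMiner process. The function name `dictionary_row` and the `[U+XXXX]` label format are my own assumptions, so adjust the label to match whatever you keep in your "Unicode RM" column.

```python
from urllib.parse import quote


def dictionary_row(emoji):
    """Return (percent-encoded UTF-8 hex, bracketed Unicode label) for one emoji.

    The "[U+XXXX]" label format is an assumption made for this sketch.
    """
    # quote() encodes the string as UTF-8 and writes each byte as %XX (uppercase hex).
    utf8_hex = quote(emoji, safe="")
    # One U+XXXX entry per code point, so multi-code-point emojis stay in one label.
    label = "[" + " ".join(f"U+{ord(ch):04X}" for ch in emoji) + "]"
    return utf8_hex, label


for emoji in ["\U0001F600", "\u2764"]:
    print(*dictionary_row(emoji))
```

Each printed row pairs the UTF-8 hex to search for with the bracketed label to replace it with.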

 

3. Use the Encode URL operator to convert your text to UTF-8 hex, replace each UTF-8 hex sequence with its Unicode code (or whatever label you like) using your Excel dictionary, and then decode back to normal text:

 

Screen Shot 2017-12-05 at 2.40.49 PM.png

Voilà - perfect conversion (well, not bad anyway!)

 

Screen Shot 2017-12-05 at 2.47.07 PM.png
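For anyone who wants to see the encode / replace / decode round trip outside of RapidMiner, here is a rough Python sketch of the same trick. It only illustrates the idea and is not the attached process. `EMOJI_DICT`, `tag_emojis` and the `[U+XXXX]` labels are hypothetical names and formats chosen for the example, and the two dictionary entries stand in for the full Excel dictionary.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical two-row dictionary: percent-encoded UTF-8 hex -> bracketed label.
# Keys use uppercase hex because that is what quote() emits.
EMOJI_DICT = {
    "%F0%9F%98%80": "[U+1F600]",
    "%E2%9D%A4": "[U+2764]",
}


def tag_emojis(text, dictionary=EMOJI_DICT):
    """Percent-encode the text, swap known emoji sequences for labels, decode back."""
    encoded = quote(text, safe="")
    # Try longer keys first so a multi-code-point emoji wins over a shorter prefix.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Labels are percent-encoded too, so the final unquote() restores them exactly.
    encoded = pattern.sub(lambda m: quote(dictionary[m.group(0)], safe=""), encoded)
    return unquote(encoded)


print(tag_emojis("I love this \U0001F600\u2764"))
```

Emojis that are missing from the dictionary simply survive the round trip unchanged, which makes it easy to spot gaps in the master list.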

 

 

If you want to put that in a process that counts emojis, just add on some text mining using Process Documents from Data and join the result back with the original data set:

 

Screen Shot 2017-12-05 at 2.38.37 PM.png
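As a rough stand-in for the counting step, here is a small Python sketch that counts bracketed emoji labels per row and keeps the counts next to the original text. It assumes the emojis have already been replaced by labels of the form `[U+XXXX]` (my assumed format, not necessarily what your dictionary uses). `count_emoji_tokens` is a hypothetical helper, not a RapidMiner operator.

```python
import re
from collections import Counter

# Matches labels such as "[U+1F600]" or "[U+1F468 U+200D U+1F469]" (assumed format).
TOKEN = re.compile(r"\[U\+[0-9A-F]+(?: U\+[0-9A-F]+)*\]")


def count_emoji_tokens(tagged_texts):
    """Return one dict per row: id, original text, and a count per emoji label."""
    rows = []
    for row_id, text in enumerate(tagged_texts):
        counts = Counter(TOKEN.findall(text))
        # Merging the counts into the row plays the role of the join on id.
        rows.append({"id": row_id, "text": text, **counts})
    return rows


for row in count_emoji_tokens(["hi [U+1F600][U+1F600] [U+2764]", "no emoji here"]):
    print(row)
```

Rows without any emoji just carry their id and text, much like missing values in the joined example set.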

 

Thanks to user @gjagiello for the data and the inspiration!

 

Scott

 

[process attached for those that want to take a look]

 

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.