‎12-05-2017 02:51 PM

Hello - there was a good question about how to do text mining with unusual characters such as emojis. I like to do a little "ETL jujitsu" when I work with text data like this: temporarily convert the text to URL-encoded UTF-8 hex so that each emoji becomes a unique, easily parsed token, swap in the tokens I want, and then convert back. Here's the idea:


1. Import your example set of text data:


Screen Shot 2017-12-05 at 2.43.52 PM.png


2. Get your master set of emojis (I got them from here) and put them into an Excel doc or whatever. I like putting the Unicode code point in brackets so I can find it easily and tokenize on it if desired (see the "Unicode RM" column):


Screen Shot 2017-12-05 at 2.41.11 PM.png
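
If you would rather generate the dictionary than type it by hand, here is a minimal Python sketch of one way to do it. The bracketed `[U+XXXX]` format, the underscore used to join multi-code-point emojis, and the function name `dictionary_row` are my own assumptions for illustration; the actual columns in the Excel file may look different.

```python
from urllib.parse import quote


def dictionary_row(emoji):
    """Return (emoji, URL-encoded UTF-8 hex, bracketed Unicode token).

    The token format "[U+1F600]" is an assumed convention, not
    necessarily the one used in the original Excel dictionary.
    """
    # Percent-encode the emoji's UTF-8 bytes, e.g. "%F0%9F%98%80"
    utf8_hex = quote(emoji, safe="")
    # One "U+XXXX" part per code point, joined with "_" so the token has no spaces
    token = "[" + "_".join(f"U+{ord(ch):04X}" for ch in emoji) + "]"
    return emoji, utf8_hex, token


# Hypothetical sample; a real master list would be much longer
for e in ["😀", "👍"]:
    print(dictionary_row(e))
```

Each row gives you the "find" value (the UTF-8 hex) and the "replace" value (the bracketed token) for the dictionary lookup in the next step.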


3. Use Encode URL to convert your text to UTF-8 hex, replace each UTF-8 hex sequence with its Unicode token (or whatever you prefer) using your Excel dictionary, and then convert back:


Screen Shot 2017-12-05 at 2.40.49 PM.png

Voilà - perfect conversion (well not bad anyway!)


Screen Shot 2017-12-05 at 2.47.07 PM.png
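
For anyone who wants to try the same encode / replace / decode trick outside RapidMiner, here is a rough Python equivalent. It is only a sketch of the idea, not the attached process: the two-entry `EMOJI_TO_TOKEN` dictionary and the function name `replace_emojis` are made up for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical two-entry dictionary; in practice this comes from the Excel file
EMOJI_TO_TOKEN = {
    "😀": "[U+1F600]",
    "👍": "[U+1F44D]",
}


def replace_emojis(text, emoji_to_token):
    """Encode to UTF-8 hex, swap known emoji sequences for tokens, decode back."""
    # 1. URL-encode everything, so each emoji becomes a run of %XX bytes
    encoded = quote(text, safe="")
    # 2. Replace longer emoji first so multi-code-point ones are not split
    for emoji in sorted(emoji_to_token, key=len, reverse=True):
        encoded = encoded.replace(quote(emoji, safe=""), emoji_to_token[emoji])
    # 3. Decode the remaining %XX sequences back to normal text
    return unquote(encoded)


print(replace_emojis("Great job 👍😀!", EMOJI_TO_TOKEN))
```

Because a literal `%` in the original text is itself encoded as `%25`, the hex sequences in the dictionary can only match real emoji bytes, which is what makes the intermediate form safe to search and replace.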



If you want to put that in a process that counts emojis, just add on some text mining using Process Documents from Data and join the result back with the original data set:


Screen Shot 2017-12-05 at 2.38.37 PM.png
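
The counting step can be sketched in Python as well. This assumes the text has already been converted so that emojis appear as bracketed tokens like `[U+1F44D]`; the regular expression, the `docs` sample, and the `emoji_counts` field are all illustrative assumptions rather than what Process Documents from Data produces.

```python
import re
from collections import Counter

# Matches assumed tokens such as "[U+1F600]" or "[U+2764_U+FE0F]"
TOKEN_PATTERN = re.compile(r"\[U\+[0-9A-F]+(?:_U\+[0-9A-F]+)*\]")


def count_emoji_tokens(converted_text):
    """Count each bracketed emoji token in one already-converted document."""
    return Counter(TOKEN_PATTERN.findall(converted_text))


# Hypothetical converted documents, keyed by an id for joining back
docs = [
    {"id": 1, "text": "Nice [U+1F44D][U+1F44D] [U+1F600]"},
    {"id": 2, "text": "No emojis here"},
]

# "Join" the counts back onto the original rows
joined = [
    {**doc, "emoji_counts": dict(count_emoji_tokens(doc["text"]))}
    for doc in docs
]
for row in joined:
    print(row)
```

Keeping the id on every row is what makes the join back to the original data set trivial, which is the same reason the RapidMiner process joins on the original example set.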


Thanks to user @gjagiello for the data and the inspiration!




[process attached for those that want to take a look]



Scott Genzer
Senior Community Manager
RapidMiner, Inc.