Counting Emojis in Text Mining
Hello - there was a good question about how to do text mining with unusual characters such as emojis. I like to do a little "ETL jujitsu" when I work with text data like this: temporarily convert the text to percent-encoded UTF-8 hex to get unique, easily parsed tokens, and then convert back. Here's the idea:
1. Import your example set of text data:
2. Get your master set of emojis (I got them from here) and then put them into an Excel doc or whatever. I like putting the Unicode in brackets so I can find it easily + tokenize if desired (see "Unicode RM" column):
3. Use Encode URL to convert your text to UTF-8 hex, replace each UTF-8 hex sequence with its Unicode label (or whatever you prefer) from your Excel dictionary, and then convert back:
Voilà - perfect conversion (well, not bad anyway!)
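For anyone who wants to see the encode / replace / decode trick outside of RapidMiner, here is a rough Python sketch of the same idea. It is not the attached process, just an illustration: the `EMOJI_DICT` entries and the `tag_emojis` name are made up for this example, and the three-row dictionary stands in for the full Excel list with its bracketed "Unicode RM" column.

```python
from urllib.parse import quote, unquote

# Hypothetical stand-in for the Excel dictionary:
# percent-encoded UTF-8 hex -> bracketed Unicode label.
EMOJI_DICT = {
    "%F0%9F%98%80": "[U+1F600]",  # grinning face
    "%F0%9F%98%82": "[U+1F602]",  # face with tears of joy
    "%E2%9D%A4": "[U+2764]",      # heavy black heart
}


def tag_emojis(text):
    """Swap known emojis for bracketed labels via a percent-encoding round trip."""
    # Step 1: encode everything, so each emoji becomes a unique ASCII token.
    encoded = quote(text, safe="")
    # Step 2: replace the hex tokens found in the dictionary.
    for hex_token, label in EMOJI_DICT.items():
        encoded = encoded.replace(hex_token, label)
    # Step 3: decode, which restores the ordinary text around the labels.
    return unquote(encoded)


print(tag_emojis("I love it 😀!"))
```

Emojis that are not in the dictionary simply survive the round trip unchanged, so the result is only as complete as the master list.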
If you want to put that in a process that counts emojis, just add on some text mining using Process Documents From Data and join back with the original data set:
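The counting step can be sketched in Python as well. Again, this is only an illustration under assumptions, not what Process Documents From Data does internally: it assumes the text already carries bracketed labels like `[U+1F600]`, and the `rows` data, `TAG_PATTERN`, and `count_emoji_tags` names are invented for the example.

```python
import re
from collections import Counter

# Matches bracketed labels such as [U+1F600] (assumed label format).
TAG_PATTERN = re.compile(r"\[U\+[0-9A-F]{4,6}\]")


def count_emoji_tags(tagged_text):
    """Count each bracketed emoji label in one piece of text."""
    return Counter(TAG_PATTERN.findall(tagged_text))


# Hypothetical example set, already converted to bracketed labels.
rows = [
    {"id": 1, "text": "great [U+1F600] [U+1F600]"},
    {"id": 2, "text": "no emojis here"},
    {"id": 3, "text": "[U+2764] it [U+1F602]"},
]

# "Join back": keep every original column and add the counts next to it.
joined = [
    {**row, "emoji_counts": dict(count_emoji_tags(row["text"]))}
    for row in rows
]

for row in joined:
    print(row["id"], row["emoji_counts"])
```

Because the labels are plain ASCII in brackets, a simple pattern is enough to pick them out, which is the main reason for putting the Unicode in brackets in the first place.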
Thanks to user @gjagiello for the data and the inspiration!
[process attached for those that want to take a look]