Text mining of mailing list traffic
I've just installed RapidMiner 5.2 and noticed there is no importer for the mailbox format. I'm interested in extracting word frequencies from mailing list traffic.
Do you know any workflow or tutorial to perform this task with RapidMiner?
Right now I've managed to export the traffic in one big file in CSV format (from Thunderbird) but the RapidMiner CSV importer-parser gets very confused recognizing columns. Sample data can be found in the following list:
http://lists.gforge.inria.fr/pipermail/ecm-discuss/
Any help would be appreciated.
Answers
Did you try the Read Documents (Mail) operator from the text mining extension?
Best,
Marius
Thanks
Andrea
Best, Marius
1. Download and uncompress any of the archive files, which are in gzip format, for example: http://lists.gforge.inria.fr/pipermail/ecm-discuss/2012-March.txt.gz
2. Import it into a new Thunderbird folder.
3. Install this extension/add-on: ImportExportTools
4. Right-click the folder and choose Import/Export > Export all messages in the folder > Spreadsheet (CSV).
Let me know if you cannot reproduce the problem.
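For anyone who wants to look at the raw messages without going through Thunderbird, the download-and-uncompress step can be sketched in Python. This is only a minimal sketch: it assumes the pipermail archive parses as a standard mbox file with Python's `mailbox` module, the function name `read_archive` is made up for illustration, and multipart messages are simply skipped.

```python
import gzip
import mailbox
import os
import shutil
import tempfile


def read_archive(gz_path):
    """Return (subject, body) pairs from a gzipped mbox-style archive."""
    # mailbox.mbox needs a real file path, so decompress to a temporary file first.
    with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
        with gzip.open(gz_path, "rb") as gz:
            shutil.copyfileobj(gz, tmp)
    messages = []
    box = mailbox.mbox(tmp.name, create=False)
    try:
        for msg in box:
            if msg.is_multipart():
                continue  # sketch only: skip messages with several parts
            payload = msg.get_payload(decode=True) or b""
            # Assumption: treat the body as UTF-8 and replace undecodable bytes.
            body = payload.decode("utf-8", errors="replace")
            messages.append((str(msg.get("Subject", "")), body))
    finally:
        box.close()
        os.remove(tmp.name)
    return messages


if __name__ == "__main__":
    # Assumes the archive was downloaded to the current directory beforehand.
    for subject, body in read_archive("2012-March.txt.gz"):
        print(subject, "|", len(body.splitlines()), "body lines")
```

Most message bodies span many lines, which is exactly what ends up inside a single CSV field after the export.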
Cheers,
Andrea
The problem is that RapidMiner reads CSV files line by line, so a line break inside a field is not handled correctly, even if the field is quoted. MS Excel seems to have the same problem. What worked for me was:
1. Import the file with OpenOffice
2. Save it as MS Excel file
3. Import the xls file with RapidMiner
This worked for an exported folder of my own mailbox. I don't know, however, whether that is scriptable for a large number of files.
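If the OpenOffice round trip turns out to be hard to script, one possible alternative is to rewrite the CSV so that no field contains a line break before importing it. The sketch below uses Python's standard `csv` module, which does honour quoted multi-line fields. The function name `flatten_multiline_fields` is hypothetical, and I have not verified how RapidMiner treats the result; the assumption is simply that a line-by-line reader copes once every record sits on one physical line.

```python
import csv
import io


def flatten_multiline_fields(src, dst):
    """Copy CSV rows from src to dst with all whitespace runs in each field
    (including line breaks) collapsed to single spaces. Returns the row count."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow([" ".join(field.split()) for field in row])
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo: the second record has a line break inside a quoted field.
    sample = 'Subject,Body\n"Hello","line one\nline two"\n'
    out = io.StringIO()
    rows = flatten_multiline_fields(io.StringIO(sample, newline=""), out)
    print(rows, "rows written")
    print(out.getvalue(), end="")
```

For real files, open both the input and the output with `newline=""`, as the `csv` documentation recommends, and loop over the exported files to handle many of them in one go.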
Happy Mining!
~Marius