Text mining of mailing list traffic

Andrea_g Member Posts: 3 Contributor I
edited November 2018 in Help
I've just installed RapidMiner 5.2 and noticed there is no importer for the mailbox (mbox) format. I'm interested in extracting word frequencies from mailing list traffic.
Do you know any workflow or tutorial to perform this task with RapidMiner?

Right now I've managed to export the traffic to one big CSV file (from Thunderbird), but the RapidMiner CSV importer does not recognize the columns correctly. Sample data can be found in the following list:

http://lists.gforge.inria.fr/pipermail/ecm-discuss/

Any help would be appreciated.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    did you try the Read Documents (Mail) operator from the text mining extension?

    Best,
    Marius
  • Andrea_g Member Posts: 3 Contributor I
    Marius wrote:

    Hi,

    did you try the Read Documents (Mail) operator from the text mining extension?

    Best,
    Marius
    Yes, but it seems to read from a mail store, not from disk. I don't want to download my 4 GB of email and filter it again to start mining. Maybe there is a way to set up the connection to take mails from a file on the local disk?
    Thanks

    Andrea
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Ok, good point. Then let's return to the CSV file exported from Thunderbird. I can't find a downloadable CSV file at the link you provided. Can you post some sample data? Where does Read CSV fail?

    Best, Marius
  • Andrea_g Member Posts: 3 Contributor I
    Hi Marius,

    Just download and uncompress any of the files in gzip format, for example: http://lists.gforge.inria.fr/pipermail/ecm-discuss/2012-March.txt.gz
    Import it into a new Thunderbird folder.
    Install this extension/add-on: ImportExportTools
    Right click in the folder, Import/Export, Export all messages in the folder, Spreadsheet (CSV)

    Let me know if you cannot reproduce the problem.
    Cheers,

    Andrea
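
A possible shortcut around the Thunderbird export, sketched here in Python rather than RapidMiner: if the uncompressed archive is a plain mbox file (an assumption, not verified against this list), the standard-library `mailbox` module can read it from local disk and count words directly. The function names `body_text` and `word_frequencies` and the simple `[a-z]+` tokenizer are illustrative choices, not part of any RapidMiner workflow.

```python
import mailbox
import re
from collections import Counter

# Very simple tokenizer: runs of ASCII letters only (an illustrative choice).
WORD = re.compile(r"[a-z]+")


def body_text(message):
    """Return the decoded text/plain parts of one message as a single string."""
    parts = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the header: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def word_frequencies(mbox_path):
    """Count lowercase words over the bodies of all messages in an mbox file."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            counts.update(WORD.findall(body_text(message).lower()))
    finally:
        box.close()
    return counts
```

Calling `word_frequencies("2012-March.txt").most_common(20)` on the uncompressed file would list the twenty most frequent words. Headers are not counted, and quoted replies and signatures are, so some filtering would probably still be needed.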
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    the problem is that RapidMiner reads CSV files line by line. If a field contains line breaks, they are not handled as part of the field, even if the field is quoted. MS Excel seems to have the same problem. What I could do was:
    1. Import the file with OpenOffice
    2. Save it as MS Excel file
    3. Import the xls file with RapidMiner

    This worked for an exported folder of my own mailbox. I don't know however if that is scriptable for a huge number of files.
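
On the scripting question, one option outside RapidMiner is to flatten the export before importing it. The following is a minimal sketch using Python's standard `csv` module, which does treat line breaks inside quoted fields as part of the field. It assumes the exported file is UTF-8 and uses standard double-quote quoting, and the function names are hypothetical. Each record is rewritten onto a single line, which a line-by-line reader should then be able to handle.

```python
import csv


def flatten_csv(src, dst):
    """Copy CSV rows from src to dst, one physical line per record.

    Every run of whitespace inside a field, including line breaks,
    is collapsed into a single space. Returns the number of rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        writer.writerow([" ".join(field.split()) for field in row])
        rows += 1
    return rows


def flatten_csv_file(src_path, dst_path, encoding="utf-8"):
    """File-based wrapper; newline="" lets the csv module see raw line breaks."""
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        return flatten_csv(src, dst)
```

Looping `flatten_csv_file` over a directory of exports would cover the many-files case. Note that collapsing whitespace also discards paragraph structure in the message bodies, which should not matter for word frequencies.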

    Happy Mining!
    ~Marius