A noob question about Mining a mailing list

hocquet Member Posts: 2 Contributor I
Dear Miners,

I am a total noob in the field of data mining. I have googled around over the past few days for definitions and possibilities, and here is my project (and question).

As part of a "digital humanities" research project, I would like to use an electronic mailing list archive to conduct an analysis in three parts:

1) a text mining project: twenty years of email exchanges may hold valuable information. I have no doubt that RapidMiner can do that.

2) an "evolution of topics over time" project: is the "time series extension" (or another RapidMiner function) a way to extract information and plot it over time?

3) a "social network" analysis: in the same way that Twitter or Facebook mapping can be done, would it be possible to show relations between participants in the mailing list?

Is an electronic mailing list an easy (and well-known) corpus to extract information from in the three respects mentioned above? (I could not find anything on Google about mining mailing lists with RapidMiner.)

My target mailing list is available as an archive on the web (either as downloadable text files or as HTML pages):


Is processing this kind of corpus easy? Has it already been done elsewhere?

Thanks for your comments, and apologies for my English,

Alexandre Hocquet


    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    "Is processing of this kind of corpus something easy ? something already done elsewhere ?"
    Well, it's of course possible. The text mining part might be the most straightforward one: if you already have some topics, you might want to classify the rest of the messages, or maybe the detection and development of new topics might be of interest. Or relations between users and what they talk about. Or topic extraction. Or...

    For the "evolution over time" part you will probably not need the value series extension. Just divide the data into (possibly overlapping) parts along the timeline and automatically compare the results for each time frame.
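
The windowing idea above can be sketched outside RapidMiner in a few lines of plain Python. This is only a rough illustration, not a RapidMiner process: the toy `messages` list, the function names, and the two-year window with a one-year step are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical toy corpus: (date sent, message text) pairs.
messages = [
    (date(1995, 3, 1), "mining data with new software"),
    (date(1996, 6, 15), "software release and data formats"),
    (date(1997, 1, 20), "network analysis of the list"),
    (date(1998, 9, 5), "network software and mining"),
]

def overlapping_windows(first_year, last_year, size=2, step=1):
    """Yield inclusive (start, end) year ranges; they overlap when step < size."""
    start = first_year
    while start + size - 1 <= last_year:
        yield start, start + size - 1
        start += step

def term_counts(msgs, start, end):
    """Count lower-cased whitespace tokens in messages dated within [start, end]."""
    counts = Counter()
    for sent, text in msgs:
        if start <= sent.year <= end:
            counts.update(text.lower().split())
    return counts

windows = list(overlapping_windows(1995, 1998))
counts_by_window = {w: term_counts(messages, *w) for w in windows}

# Compare the most frequent terms from one time frame to the next.
for window, counts in counts_by_window.items():
    print(window, counts.most_common(3))
```

A real corpus would need proper tokenization and stop-word removal before the per-window counts become meaningful.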

    For the social network analysis, RapidNet might be the best option. We have actually combined RapidNet with RapidMiner / RapidAnalytics in customer projects, but this might be hard to do without experience and with the community editions only...
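
Independently of RapidNet, the raw material for such a network is usually a list of "who replied to whom" edges. The sketch below shows one possible way to derive them in plain Python; the toy `messages` tuples and the `reply_edges` helper are hypothetical, and it assumes each message records the id of the message it replies to.

```python
from collections import Counter

# Hypothetical messages: (message id, author, id of the message replied to, or None).
messages = [
    ("m1", "alice", None),
    ("m2", "bob", "m1"),
    ("m3", "carol", "m1"),
    ("m4", "alice", "m2"),
    ("m5", "bob", "m4"),
]

def reply_edges(msgs):
    """Count directed (replier, original author) pairs, ignoring self-replies."""
    author_of = {mid: author for mid, author, _ in msgs}
    edges = Counter()
    for _, author, parent in msgs:
        parent_author = author_of.get(parent)
        if parent_author is not None and parent_author != author:
            edges[(author, parent_author)] += 1
    return edges

edges = reply_edges(messages)
for (replier, original), weight in sorted(edges.items()):
    print(f"{replier} -> {original}: {weight}")
```

The resulting weighted edge list could then be exported to whichever network analysis or visualization tool is available.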

    hocquet Member Posts: 2 Contributor I
    Thank you very much, Ingo, for your reply. In the meantime, I have begun to tinker with RapidMiner and managed to get word occurrences over time by dividing my corpus yearly, as you suggested.

    I'll now try to get into the text mining part. Maybe, with a few modifications, I could use tools that treat my corpus as emails (with author, subject, date...) instead of pure text, if specific processing tools for emails exist.

    I'll give it a shot.
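
As one illustration of treating the corpus as emails rather than pure text, Python's standard `email` package can split a raw message into header fields and body. This is a minimal sketch that assumes the archive messages carry standard email headers; the sample message and the `record` field names are made up for the example.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical raw message, invented for illustration.
raw = """\
From: Jane Doe <jane@example.org>
Subject: Re: file formats
Date: Tue, 14 Mar 1995 10:30:00 +0000
Message-ID: <2@example.org>
In-Reply-To: <1@example.org>

I think plain text archives are easier to process.
"""

msg = message_from_string(raw)

# Pull out the structured fields alongside the free text.
record = {
    "author": parseaddr(msg["From"])[1],          # address part of the From header
    "subject": msg["Subject"],
    "date": parsedate_to_datetime(msg["Date"]),   # timezone-aware datetime
    "in_reply_to": msg["In-Reply-To"],
    "body": msg.get_payload().strip(),            # plain string for non-multipart mail
}
print(record)
```

If the downloadable text files turn out to be in mbox format, the standard `mailbox` module can iterate over the messages in each file; HTML archive pages would need a separate extraction step first.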

