preprocessing: remove email signature

Joos Member Posts: 11 Newbie
I am trying to apply LDA to e-mails. I have the mails in an Excel file. My model works, but I have to find a way to remove the e-mail signatures. Does anyone have experience with this?


Best Answer


  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello @Joos,

    I can only recommend two approaches. The first is to remove everything from the last "--" delimiter line to the end of the message. The second, if you have the recipient of each e-mail, is to trim the messages and keep stripping the last line of each e-mail until no two e-mails end in the same line.

    Neither approach is battle-tested, and both involve processing that I would not have done in RapidMiner but much earlier, while retrieving the e-mails. So you are better off loading your data with Python and removing the e-mail signatures there, I'm afraid.
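
    For what it's worth, the two ideas could be sketched in plain Python roughly as below. This is only an illustration under some assumptions: the function names and the `min_share` threshold are hypothetical, the second function assumes the e-mails have already been grouped per sender (the same person tends to reuse the same footer), and the Dutch sample messages are made up.

```python
import re
from collections import Counter

# Conventional signature delimiter: a line holding only "--" or "-- ".
SIG_DELIMITER = re.compile(r"^--\s*$")


def strip_after_last_delimiter(body: str) -> str:
    """Approach 1: drop everything from the last '--' line to the end."""
    lines = body.rstrip().splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if SIG_DELIMITER.match(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)  # no delimiter found: leave the text alone


def strip_shared_trailing_lines(bodies, min_share=0.8):
    """Approach 2: for e-mails of ONE sender, peel off the last line
    for as long as that line recurs in most of the messages.

    min_share is a hypothetical threshold (fraction of messages that
    must end in the same line); tune it on your own data.
    """
    split = [b.strip().splitlines() for b in bodies]
    while True:
        lasts = Counter(lines[-1].strip() for lines in split if lines)
        if not lasts:
            break
        line, count = lasts.most_common(1)[0]
        if count < 2 or count / len(split) < min_share:
            break  # the last lines now differ: the footer is gone
        for lines in split:
            if lines and lines[-1].strip() == line:
                lines.pop()
                # also drop blank lines left dangling above the footer
                while lines and not lines[-1].strip():
                    lines.pop()
    return ["\n".join(lines) for lines in split]


if __name__ == "__main__":
    print(strip_after_last_delimiter("Hello\n\n-- \nJoos\nSales"))
    mails = [
        "Hoi Jan,\nDe offerte volgt morgen.\n\nMet vriendelijke groet,\nJoos\nAfdeling Verkoop",
        "Beste Piet,\nBedankt voor je bericht.\n\nMet vriendelijke groet,\nJoos\nAfdeling Verkoop",
    ]
    for cleaned in strip_shared_trailing_lines(mails):
        print(repr(cleaned))
```

    The second function does not need to know what the footer looks like or which language it is in, which helps when every sender has a different one. Its weak spot is that e-mails with genuinely identical closing sentences lose those sentences too, so check a sample by hand.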

    All the best,

  • Joos Member Posts: 11 Newbie
    Thank you for your answer. I am not sure I understand your first option: all the footers are different, so I do not know how to recognize them. I did find Python code on GitHub (mailparser). Is it possible to include this as a Python script in my process? Could I call it inside my loop over the individual mails and pass each mail to the Python parser as a document? The Python code would probably need some adjustment to get this working. It would also have to do the parsing in Dutch. Do you have any experience with this?
  • Joos Member Posts: 11 Newbie
    Thanks, Rodrigo... I kind of fixed the issue in Excel with formulas.