How to convert rich text (RTF) into readable text? (for text mining)

duyguduygu Member Posts: 12 Contributor II
edited April 2020 in Help
Hi everyone!

I finally managed to read my database with RapidMiner, but there is another problem. My items look like this:

{\rtf1\ansi\ansicpg1254\deff0{\fonttbl{\f0\fnil\fcharset162 Microsoft Sans Serif;}}
\viewkind4\uc1\pard\lang1055\f0\fs17 6 ayd\'fdr sol kol a\'f0. Boyun a\'f0 az.
G\'fc\'e7s\'fczl\'fck ve a\'f0 dan \'e7ok uyu\'feukluk var. NPBY. Belki C7-8 hipoaljezi.
Torasik \'e7\'fdk\'fd\'fe gibi de\'f0il. Miyofasial a\'f0 gibi. \'d6neriler.+\par \par \par \par \par }

How can I convert this into readable text? I need to do text mining :)


  • Options
    homburghomburg Moderator, Employee, Member Posts: 114 RM Data Scientist
    Hi duygu,

    Have you already installed the Text Mining extension? If so, you will find an operator called "Data to Documents", which can be used to convert an example set to a document object. But to answer your question: currently there is no option to parse RTF code directly in RapidMiner. Maybe you'll find some library or scripting tool you can pipe your data through. To get the text content from your input, you could try filtering out the RTF code via regular expressions (using the "Replace" or "Replace Token" operator) with a search pattern like this:
    \{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?
    Since text mining is a rather complex topic, it may be a good idea to take a closer look at some useful introductory videos. A video which shows how to classify texts dealing with different topics can be found here:
    In addition to that, Neil McGuigan produced a great series of videos dealing with RapidMiner and text mining, which are available via his blog:
    http://vancouverdata.blogspot.de/2010/11/text-analytics-with-rapidminer-loading.html shows the first one of the series.
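
If the data can be piped through a scripting tool, as suggested above, the regular-expression approach might look roughly like the Python sketch below. It is only a sketch under stated assumptions, not a full RTF parser: the function name `rtf_to_text` and the helper patterns `PAR` and `HEX_ESCAPE` are made up for illustration, the code page `cp1254` is taken from `\ansicpg1254` in the sample item, and every match of the pattern from the post is simply replaced with an empty string. Note that the pattern alone leaves hex escapes such as `\'fd` in place, so the sketch decodes them in a separate step.

```python
import re

# Pattern quoted in the post above: removes simple {\...} groups,
# stray braces, and control words such as \pard or \fs17.
RTF_CONTROL = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# Assumption: \par marks a paragraph end (but \pard is a different control word).
PAR = re.compile(r"\\par(?![A-Za-z])")

# Hex escapes such as \'fd encode one byte in the document's code page.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def rtf_to_text(rtf: str, codepage: str = "cp1254") -> str:
    """Strip RTF markup with regular expressions (hypothetical helper)."""
    # Turn paragraph marks into spaces so neighbouring words stay separate.
    text = PAR.sub(" ", rtf)
    # Replace every match of the pattern from the post with nothing.
    text = RTF_CONTROL.sub("", text)
    # Decode the remaining hex escapes using the assumed code page.
    text = HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1)).decode(codepage, errors="replace"),
        text,
    )
    # Collapse runs of whitespace, including line breaks, into single spaces.
    return " ".join(text.split())


sample = (
    r"{\rtf1\ansi\ansicpg1254\deff0{\fonttbl{\f0\fnil\fcharset162 Microsoft Sans Serif;}}" "\n"
    r"\viewkind4\uc1\pard\lang1055\f0\fs17 6 ayd\'fdr sol kol a\'f0. Boyun a\'f0 az." "\n"
    r"\par }"
)
print(rtf_to_text(sample))  # 6 aydır sol kol ağ. Boyun ağ az.
```

Whitespace is collapsed because word tokens matter more than layout for text mining. Documents with nested groups, Unicode escapes, or embedded objects would probably need a dedicated RTF library instead.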

  • Options
    duyguduygu Member Posts: 12 Contributor II
    Yes, I already tried "Data to Documents", but I had never thought about regular expressions. I'm going to try it now.

    Yeah, Neil McGuigan's site is really helpful :D

    Thank you!
  • Options
    duyguduygu Member Posts: 12 Contributor II
    I couldn't do it with a regular expression because I couldn't decide what to replace with the regular expression. So I tried to copy my document into a file (with the WriteDocument operator), but now I can't see all the content in the document. I can only see a few lines even though the document is 27 MB.