PDF encoding issue

limegreenman900 Member Posts: 26 Contributor I
edited November 2018 in Help

Hi everyone,

I was trying to do the simplest thing possible: reading a PDF file into RM. I have done this several times before, but now I am stuck with what I suspect is an encoding issue.

After the "Read Document" operator (with "extract text only" and "use file extension as type" both checked), I inserted a breakpoint before the text preprocessing. However, I don't get any text out of my PDF. What I get instead is something like:


¨ÉøC&13#s$ó/Y¢¬–¬³ÙÜìâì=ÙOsbsúrnåºç&#26;sOæ1óŠòvç=Ë�ËïÏŸ\ä»hÙ¢ó&#5;Ö&#5;ê‚#…¤Â¼Â�…³‹ã&#23;oZ<]&#20;TÔUt}‰`IÃ’sK­—V-ý¤˜Y,+>TB(É/ÙSòƒ,]6*›-•–¾W:#—È7Ë&#31;*¢&#21;&#3;Š&#7;Ê&#8;e¿ò^YDY&#127;Ù}U„j£êAyTù`ù#µD=¬þ¶"©b{ųÊôÊ&#15;+&#127;¬Ê¯: !kJ4Gµ&#28;m¥ötµ}uCõ%�—®K7Y&#19;V³©fFŸ¢ßY&#11;Õ.©=bàá?S&#23;ŒîÆ•Æ©ºÈº‘ºçõyõ‡&#26;Ø
Ú†&#11;�ž�k&#26;ï5%4ý¦&#25;m–7Ÿlqlio™Z&#22;³lG+ÔZÚz²Í¹­³mzyâò]íÔöÊö?uøuôw|¿"&#127;űN»Îå�wW&®ÜÛe֥ﺱ*|ÕöÕèjõê‰5&#1;k¶¬yÝ­èþ¢Ç¯g°ç‡^yï&#23;kEk‡Öþ¸®lÝD_p߶õÄõÚõ×7DmØÕÏîoê¿»1mãá&#1;l {àûMÅ›Î
&#6;&#14;nßLÝlÜ<9”úO

Does anyone have an idea where the problem is? I suspect it is an encoding issue.

If I open the PDF and copy and paste the text into a Word document, there is no problem and the text is displayed correctly.
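
One rough way to narrow this down: the sample above contains control characters (for example &#26; and &#5;). Text that was merely decoded with the wrong ASCII-compatible character set rarely contains those, while compressed or binary data read as text often does. The Python sketch below only illustrates that idea and says nothing about how RapidMiner works internally; `control_char_ratio` is a made-up helper, and the zlib-compressed string is just a stand-in for the compressed streams a PDF typically holds.

```python
import zlib


def control_char_ratio(s: str) -> float:
    """Fraction of characters that are C0 control characters,
    not counting tab, newline and carriage return."""
    if not s:
        return 0.0
    bad = sum(1 for ch in s if ord(ch) < 32 and ch not in "\t\n\r")
    return bad / len(s)


# Hypothetical sample text, not taken from the PDF in question.
plain = "The quick brown fox jumps over the lazy dog. " * 20

# Compress the text, then look at the compressed bytes as if they were text.
# latin-1 maps every byte to exactly one character, so nothing is lost.
compressed = zlib.compress(plain.encode("utf-8"))
as_text = compressed.decode("latin-1")

print("plain text ratio:     ", control_char_ratio(plain))
print("compressed-as-text:   ", control_char_ratio(as_text))

# Picking a different character set cannot recover the text;
# only decompressing the original bytes does.
recovered = zlib.decompress(as_text.encode("latin-1")).decode("utf-8")
print("recovered equals plain:", recovered == plain)
```

If the extracted document shows a noticeable share of control characters, the reader may be returning raw file content, in which case switching the encoding parameter alone is unlikely to help.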

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You can change the encoding on the "Read Document" operator. Enable the advanced settings and an additional parameter will show up in the parameter window, where you can change the encoding.

  • limegreenman900 Member Posts: 26 Contributor I

    I am working with RM 5.3, where the "Read Document" operator shows the encoding set to "System" by default. That should automatically match the correct encoding, right?

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,107 RM Data Scientist

    Hi,

    Usually it does. If you have a UTF-encoded file on a Windows machine, though, it might not work, so I would give it a try with UTF-8.

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
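
To illustrate the point about a UTF file on a Windows machine: the Python sketch below decodes the same UTF-8 bytes once with cp1252 and once with UTF-8. It assumes cp1252 as the Windows "System" default, which depends on the locale, and the sample string and the `decode_with` helper are hypothetical. It shows the general effect only, not what the RapidMiner operator does internally.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding; undecodable bytes
    are replaced instead of raising an error."""
    return raw.decode(encoding, errors="replace")


# Hypothetical sample text with a few non-ASCII characters.
text = "Größe: 5 µm"
raw = text.encode("utf-8")

# Assumption: the system default on the Windows machine is cp1252.
as_cp1252 = decode_with(raw, "cp1252")
as_utf8 = decode_with(raw, "utf-8")

print("decoded as cp1252:", as_cp1252)
print("decoded as UTF-8: ", as_utf8)
```

With a mismatch like this, the ASCII letters survive and only the accented or special characters turn into two-character sequences, so the result is still mostly readable. Output that is unreadable from start to finish points to something other than a plain character-set mismatch.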
  • limegreenman900 Member Posts: 26 Contributor I

    @mschmitz: I gave it a try with UTF, but it didn't work. I'll figure out another way; somehow it has to work.

    Nevertheless, thanks for your help.