How to extract a specific part (section) from a large text (txt format)?

Enthusiast21 Member Posts: 6 Newbie
Dear RM Friends,

I have 500 txt files containing large reports, and I need to extract only one section from each report. The reports all differ slightly; the only common pattern I can recognise is that the section's headline always starts with the same three words, but the rest of the headline varies and the following section is not the same either. My question is how, in general, I can extract part of a large text in RapidMiner. (I think I need to use regular expressions, but so far I could not find anything suitable for my task.)

Thank you very much for your support in advance! :smile:

Answers

  • kayman Member Posts: 662 Unicorn
    Regular expressions are probably what you need indeed. You already know where to start, so the question is where to end. You don't need to limit yourself to words; whitespace can also be a good candidate.

    Are your sections bounded by line breaks, or does your next section start with something that resembles a pattern?
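
The whitespace idea above can be sketched outside RapidMiner in Python. This is only an illustration under assumptions the thread does not confirm: the function name `extract_section` is hypothetical, and the section is assumed to end at the first run of two or more blank lines (or at the end of the text).

```python
import re

def extract_section(text, heading_prefix):
    """Return the text from the first line starting with heading_prefix
    up to the next run of two or more blank lines, or None if not found.
    The 'two blank lines' end boundary is an assumption, not a fact
    about the reports in this thread."""
    pattern = re.compile(
        # Heading line: optional indentation, then the fixed starting words.
        r"^[ \t]*" + re.escape(heading_prefix)
        # Lazily consume text until two consecutive blank lines or end of text.
        + r".*?(?=\n[ \t]*\n[ \t]*\n|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(0).strip() if match else None
```

The same regular expression idea should carry over to any tool with lazy quantifiers and lookahead; only the end boundary needs adapting to the real files.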
  • Enthusiast21 Member Posts: 6 Newbie
    Attached is part of one report containing two instances of the section I need to extract (Independent Auditor's Report), which is another issue: some reports contain two parts I need to extract. In the attached file I also copied the end of the previous section and the beginning of the next one. The next section is always different across the reports, so I can't find a pattern. Each section I need ends with a date, which unfortunately is common to them but not unique, as there are other dates elsewhere in the report.
  • kayman Member Posts: 662 Unicorn
    edited December 2019
    Nice challenge :-)
    So the idea is to first split the content into left and right pages, and then get the section?

    Splitting the page in two is something you can achieve by splitting on string length: basically, the first 70 characters of each line belong to the first page, and characters 70 to 140 belong to the second page. Splitting and then merging gives you both pages in one flow.

    A quick and dirty approach can be found in the attachment.
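
The attached process is not reproduced in the thread, so the following is not kayman's implementation. It is a minimal Python sketch of the same split-by-length idea, assuming each line holds a left page in the first 70 characters and a right page in the next 70; the function name `split_two_column_page` is hypothetical.

```python
def split_two_column_page(text, column_width=70):
    """Split each line at column_width into a left and a right part,
    then return all left parts followed by all right parts, so the two
    pages read as one continuous flow. The default width of 70 comes
    from the post above; real files may need a different value."""
    left_lines, right_lines = [], []
    for line in text.splitlines():
        # Characters 0..column_width-1 belong to the left page.
        left_lines.append(line[:column_width].rstrip())
        # Characters column_width..2*column_width-1 belong to the right page.
        right_lines.append(line[column_width:2 * column_width].rstrip())
    return "\n".join(left_lines + right_lines)
```

Short lines simply produce an empty right part, so the function does not fail on lines that only contain left-page text.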
  • Enthusiast21 Member Posts: 6 Newbie
    Thank you for the solution to the first part of my problem. I'm sorry for the question, but as I am relatively new, may I ask where I should enter the XML code you sent me? I tried the XML panel, but after that I don't know how to make the process appear and then run in RapidMiner.

    About the pattern: I have the beginning, which is Independent Auditor's Report, but I don't know about the end. It's a date, but how do I avoid matching everything that ends somewhere with a date? What other types of pattern can I look for besides words?

    Thank you so much for the support!
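
One possible way to avoid stopping at every date is to require the closing date to stand alone on its own line and to match lazily from the heading. The Python sketch below illustrates that idea only; the date format ("15 March 2019"), the "date alone on a line" rule, and the name `extract_dated_sections` are all assumptions, since the thread does not show the actual reports.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Assumed closing line: a date such as "15 March 2019" alone on its line.
DATE_LINE = r"^[ \t]*\d{1,2} (?:" + MONTHS + r") \d{4}[ \t]*$"

def extract_dated_sections(text, heading_prefix="Independent Auditor's Report"):
    """Return every block that starts at a line beginning with
    heading_prefix and ends at the first following date-only line.
    Dates embedded in running text are skipped because DATE_LINE
    is anchored to a whole line."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(heading_prefix) + r".*?" + DATE_LINE,
        re.MULTILINE | re.DOTALL,
    )
    # finditer collects all occurrences, covering reports with two sections.
    return [m.group(0).strip() for m in pattern.finditer(text)]
```

Because the match is lazy, each section stops at the first date-only line after its heading, and a report containing the section twice yields two list entries.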
  • kayman Member Posts: 662 Unicorn
    Views -> XML -> paste the code, then click the green tick before saving.
  • Enthusiast21 Member Posts: 6 Newbie
    What could I do to remove the error?
  • kayman Member Posts: 662 Unicorn
    Install the Toolbox extension from the Marketplace, or replace the missing operator with the common Append operator.
  • Enthusiast21 Member Posts: 6 Newbie
    Thank you! I did it, but now I have a new problem. Could you help me with it too?
  • kayman Member Posts: 662 Unicorn
    Hmm, there might be more issues with your original file. Could you first verify that it works with the 'for the forum' txt file you provided? That way we can make sure we are working under the same conditions.
    Then try again on your data after changing the encoding of the Decode URLs operator to UTF-8; this could also solve some encoding problems with your original text.


  • Enthusiast21 Member Posts: 6 Newbie
    With the 'for the forum' file it works perfectly. I don't understand why the original one doesn't, as I only copied part of its text into the new txt file that I uploaded here. I tried an online tool to convert it to UTF-8, but the resulting file didn't give any better results. Is there another way to decode the file?
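
Outside RapidMiner, one way to check what encoding a file really uses is to try a few candidates in order. The Python sketch below is only an illustration: the candidate list and the names `decode_with_fallback` and `read_text_with_fallback` are assumptions, and the actual encoding of the original reports is not known from the thread.

```python
def decode_with_fallback(raw, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).
    'utf-8-sig' also strips a byte order mark if one is present.
    'latin-1' accepts any byte sequence, so it acts as a last resort
    and may produce wrong characters rather than an error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Read a file as bytes and decode it with decode_with_fallback."""
    with open(path, "rb") as handle:
        return decode_with_fallback(handle.read(), encodings)
```

Knowing which encoding succeeded can then guide the encoding setting chosen in the process, or the file can be re-saved as UTF-8 from the decoded text.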
  • kayman Member Posts: 662 Unicorn
    Would you mind sharing the full text? You can send it by PM if that's OK for you.