Text Processing - Cut Document - Similar entries separated by a number

exmenace Member Posts: 3 Newbie
I'm having trouble finding the operator documentation for string matching or for cutting documents in general.
I have several types of documents (.xml, .csv, .docx, .html) that list records in order, separated by *Record (n)* markers numbered in ascending order, starting with 1.
Each record has similar attributes, but the content is unformatted apart from the asterisks (*) that separate the records and attributes.
I was hoping to cut the document by record, which I assumed I could do with a string-matching query. I'm not sure how to do that, though, when every record is different and the only thing they have in common is the record marker. Its number varies, so I don't know how to write that expression.

Best Answer

  • kayman Member Posts: 662 Unicorn
    Solution Accepted
    Is each record on its own line, like this:
    Record 1*something*something else*and again something else
    Record 2*something*something else*and again something else
    Record 3*something*something else*and again something else

    Or is everything on one line, more like this:

    Record 1*something*something else*and again something else*Record 2*something*something else*and again something else*Record 3*something*something else*and again something else

    In the first case you could simply use the Read CSV operator with * as the separator. Beware that * is a special character that needs to be escaped, so to use it correctly you need to enter \* instead of just *.

    You could also use the Split operator; the same applies there. Use \* to make clear that you want to split on the literal asterisk.

    If everything is on one line, I recommend the Split Document into Collection operator from the Toolbox extension.

    I've attached some samples to play around with, hope they get you started.
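
For readers who want to try the idea outside RapidMiner, here is a minimal Python sketch of the same two steps. It is not the attached sample process. It assumes that record markers look like `Record <n>` and that `*` separates the fields; the function name `split_records` is made up for illustration. The record number is matched with `\d+`, so it does not need to be known in advance, and the asterisk is escaped as `\*` for the same reason as in the operators above. Splitting on a zero-width lookahead needs Python 3.7 or later.

```python
import re


def split_records(text):
    """Cut text into records at each 'Record <n>' marker, then split fields on '*'."""
    # Zero-width lookahead: split just before every "Record <digits>", so the
    # marker stays attached to its own record. \d+ matches any record number.
    chunks = re.split(r"(?=Record \d+)", text)

    records = []
    for chunk in chunks:
        # Drop surrounding whitespace/newlines and the dangling separators.
        chunk = chunk.strip().strip("*")
        if not chunk:
            continue
        # \* is the escaped, literal asterisk (a bare * is a regex quantifier).
        fields = [f.strip() for f in re.split(r"\*", chunk) if f.strip()]
        records.append(fields)
    return records


if __name__ == "__main__":
    one_line = "Record 1*something*something else*Record 2*foo*bar"
    for record in split_records(one_line):
        print(record)
```

The same function handles the one-record-per-line layout, because the newline between records is stripped from each chunk. A caveat of this sketch: if the phrase "Record 5" also appears inside a field's text, it would be treated as a marker too.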


Answers

  • exmenace Member Posts: 3 Newbie
    Thanks for the recommendations, I will check them out. I found a way to rewrite the CSV files so that the info isn't in separate rows. In the CSV and Excel files, all the info is in one column, with each category on a different row, something like this:

    *Record 1:*
    *Title:*
    The Blah blah of blah
    *Author:*
    M. Blah
    *Keywords:*
    Blah, blah, blah

    And so on, including a row with a lot of text (the abstract). My other program rewrites the file so that all the info for each record ends up in one cell, and then I have a process that will separate that into a more manageable table (hopefully).

    I will check out your solution and compare the two, because I will still have to analyze the data after all this. Any other input is appreciated.
  • kayman Member Posts: 662 Unicorn
    Ah, I see. Having everything in one row isn't a real problem; it's just a bit more complex. Split Document should still do the trick: look for \*Record, put a special string in front of it, and split on that string. You then have a small document for every record, which you can split again on line breaks, or you can use a transpose to turn the rows into attributes.

    Is the number of rows the same each time? In your example, for instance, 7 rows for one record, then the next 7 for a new record, and so on?

    If so, you could also use a loop and, with modulo logic, filter out 7 rows at every 7th entry. That sounds far more complex than needed, by the way :-)
  • exmenace Member Posts: 3 Newbie
    I've definitely overcomplicated the whole thing. Thanks again
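
As a rough illustration of the two approaches discussed in this thread, here is a Python sketch for the one-column layout (`*Record 1:*`, `*Title:*`, value, and so on). It is an assumption-laden example, not a RapidMiner process: the function names `parse_rows` and `group_fixed` are hypothetical, label rows are assumed to be wrapped in asterisks, and any other row is treated as the value of the most recent label. `parse_rows` cuts on the record markers; `group_fixed` shows the modulo idea, which only works when every record has the same number of rows.

```python
import re

# "*Record 12:*" -> captures the record number, whatever it is.
RECORD_RE = re.compile(r"^\*Record\s+(\d+):?\*$")
# "*Title:*" -> captures the label without the asterisks and trailing colon.
LABEL_RE = re.compile(r"^\*([^*]+?):?\*$")


def parse_rows(rows):
    """Turn marker/label/value rows into one dict per record."""
    records = []
    current = None  # dict for the record being filled
    label = None    # most recent attribute label
    for row in rows:
        row = row.strip()
        if not row:
            continue
        match = RECORD_RE.match(row)
        if match:
            # A record marker starts a new record.
            current = {"Record": int(match.group(1))}
            records.append(current)
            label = None
            continue
        match = LABEL_RE.match(row)
        if match and current is not None:
            label = match.group(1)
            current[label] = ""
            continue
        if current is not None and label is not None:
            # Value rows are appended, so a long abstract spread over
            # several rows stays with its label.
            current[label] = (current[label] + " " + row).strip()
    return records


def group_fixed(rows, size):
    """Modulo variant: start a new group at every size-th row."""
    groups = []
    for index, row in enumerate(rows):
        if index % size == 0:
            groups.append([])
        groups[-1].append(row)
    return groups


if __name__ == "__main__":
    sample = [
        "*Record 1:*", "*Title:*", "The Blah blah of blah",
        "*Author:*", "M. Blah", "*Keywords:*", "Blah, blah, blah",
    ]
    print(parse_rows(sample))
    print(group_fixed(sample, 7))
```

The marker-based version is the more forgiving of the two, since it still works when a record is missing an attribute or has a value that spans several rows.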