
"Processing multiple xml files for tf-idf"

Ruca Member Posts: 13 Contributor II
edited June 2019 in Help
Hi all,

I have an issue with processing several news articles that are spread across multiple XML files.
The XML files have the following structure:

article_set1.xml
<article_set>
<article id="1">
 <article_text>...</article_text>
</article>

<article id="2">
 <article_text>...</article_text>
</article>
</article_set>


article_set2.xml
<article_set>
<article id="10">
 <article_text>...</article_text>
</article>

<article id="11">
 <article_text>...</article_text>
</article>
</article_set>

This means that each XML file contains different articles to be processed. Each article must be treated as a separate document when computing tf-idf.
My first attempt was to use the "Read XML" operator connected to "Process Documents from Data". It works fine, but it can only process one XML file.
My second attempt was to put a "Loop Files" operator at the beginning of the process. With this approach, a separate tf-idf vector is created for each XML file processed.
My third attempt used only "Process Documents from Files" and processed the XML files internally. This approach treats each XML file as a single document.
My objective is for each article id to be treated as a separate document, even when multiple XML files need to be processed.

Any guidance on this issue is more than welcome.

Thank you for your support.

Regards,


Ruca
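
For readers who want to check the intended result outside RapidMiner, here is a minimal Python sketch of the goal described above: every `<article>` element becomes its own document, regardless of which file it came from. The sample contents and the helper name `extract_articles` are hypothetical, the two XML strings merely stand in for article_set1.xml and article_set2.xml, and the `id` attributes are quoted because a strict XML parser requires that.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-ins for article_set1.xml and article_set2.xml.
# The article texts are invented for illustration only.
SAMPLE_FILES = {
    "article_set1.xml": """<article_set>
<article id="1"><article_text>storm warning issued</article_text></article>
<article id="2"><article_text>storm passes coast</article_text></article>
</article_set>""",
    "article_set2.xml": """<article_set>
<article id="10"><article_text>market rally continues</article_text></article>
<article id="11"><article_text>storm hits market</article_text></article>
</article_set>""",
}


def extract_articles(xml_by_name):
    """Return one (article_id, text) pair per <article>, across all files."""
    documents = []
    for name in sorted(xml_by_name):
        root = ET.fromstring(xml_by_name[name])
        for article in root.iter("article"):
            text = article.findtext("article_text", default="")
            documents.append((article.get("id"), text.strip()))
    return documents


documents = extract_articles(SAMPLE_FILES)
for article_id, text in documents:
    print(article_id, text)
```

The resulting list has one entry per article id rather than one per file, which is the granularity the tf-idf step should see.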

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Ruca,

    If you use Process Documents from Files, you can split a file into its subdocuments via Split Document.
    If you use the Loop Files operator, you can use Read XML inside the loop, append the resulting data, and use Process Documents from Data after the loop, not inside it.

    Does that help?

    Best regards,
    Marius
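
The reason for appending first and vectorising once can be sketched in plain Python: the idf part of tf-idf depends on how many documents in the whole corpus contain a term, so computing it per file gives different weights. This is only an illustration under assumptions: the function `tf_idf`, the sample texts, and the simple formula (term frequency times `log(N / df)`, whitespace tokenisation) are hypothetical choices, and RapidMiner's own weighting and normalisation may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights for a list of whitespace-tokenised texts."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical article texts, grouped by the file they were read from.
per_file_articles = [
    ["storm warning issued", "storm passes coast"],
    ["market rally continues", "storm hits market"],
]

# Append the articles of every file first, then vectorise once.
appended = [text for articles in per_file_articles for text in articles]
vectors = tf_idf(appended)

# For comparison: vectorising the first file on its own.
first_file_only = tf_idf(per_file_articles[0])

print(vectors[0]["storm"], first_file_only[0]["storm"])
```

In the per-file run, "storm" appears in every document of the first file, so its idf is `log(2 / 2)` and the weight collapses to zero, while in the appended corpus it keeps a positive weight.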
  • Ruca Member Posts: 13 Contributor II
    Hi Marius,

    Thank you very much for your help. I used the "Loop Files" operator and appended the data, and it works fine!

    My problem now is how to store the results in a MySQL database. Since the number of columns in a MySQL table is limited, I had to transpose the data, which turns the terms into IDs.
    I'm getting two different terms, "el-nino" and "el niño", which should be distinct according to the UTF-8 character set. Since the terms are now IDs, I'm not able to store these rows in a table because MySQL treats them as the same term.
    I had to change the role of the ID column to regular. It works, but I guess it is not the right way to do it.
    Does anyone have another approach for doing this?

    Thank you for your support!

    Regards,

    Ruca
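
One possible explanation for the collision, offered as an assumption rather than a diagnosis, is that the key column uses an accent-insensitive collation, under which "n" and "ñ" compare as equal. The Python sketch below only approximates such a comparison with `unicodedata`; the helper name `accent_insensitive_key` is hypothetical, and the spelling "el nino" is used for illustration (the post writes "el-nino").

```python
import unicodedata


def accent_insensitive_key(term):
    """Approximate an accent- and case-insensitive comparison key."""
    # NFD splits "ñ" into "n" plus a combining tilde, which is then dropped.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


terms = ["el nino", "el niño"]
keys = [accent_insensitive_key(term) for term in terms]

# Under accent folding the two terms map to the same key ...
collide_when_folded = keys[0] == keys[1]
# ... although their UTF-8 byte sequences are different.
distinct_as_bytes = terms[0].encode("utf-8") != terms[1].encode("utf-8")

print(collide_when_folded, distinct_as_bytes)
```

If this is indeed the cause, checking the collation of the term column would be a reasonable next step, since a binary collation such as MySQL's `utf8mb4_bin` compares by code point and would keep the two terms apart.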