
byte address / word location for Textual ETL

Wanttoknow Member Posts: 6 Contributor II
edited June 2019 in Help
Hi,

I'm doing fine with the operators currently provided for text processing in RM 5.0 (great work, guys  :-*)

However, there is one feature I would like to see when word vectors are created from documents: the byte address of each word occurrence, as a key to distinguish one occurrence of a word from another.

This would require a whole new representation of the word list, in which every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.

This would open up a new range of possibilities, such as determining which other words or terms are found in the proximity of a certain word/term. This would be of great value for determining the context of documents.

Of course I would be glad to know if this would already be possible with some combination of current operators  ::)
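
To make the request concrete, here is a minimal sketch in plain Python of what I mean. It does not use RapidMiner or any of its operators; the function names `token_occurrences` and `neighbours`, the `\w+` tokenization, and the UTF-8 encoding are all assumptions made purely for illustration. The first function keeps one record per word occurrence (token, word index, byte offset) instead of an aggregated count, and the second uses those positions to list the tokens found near a given term.

```python
import re


def token_occurrences(text, encoding="utf-8"):
    """Return one (token, word_index, byte_offset) record per word occurrence."""
    occurrences = []
    for word_index, match in enumerate(re.finditer(r"\w+", text)):
        # The byte offset is the length of the encoded text preceding the match,
        # so it stays correct for multi-byte characters.
        byte_offset = len(text[:match.start()].encode(encoding))
        occurrences.append((match.group().lower(), word_index, byte_offset))
    return occurrences


def neighbours(occurrences, term, window=2):
    """Map the byte offset of each occurrence of `term` to the tokens
    that lie within `window` word positions of it."""
    result = {}
    for token, idx, offset in occurrences:
        if token != term:
            continue
        lo, hi = idx - window, idx + window
        result[offset] = [
            t for t, i, _ in occurrences if lo <= i <= hi and i != idx
        ]
    return result


if __name__ == "__main__":
    occ = token_occurrences("the cat sat on the mat")
    for record in occ:
        print(record)
    print(neighbours(occ, "the", window=1))
```

Because each occurrence carries its own byte offset, the two occurrences of "the" in the example sentence stay distinguishable, which is exactly what the aggregated count per document loses.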

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    By coincidence, this is exactly what we are currently working on. Stay tuned :-)

    Cheers,
    Simon
  • Wanttoknow Member Posts: 6 Contributor II
    Simon,

    Great. Looking forward to it.

    Thanks for your reply
  • TobiasMalbrecht Moderator, Employee, Member Posts: 295 RM Product Management
    Hi,
    Wanttoknow wrote:

    However, there is one feature I would like to see when word vectors are created from documents: the byte address of each word occurrence, as a key to distinguish one occurrence of a word from another.

    This would require a whole new representation of the word list, in which every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.

    This would open up a new range of possibilities, such as determining which other words or terms are found in the proximity of a certain word/term. This would be of great value for determining the context of documents.

    Of course I would be glad to know if this would already be possible with some combination of current operators  ::)

    No, this is not yet possible, but we are indeed in the initial phase of refactoring the text processing extension. As part of this, the locations of tokens in a document will be kept within the tokens themselves, so that 1) the visualization of documents and token sequences will be improved, and 2) filtering, as well as token and attribute construction based on the locations of tokens, co-occurrences within document regions, etc., will become possible.

    Apart from that, we have a lot of other ideas concerning the text processing extension, so it will probably take a while until the restructuring is finished. Stay tuned ;-)
    Kind regards,
    Tobias