byte address / word location for Textual ETL

Wanttoknow Member Posts: 6 Contributor II
edited June 2019 in Help
Hi,

I'm doing fine with the operators currently provided for text processing in RM 5.0 (great work, guys :-*)

However, there is one thing I would like to see during the creation of word vectors from documents: a byte address for each word occurrence, as a key to distinguish one occurrence of a word from another.

This would require a whole new representation of the wordlist, where every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.

This would open up a new range of possibilities such as determining what other words or terms are found in proximity of a certain word/term. This would be of great value to determine the context of documents.

Of course I would be glad to know if this would already be possible with some combination of current operators  ::)
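
To make the request concrete, here is a minimal Python sketch of an occurrence-level word list and a proximity lookup built on it. It is not a RapidMiner operator or API; the names `Occurrence`, `index_occurrences` and `neighbours` are hypothetical, and the sketch assumes simple `\w+` tokenization with byte offsets measured in the UTF-8 encoding of the text.

```python
import re
from typing import List, NamedTuple, Tuple


class Occurrence(NamedTuple):
    word: str         # lower-cased token text
    position: int     # 0-based index of the token in the document
    byte_offset: int  # offset of the token's first byte in the UTF-8 encoding


def index_occurrences(text: str) -> List[Occurrence]:
    """Return one entry per word occurrence instead of one count per word."""
    occurrences = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        # Encode the prefix to turn a character offset into a byte offset.
        # Simple rather than fast; fine for a sketch.
        byte_offset = len(text[:match.start()].encode("utf-8"))
        occurrences.append(Occurrence(match.group().lower(), position, byte_offset))
    return occurrences


def neighbours(occurrences: List[Occurrence], target: str,
               window: int = 2) -> List[Tuple[int, List[str]]]:
    """For each occurrence of `target`, list the words within `window` tokens."""
    target = target.lower()
    result = []
    for occ in occurrences:
        if occ.word != target:
            continue
        lo = max(0, occ.position - window)
        hi = occ.position + window + 1
        context = [o.word for o in occurrences[lo:hi] if o.position != occ.position]
        result.append((occ.byte_offset, context))
    return result


if __name__ == "__main__":
    occs = index_occurrences("Text mining finds text patterns")
    for byte_offset, context in neighbours(occs, "text", window=1):
        print(byte_offset, context)
```

Because each occurrence keeps its own offset, the two occurrences of "text" in the example stay distinguishable, and the surrounding words can be read directly from the list.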

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    By coincidence, this is exactly what we are currently working on. Stay tuned :-)

    Cheers,
    Simon
  • Wanttoknow Member Posts: 6 Contributor II
    Simon,

    Great. Looking forward to it.

    Thanks for your reply
  • TobiasMalbrecht Moderator, Employee, Member Posts: 295 RM Product Management
    Hi,
    Wanttoknow wrote:

    However, there is one thing I would like to see during the creation of word vectors from documents: a byte address for each word occurrence, as a key to distinguish one occurrence of a word from another.

    This would require a whole new representation of the wordlist, where every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.

    This would open up a new range of possibilities such as determining what other words or terms are found in proximity of a certain word/term. This would be of great value to determine the context of documents.

    Of course I would be glad to know if this would already be possible with some combination of current operators  ::)
    No, this is not yet possible, but we are in the initial phase of refactoring the text processing extension. As part of that, the locations of tokens in a document will be kept within the tokens themselves, so that 1) the visualization of documents and token sequences will be improved, and 2) filtering as well as token and attribute construction based on the locations of tokens, co-occurrences within document regions, etc. will become possible.
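
As a rough illustration of what location-aware tokens make possible, the following Python sketch keeps start and end offsets on each token, filters tokens by document region, and counts co-occurrences per region. It is not the design of the text processing extension; `Token`, `tokenize`, `tokens_in_region` and `cooccurrences` are hypothetical names, and treating each sentence as a region is an assumption made only for the example.

```python
import re
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Token:
    text: str   # lower-cased token text
    start: int  # character offset where the token begins
    end: int    # character offset just past the token's last character


def tokenize(document: str) -> List[Token]:
    """Split on word characters and keep each token's location."""
    return [Token(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", document)]


def tokens_in_region(tokens: List[Token], start: int, end: int) -> List[Token]:
    """Keep only the tokens that lie completely inside [start, end)."""
    return [t for t in tokens if t.start >= start and t.end <= end]


def cooccurrences(document: str) -> Counter:
    """Count how many regions (here: sentences) each pair of words shares."""
    tokens = tokenize(document)
    counts: Counter = Counter()
    # Assumption: a region is a run of text between '.', '!' or '?'.
    for region in re.finditer(r"[^.!?]+", document):
        words = sorted({t.text for t in
                        tokens_in_region(tokens, region.start(), region.end())})
        counts.update(combinations(words, 2))
    return counts


if __name__ == "__main__":
    doc = "Data mining is fun. Text mining is fun too."
    for pair, count in cooccurrences(doc).most_common(3):
        print(pair, count)
```

Pairs are stored in alphabetical order, so a lookup such as `cooccurrences(doc)[("fun", "mining")]` does not depend on the order in which the words appear in the text.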

    Apart from that, we have a lot of other ideas concerning the text processing extension, so it will probably take a while until the restructuring is finished. Stay tuned ;-)
    Kind regards,
    Tobias