Utilizing web page character encoding

colocolo Member Posts: 236 Maven
edited November 2018 in Help
Hi,

I have a few questions regarding character encoding, especially for the text processing and web operators.
It is possible to set a specific encoding for documents if you read them via "Read Document", for example. Other operators like "Create Document" don't offer this parameter (maybe the encoding setting of the process is used for these?). Since I'm dealing with information extraction from the web, the operator "Get Page" is very important for me. This one doesn't allow specific encoding settings either. I would assume that the encoding header from the HTTP protocol is used for the received text content. But what if there is no encoding specified or the content is zipped (as in my current example, where the header just says: Content-Encoding: gzip)?
The page is UTF-8 encoded, which is correctly indicated by the Content-Type meta tag inside the HTML head. Will this information be considered when the document content is stored? In my case it does not seem so, because special characters (such as umlauts) are not displayed properly. If I store this document to a file and read it via "Read Document" (encoding set to UTF-8), everything is fine. But this seems a bit complicated when the needed information was already at hand when the content was read for the first time.
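
To make the expected behaviour concrete, here is a minimal sketch in plain Python (outside RapidMiner, and not a description of how "Get Page" works internally). Note that Content-Encoding: gzip only describes compression; the character set, if the server sends one, travels in the Content-Type header. The sketch decompresses the body if needed, then looks for a charset in the Content-Type header, then in the HTML meta tag, and finally falls back to a default. The function names, the plain dict of headers with exactly these key spellings, the 4096-byte scan limit and the ISO-8859-1 default are all assumptions made for illustration.

```python
import gzip
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None if absent


def charset_from_meta(body, scan_limit=4096):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    match = META_CHARSET.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(raw, headers, default="iso-8859-1"):
    """Decompress (if gzipped) and decode a response body to text.

    `headers` is assumed to be a plain dict with exactly these key spellings.
    """
    if headers.get("Content-Encoding", "").lower() == "gzip":
        raw = gzip.decompress(raw)
    charset = (
        charset_from_header(headers.get("Content-Type"))
        or charset_from_meta(raw)
        or default
    )
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # the declared charset is not a known codec
        return raw.decode(default, errors="replace")
```

With a gzipped UTF-8 page whose header is only `Content-Type: text/html`, the meta tag decides the charset in this sketch, which is the behaviour I was hoping for.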

This leads me to the following questions:

- Is there a way to respect the Content-Type meta information from the HTML source when retrieving a page?
- Would it be possible to add an optional parameter that allows setting a desired encoding for the "Get Page" operator?
- Is there a way to set/change specific encodings for (existing) documents?
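
For the last question, this is roughly what I mean by changing the encoding of an existing document, again as a plain Python sketch rather than anything RapidMiner offers. It assumes the text was UTF-8 but got decoded as ISO-8859-1, so re-encoding with the wrong codec recovers the original bytes, which can then be decoded correctly. The function name and the choice of ISO-8859-1 as the wrongly applied codec are assumptions; if a different codec was applied, it would have to be passed in instead.

```python
def repair_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a decode with the wrong codec by round-tripping through bytes.

    Returns the text unchanged if it does not look mis-decoded.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong`, or the recovered
        # bytes are not valid `right` data, so leave the text alone.
        return text


# Simulate the problem: UTF-8 bytes decoded as ISO-8859-1.
garbled = "Grüße".encode("utf-8").decode("iso-8859-1")
print(repair_mojibake(garbled))
```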

Any hints would be much appreciated.

Best regards,
Matthias

Answers

    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    we encountered exactly the same issues as you, but unfortunately there is nothing you can do about it currently. Of course we could add such parameters, and we probably will at some point, but until then you are stuck.
    And again, being an enterprise customer would have been handy, wouldn't it? :)

    Greetings,
    Sebastian