"About the samples in text mining plugin 4.4"

tigy Member Posts: 3 Contributor I
edited May 2019 in Help
1. There seems to be no folder called ../data/log_files in the sample folder.

2. Currently, the only config file for LogFileSource is the one for the Apache server.

Can anyone here kindly provide a config file for Windows IIS logs?

The log I obtained has the format shown below:
IP - - [DD/MMM/YYYY:00:00:01 +0800] "GET /XX/XXXXX/index.html HTTP/1.0" 200 82658 "http://www.xx.xxx.xx/xx/xxx/xxx/xxxxx/xxxxxx/xxxxx/xxxxx/index.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.5))"

How should I modify the config file? Please help me.
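
For readers who want to check what such a line contains before writing a config, here is a minimal Python sketch that splits a line of the shape shown above into named parts. The regular expression, the group names and the sample values (IP address, date, URLs) are assumptions made up for illustration; none of this comes from the plugin or from polliwog.

```python
import re

# Hypothetical pattern for a line shaped like the sample in this post:
# host, two placeholder fields, [timestamp], "request", status, size,
# "referer", "user agent".
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


# Made-up values standing in for the anonymised parts of the sample line.
sample = (
    '10.0.0.1 - - [01/Jan/2008:00:00:01 +0800] '
    '"GET /xx/index.html HTTP/1.0" 200 82658 '
    '"http://www.example.com/index.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.5))"'
)
fields = parse_line(sample)
```

If `parse_line` returns `None` for real lines, the log is probably in a different layout than the one pasted here.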


    land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    1. That seems to be correct.
    2. That's correct, too. But if you want a config file for IIS, you could take a look at this link: http://polliwog.sourceforge.net. The format is described there. It probably won't be too complicated to write your own config file.


    PS: If you want to save other users from having to do the same, you could post your config file here. We will then include it in the text plugin.

    awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    I looked at the polliwog site and I believe this is the config file required for IIS.

    Copyright 2005 - Gary Bentley

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.


      This log format models the W3C Extended Log File format as used by IIS. It models
      only the fields defined by: http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/676400bc-8969-4aa7-851a-9319490a9bbb.mspx
      (Why can't M$ ever do short URLs...)

      Basically the following fields are included:

        Date (date)
        Time (time)
        Client IP Address (c-ip)
        User Name (cs-username)
        Server IP Address (s-ip)
        Server Port (s-port)
        Method (cs-method)
        URI Stem (cs-uri-stem)
        URI Query (cs-uri-query)
        HTTP Status (sc-status)
        Bytes Sent (sc-bytes)
        User Agent cs(User-Agent)
        Referer cs(Referer) // This has been added since it is unlikely that it won't be present!

      Note the order of the field elements IS important.  The fields are read in and the log entry
      processed by getting each field to "consume" the part that it handles.  The remainder of the
      entry is then passed to the next field.
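
The "consume and pass on the remainder" idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not polliwog's implementation: the function names, the field table and the sample log line are all made up, and real fields do more than split on spaces.

```python
def consume_tokens(rest, count=1):
    """Consume `count` space-separated tokens and return (value, remainder)."""
    parts = rest.split(" ", count)
    value = " ".join(parts[:count])
    remainder = parts[count] if len(parts) > count else ""
    return value, remainder


# (name, token count, blank) in the order the fields appear in the log line.
# blank=True means the text is consumed but not stored in the entry.
FIELDS = [
    ("datetime", 2, False),
    ("client_ip", 1, False),
    ("username", 1, True),
    ("server_ip", 1, True),
    ("server_port", 1, True),
    ("request", 3, False),
    ("status", 1, False),
    ("bytes", 1, False),
    ("user_agent", 1, False),
    ("referer", 1, False),
]


def parse_entry(line):
    """Let each field consume its part and hand the rest to the next field."""
    entry, rest = {}, line
    for name, count, blank in FIELDS:
        value, rest = consume_tokens(rest, count)
        if not blank:
            entry[name] = value
    return entry


# Made-up W3C-style line with the fields in the order listed above.
line = ("2008-01-01 00:00:01 10.0.0.1 - 192.168.0.1 80 GET /index.html - "
        "200 82658 Mozilla/4.0+(compatible;+MSIE+6.0) http://www.example.com/")
entry = parse_entry(line)
```

Swapping two rows in `FIELDS` makes every later value land under the wrong name, which is why the order of the field elements matters.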

        Date/time of the entry. (Date Time)

        Note:  If your log file is in a language OTHER THAN English then you should modify the "locale" param value below.
        Usually, if you are using Apache then the log file will be written (especially the dates) in English.
        The value should have 2 parts, separated by "/".
        The first part is the "language" (one of the constants defined in: http://www.loc.gov/standards/iso639-2/englangn.html, from the 639-1 column ONLY).
        The second part is the "country" (one of the constants defined in: http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html).
        For example: "en/US" or "fr/FR".  Only change this value IF your log file is written in a language other than English.

      A token count of 2 is needed here so that both the date and time fields can be concatenated together to form the date/time needed by the field.
      <field class="org.polliwog.fields.DateTimeField"
            tokenCount="2">
        <param id="locale"
              value="en/US" />
        <param id="format"
              value="yyyy-MM-dd HH:mm:ss" />
      </field>
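
To make the "token count of 2" point concrete, here is a small Python sketch that joins a date token and a time token and parses the result. The function name and the sample values are made up; the `strptime` pattern is my Python translation of the Java-style "yyyy-MM-dd HH:mm:ss" format used in the config, not something polliwog itself uses.

```python
from datetime import datetime

# Python strptime equivalent of the Java-style pattern "yyyy-MM-dd HH:mm:ss".
W3C_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_w3c_datetime(date_token, time_token):
    """Join the date and time tokens with a space, then parse them as one value."""
    return datetime.strptime(f"{date_token} {time_token}", W3C_FORMAT)


# Made-up sample values for the two tokens.
stamp = parse_w3c_datetime("2008-01-01", "00:00:01")
```

Neither token can be parsed with this format on its own, so both have to be read before the field can build its value.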

        The hostname part of the log file.  (Client IP Address)
      <field class="org.polliwog.fields.HostnameField" />

        A blank field used to "skip" that part of the line. (Username)
      <field blank="true" />

        The server ip address part of the log file.  (Server IP Address)
        Whilst the field is parsed and built, the "blank" indicates to the processor
        that it shouldn't try and add this field to the log entry.
      <field class="org.polliwog.fields.HostnameField"
            blank="true" />

        The server port.  (Server Port)
        Whilst the field is parsed and built, the "blank" indicates to the processor
        that it shouldn't try and add this field to the log entry.
      <field class="org.polliwog.fields.SizeField"
            blank="true" />

        The request method and uri requested (the request line).
        Use a custom field here that will do the conversion from the log format to
        that required by the Hit class. (Method URI Stem URI Query)

        Need token count of 3 to concatenate the 3 fields together which is needed by the request line field.
      <field class="org.polliwog.fields.W3CRequestLineField"
            tokenCount="3" />
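
As a rough picture of what concatenating the 3 tokens could produce, here is a Python sketch that turns method, URI stem and URI query into an Apache-style request line. How W3CRequestLineField really does the conversion is not shown in this thread, so the function name, the handling of "-" as an empty query and the default protocol are all assumptions.

```python
def build_request_line(method, uri_stem, uri_query, protocol="HTTP/1.0"):
    """Combine the three W3C tokens into an Apache-style request line.

    Assumptions: "-" stands for an empty query, and the protocol is not
    among the three tokens, so a default value is supplied here.
    """
    uri = uri_stem if uri_query in ("", "-") else f"{uri_stem}?{uri_query}"
    return f"{method} {uri} {protocol}"


# Made-up sample tokens.
plain = build_request_line("GET", "/index.html", "-")
with_query = build_request_line("GET", "/index.html", "id=3")
```

With the made-up tokens above, `plain` has no query part and `with_query` carries `?id=3` after the stem.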

        The status code returned by the web server. (HTTP Status)
      <field class="org.polliwog.fields.StatusCodeField" />

        The size of the returned document. (Bytes Sent)
      <field class="org.polliwog.fields.SizeField" />

        The request header, i.e. what did the browser/search engine announce itself as.
        (User Agent)
      <field class="org.polliwog.fields.RequestHeaderField">
        <param id="type"
              value="user-agent" />
      </field>

        The referer page. (Referer)
        Need to indicate that the field should be decoded.
      <field class="org.polliwog.fields.RefererHeaderField" />


