"simple text extraction"

xtraplus Member Posts: 20 Contributor II
edited May 2019 in Help
Hi,

I have one folder (I'll call it "prime" here) containing many subfolders, some of which contain HTML files. I want to read "prime" with the "Process Documents from Files" operator. Inside this operator I use "Extract Information" with the XPath //h:*[contains(.,"@")]/. Basically I want to extract the emails from my files.
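(As an illustrative aside, not part of the original question: the query can be tried outside RapidMiner on a made-up XHTML snippet, for example with Python's lxml. Note that contains(., "@") matches every element whose text content contains "@", including all ancestors of the link.)

from lxml import etree

# Made-up XHTML page; the default namespace is the one bound here to the "h" prefix.
NS = {"h": "http://www.w3.org/1999/xhtml"}
page = etree.fromstring(
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<p>Write to <a href="mailto:abc@abc.com">abc@abc.com</a></p>'
    b'</body></html>')

# //h:*[contains(., "@")] selects every element whose string value contains "@",
# i.e. the <a> tag but also p, body and html, since their text contains it too.
for el in page.xpath('//h:*[contains(., "@")]', namespaces=NS):
    print(etree.QName(el).localname)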

I just give "Process Documents from Files" the path to "prime" as the text directory. Is that correct? I want the process to find the subfolders with the files there.

This is the code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="161" width="279">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="179" y="75">
        <list key="text_directories">
          <parameter key="all" value="C:\Users\Home\Desktop\Sites"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="279" y="96">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Mail" value="//h;*[contains(.,&quot;@&amp;quot;)]/."/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="36"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
When I start the process, it's finished after 0 s, without anything extracted.

How do you get it to work properly?

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello,

    As a start, replace the extract information operator with the tokenize operator.

    regards

    Andrew
  • xtraplus Member Posts: 20 Contributor II
    Hi Andrew,

    I would like to do it as demonstrated in this video:

    http://www.youtube.com/watch?v=vKW5yd1eUpA&feature=player_embedded


    When I hit start I get a "process failed" message: "A DocType cannot be added after the root element"

    What does this mean?

    When I add a /* to my directory I don't get this message, but that is different from the video.

    However, when I start the process, it's finished after 0 s, without anything extracted.



    Why should I use "Tokenize" instead? I want to use a complex XPath query to extract certain information.
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The XPath has to work on something, but I can't work out what the input XML looks like (and XPath is one of the "dirty dozen development" things that mere mortals should never have to worry about).

    Knowing this "eternal verity", I tend to make everything look like a spreadsheet and then work from there.

    Try tokenize and see what happens (it might not help, but without the input data it's difficult to say).

    Cheers,

    Andrew
  • xtraplus Member Posts: 20 Contributor II
    Hi,

    I tried Tokenize, but nothing gets extracted. My input is just HTML files downloaded via "web crawl".

    How do I make the HTML files look like spreadsheets, please?
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    It's difficult to say without the data, but I would try a simpler XPath first and build up from there. You could also set a breakpoint before and after the Extract Information operator to see if this gives insight into what is happening.

    regards

    Andrew
  • colo Member Posts: 236 Maven
    xtraplus wrote:

    When I hit start I get a "process failed" message: "A DocType cannot be added after the root element"

    What does this mean?

    Hi,

    I received this error from time to time when crawling lots of pages, some of which generated script errors. If possible, visit the URL of the page causing the process to stop in your browser. When the problem appeared for me, there were PHP error messages contained on the page. They were put at the very beginning of the generated HTML document, thus making the document invalid. The XPath interpreter seems to be restrictive about that. The error message says that a doctype is declared at a point where this isn't allowed, which means something was found before the declaration (which should usually be the first line). I wish this error were simply ignored and the page skipped, but unfortunately it aborts the whole process.
    You can probably use "Handle Exception" to keep the process running, but since a page may contain interesting content even though an error was generated, I used another approach. I just used a "Replace" operator for each page, replacing "(?is).*?(<!doctype)" by "$1", which should remove anything in front of the doctype declaration. This needs some additional computing time, but it helped a lot for me.
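    (A minimal sketch of the same idea outside RapidMiner, assuming Python's re module; the sample page content is invented. In RapidMiner's "Replace" operator the replacement is written as "$1", in Python it is "\1".)

    import re

    # Invented crawl result: a PHP warning printed before the doctype makes the page invalid XML.
    page = 'Warning: mysql_connect() failed<br/>\n<!DOCTYPE html><html><body>...</body></html>'

    # Same pattern as in the Replace operator: drop everything in front of <!doctype.
    cleaned = re.sub(r'(?is).*?(<!doctype)', r'\1', page)
    print(cleaned)   # <!DOCTYPE html><html><body>...</body></html>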

    Regards
    Matthias
  • xtraplus Member Posts: 20 Contributor II
    Hi Andrew and Matthias,

    Thanks, the breakpoint method seems to have worked. I found some corrupt HTML files.

    When I filter with

    //h:a[contains(@href,"@")]

    in RapidMiner I get:

    <a xmlns="http://www.w3.org/1999/xhtml" shape="rect" href="mailto:abc@abc.com">abc@abc.com</a>

    When I do the same XPath query in Google Docs, I just get:

    abc@abc.com

    How can I get the Google Docs result in RapidMiner?

    Where do I have to place the "Handle Exception" operator in order to catch it? It didn't seem to work in the places I tried.

    Is there more involved in catching exceptions than placing the "Handle Exception" operator?
  • colo Member Posts: 236 Maven
    Hi,

    Do you receive the results from Google as plain text or as a hyperlink? Maybe the HTML code is just converted into a link? The XPath expression you are using should usually give you the whole a-tag, not just the text or the href attribute. To get those, you can either append /text() for the link text, or preferably /@href for the content of the href attribute.
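    (A small sketch of the difference, assuming Python's lxml and a made-up XHTML snippet; it is not the RapidMiner operator itself, but it shows what the three query variants return.)

    from lxml import etree

    NS = {"h": "http://www.w3.org/1999/xhtml"}
    doc = etree.fromstring(
        b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        b'<a shape="rect" href="mailto:abc@abc.com">abc@abc.com</a>'
        b'</body></html>')

    # Whole element -> the serialized <a ...> tag, as seen in RapidMiner
    print(etree.tostring(doc.xpath('//h:a[contains(@href, "@")]', namespaces=NS)[0]))

    # /text() -> just the link text
    print(doc.xpath('//h:a[contains(@href, "@")]/text()', namespaces=NS))   # ['abc@abc.com']

    # /@href -> just the attribute value
    print(doc.xpath('//h:a[contains(@href, "@")]/@href', namespaces=NS))    # ['mailto:abc@abc.com']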
    If you put the operators that may cause an exception inside the "Handle Exception" operator, this should work. But I only tested this once, some time ago. Later I always tried to adjust or "repair" the data that might cause problems for some operators. But the success of this depends on the error source and how creative it is ;)

    Regards
    Matthias
  • xtraplus Member Posts: 20 Contributor II
    Hi Matthias,

    Thanks, though putting "Handle Exception" around the offending operator does not prevent the process from failing.

    It could be that I have too many corrupt HTML files to sort them out by hand.

    Unfortunately the replacing takes too much processing time.

    Sorting them out by hand is probably my last option.

    I get the "A DocType cannot be added after the root element" exception in an irrational manner.

    One time the process fails at application 70; then I sort number 70 out and next the process fails at 69, and so on.
  • colo Member Posts: 236 Maven
    xtraplus wrote:

    One time the process fails at application 70; then I sort number 70 out and next the process fails at 69, and so on.

    Hi,

    Exactly because of this fact, replacing the errors seemed like a good solution to me. But I must agree, the runtime of the replace pattern I first posted is far too high, since the whole document is scanned for the pattern. I also faced the runtime problem in my first attempts, and the solution wasn't very tricky. I'm sorry, it seems I copied the regex from one of the early processes. You just have to add one symbol to scan only the beginning of the document. Try this, it should speed things up dramatically:
    (?is)^.*?(<!doctype)
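    (The same sketch as before in Python's re, only with the anchored pattern; the example input is assumed. The ^ ties the match to the start of the document, so the engine does not keep rescanning the rest of the page for further "<!doctype" occurrences.)

    import re

    page = 'PHP Warning: something broke\n<!DOCTYPE html><html><body>...</body></html>'

    # Anchored at the document start, so only the leading junk is scanned and replaced.
    print(re.sub(r'(?is)^.*?(<!doctype)', r'\1', page))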
    Regards
    Matthias
  • xtraplus Member Posts: 20 Contributor II
    Hi

    Thanks, though in the meantime I sorted it out manually. I had a couple of different errors; replacing alone probably would not have worked for me.

    I would also like to try it with a regular expression.

    What is an equivalent regular expression for //h:*[contains(@href,"@")]/@href?

    Does this make sense?
  • colo Member Posts: 236 Maven
    Hi,

    quickly written - and without testing it - this might work:
    <a[^>]*href\s*=\s*['"](.*?)['"][^>]*>
    But the processing time will certainly be pretty high...

    Replacing with my previously posted regex should at least eliminate all errors from this type: "A DocType cannot be added after the root element"

    Regards
    Matthias

    Edit: Oops, the check for the @ sign is of course missing. That would require the use of a lookahead assertion, and I currently don't have the time to look this up, since I don't use them often. Otherwise you could collect all href values as above and then check them for an @ in a second step.
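    (A rough sketch of that two-step idea, assuming Python's re module; the HTML snippet is made up and the pattern is the one posted above.)

    import re

    html = ('<p>Mail us: <a href="mailto:abc@abc.com">abc@abc.com</a> or see '
            '<a href="http://example.com/contact">the contact page</a></p>')

    # Step 1: collect all href values (same regex as above, case-insensitive).
    hrefs = re.findall(r"""<a[^>]*href\s*=\s*['"](.*?)['"][^>]*>""", html, flags=re.I | re.S)

    # Step 2: keep only the values that contain an @ sign.
    print([h for h in hrefs if "@" in h])   # ['mailto:abc@abc.com']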
  • xtraplus Member Posts: 20 Contributor II
    Hi

    thanks

    Do you know a good site where I can look up regular expressions, please?

  • colo Member Posts: 236 Maven
    Hi,

    this one should contain some useful information: http://www.regular-expressions.info/

    There are also some sites in German, which should not be a problem for you ;)
    http://www.regenechsen.de/phpwcms/index.php?regex_allg
    http://www.sql-und-xml.de/regex/

    Regards
    Matthias