
Conditional Tag Search with RegEx Output

Limegreenman900 Member Posts: 6 Contributor II
edited November 2018 in Help
Hi everyone,

I am currently working on a big set of data (~4 million HTML files stored on my computer), and I am wondering if there is any search/parse function in RM that lets me search all documents for a unique tag and, IF the tag is found, THEN search the same string for a regular expression that matches a certain number.

For example, I have a string like:
<ix:nonFraction name="AuditFeesExpenses" contextRef="FY1.segment.bus-ThirdPartyAgentTypeDimension.bus-EntityAccountantsOrAuditorsGroupCompanyDimension.-Consolidated" unitRef="USD" xmlns:aurep="http://www.xbrl.org/reports/aurep/2009-09-01" decimals="0" format="ixt:numcommadot">14,825</ix:nonFraction>

I want to search for the tag "AuditFeesExpenses" and, IF it is found, RM should apply a regular expression that matches the number "14,825" (the RegEx is not my problem!).

Does anyone know whether this is possible in RM?

Thanks!
Flo

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Just a quick idea: why not push it into a Solr server and then query it from RM? Sounds like a nice idea :)

    Otherwise - you may check Extract Information or (funnily) Replace.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Yep, the Solr server sounds like the way to go with this one. It's designed for search and for solving exactly this type of problem, and you can install the extension from the marketplace.

    But, assuming you don't want to use Solr (no... I really recommend you do for 4 million files), then here is a way to do it. 
    I would also suggest (from the file structure) that XPath might work better than a regular expression. Here's a quick example using the string from your post. You can use XPath with the Read XML operator, but for that many documents (if not using Solr) I would recommend using a Groovy script within your workflow to process them.

    In this example I convert from HTML to XML, but you might not need this if your documents are already well-formed XML. Give it a try on a couple of files.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="136">
            <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.html"/>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="use_file_extension_as_type" value="false"/>
            <parameter key="content_type" value="html"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:html_to_xml" compatibility="7.0.000" expanded="true" height="68" name="Html To Xml" width="90" x="179" y="85"/>
          <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Write Document" width="90" x="313" y="34">
            <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.xml"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="advanced_file_connectors:read_xml" compatibility="7.1.000" expanded="true" height="68" name="Read XML" width="90" x="514" y="34">
            <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.xml"/>
            <parameter key="xpath_for_examples" value="//html:html/html:body/ix:nonFraction[@name=&quot;AuditFeesExpenses&quot;]"/>
            <enumeration key="xpaths_for_attributes">
              <parameter key="xpath_for_attribute" value="attribute::contextref"/>
              <parameter key="xpath_for_attribute" value="attribute::decimals"/>
              <parameter key="xpath_for_attribute" value="attribute::format"/>
              <parameter key="xpath_for_attribute" value="attribute::name"/>
              <parameter key="xpath_for_attribute" value="attribute::unitref"/>
              <parameter key="xpath_for_attribute" value="text()"/>
            </enumeration>
            <list key="namespaces">
              <parameter key="html" value="http://www.w3.org/1999/xhtml"/>
              <parameter key="ix" value="urn:x-prefix:ix"/>
            </list>
            <parameter key="default_namespace" value="urn:x-prefix:ix"/>
            <parameter key="grouped_digits" value="true"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Att1.false.nominal.attribute"/>
              <parameter key="1" value="Att2.false.nominal.attribute"/>
              <parameter key="2" value="Att3.false.nominal.attribute"/>
              <parameter key="3" value="Attribute.true.polynominal.attribute"/>
              <parameter key="4" value="Currency.true.polynominal.attribute"/>
              <parameter key="5" value="Value.true.real.attribute"/>
            </list>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Html To Xml" to_port="document"/>
          <connect from_op="Html To Xml" from_port="document" to_op="Write Document" to_port="document"/>
          <connect from_op="Write Document" from_port="file" to_op="Read XML" to_port="file"/>
          <connect from_op="Read XML" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
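
If converting every file to XML first turns out to be too slow, the "find the tag, then apply the RegEx" idea can also be tried directly on the raw HTML text. Below is a minimal Python sketch of that logic, outside of RapidMiner. The function name `extract_audit_fees` and all three patterns are assumptions made up for illustration (they are not RM operators), and the sketch assumes the `ix:nonFraction` element contains no nested tags.

```python
import re

# Captures the attribute text and the content of each ix:nonFraction element.
# Assumption: the element content contains no nested tags.
TAG_RE = re.compile(r"<ix:nonFraction\b([^>]*)>(.*?)</ix:nonFraction>", re.DOTALL)
# The condition: the element must carry name="AuditFeesExpenses".
NAME_RE = re.compile(r'\bname="AuditFeesExpenses"')
# Digits with optional comma thousands separators and optional decimals.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_audit_fees(html):
    """Return the numeric values of all AuditFeesExpenses facts in html."""
    values = []
    for attrs, content in TAG_RE.findall(html):
        if not NAME_RE.search(attrs):
            continue  # tag found, but not the one we are looking for
        match = NUMBER_RE.search(content)
        if match:
            # Strip the thousands separators before converting.
            values.append(float(match.group().replace(",", "")))
    return values


sample = (
    '<ix:nonFraction name="AuditFeesExpenses" unitRef="USD" '
    'decimals="0" format="ixt:numcommadot">14,825</ix:nonFraction>'
    '<ix:nonFraction name="OtherFees" unitRef="USD">1,000</ix:nonFraction>'
)
print(extract_audit_fees(sample))  # prints [14825.0]
```

The second element in `sample` is skipped because its `name` attribute does not match, which is the conditional part of the search. For millions of files you would call the function once per file and collect the results; XPath on properly parsed documents is still the more robust option when the markup varies.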
  • Limegreenman900 Member Posts: 6 Contributor II
    Perfect, I had a short glance at Solr and it seems to fit my needs!

    Thanks for your code proposal, but I think it will take too long to convert every document into an XML file before processing it ;)

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    By the way, maybe this is also a use case for Apache Drill. It has a JDBC connector and might work. Would be AMAZING to see this working.

    ~Martin