Conditional Tag Search with RegEx Output

Limegreenman900 · May 2016

Hi everyone,

I am currently working on a large data set (~4 million HTML files stored on my computer), and I am wondering whether there is a search/parse function in RM that lets me search all documents for a specific tag and, IF the tag is found, THEN search the same string with a regular expression that matches a certain number.

For example, I have a string like:
<ix:nonFraction name="AuditFeesExpenses" contextRef="FY1.segment.bus-ThirdPartyAgentTypeDimension.bus-EntityAccountantsOrAuditorsGroupCompanyDimension.-Consolidated" unitRef="USD" xmlns:aurep="http://www.xbrl.org/reports/aurep/2009-09-01" decimals="0" format="ixt:numcommadot">14,825</ix:nonFraction>

I want to search for the tag "AuditFeesExpenses" and, IF it is found, RM should apply a regular expression that matches the number "14,825" (the RegEx is not my problem!).

Does anyone have an idea whether this is possible in RM?

Thanks!
Flo

MartinLiebig · May 2016

Just a quick idea: why not push it into a Solr server and then query it from RM? Sounds like a nice idea.

Otherwise - you may check Extract Information or (funnily) Replace.
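
As a rough illustration of the conditional match such an operator would have to perform, here is a minimal sketch in plain Python rather than RM. The function name `find_tagged_number` and the pattern are assumptions made for illustration, not RM operator settings; the number handling assumes the `ixt:numcommadot` format from the example string (comma as thousands separator, dot as decimal point).

```python
import re

def find_tagged_number(html, tag_name):
    """Return the numeric content of the ix:nonFraction element whose
    name attribute equals tag_name, or None if no such element exists."""
    # Hypothetical pattern: first require the element with the wanted name,
    # then capture its text content (digits, commas, dots).
    pattern = re.compile(
        r'<ix:nonFraction\b[^>]*\bname="' + re.escape(tag_name) + r'"[^>]*>'
        r'\s*([\d.,]+)\s*</ix:nonFraction>',
        re.IGNORECASE,
    )
    match = pattern.search(html)
    if match is None:
        return None  # tag not present, so no number is extracted
    # Assumes numcommadot: strip thousands separators before converting.
    return float(match.group(1).replace(",", ""))

sample = ('<ix:nonFraction name="AuditFeesExpenses" unitRef="USD" '
          'decimals="0" format="ixt:numcommadot">14,825</ix:nonFraction>')

print(find_tagged_number(sample, "AuditFeesExpenses"))
print(find_tagged_number(sample, "SomeOtherTag"))
```

Because the tag name and the number are matched in one pattern, the number is only returned when the tag condition holds, which is the IF/THEN behaviour asked for.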

JEdward · May 2016

Yep, the Solr server sounds like the way to go with this one. It's designed for search and for solving exactly this type of problem, and you can install the extension from the Marketplace.

But, assuming you don't want to use Solr (no... I really recommend you do for 4 million files), here is a way to do it.
I would also suggest (given the file structure) that an XPath might work better than a regular expression. Here's a quick example, below, using your string. You can use XPath with the Read XML operator, but for that many documents (if not using Solr) I would recommend using a Groovy script within your workflow to process them.

In this example I convert from HTML to XML, but you might not need this if your documents are already well-formed XML. Give it a try on a couple of files.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="136">
        <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.html"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="html"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:html_to_xml" compatibility="7.0.000" expanded="true" height="68" name="Html To Xml" width="90" x="179" y="85"/>
      <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Write Document" width="90" x="313" y="34">
        <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.xml"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="advanced_file_connectors:read_xml" compatibility="7.1.000" expanded="true" height="68" name="Read XML" width="90" x="514" y="34">
        <parameter key="file" value="C:\Users\user\Desktop\AuditFeesExpenses.xml"/>
        <parameter key="xpath_for_examples" value="//html:html/html:body/ix:nonFraction[@name=&quot;AuditFeesExpenses&quot;]"/>
        <enumeration key="xpaths_for_attributes">
          <parameter key="xpath_for_attribute" value="attribute::contextref"/>
          <parameter key="xpath_for_attribute" value="attribute::decimals"/>
          <parameter key="xpath_for_attribute" value="attribute::format"/>
          <parameter key="xpath_for_attribute" value="attribute::name"/>
          <parameter key="xpath_for_attribute" value="attribute::unitref"/>
          <parameter key="xpath_for_attribute" value="text()"/>
        </enumeration>
        <list key="namespaces">
          <parameter key="html" value="http://www.w3.org/1999/xhtml"/>
          <parameter key="ix" value="urn:x-prefix:ix"/>
        </list>
        <parameter key="default_namespace" value="urn:x-prefix:ix"/>
        <parameter key="grouped_digits" value="true"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.false.nominal.attribute"/>
          <parameter key="1" value="Att2.false.nominal.attribute"/>
          <parameter key="2" value="Att3.false.nominal.attribute"/>
          <parameter key="3" value="Attribute.true.polynominal.attribute"/>
          <parameter key="4" value="Currency.true.polynominal.attribute"/>
          <parameter key="5" value="Value.true.real.attribute"/>
        </list>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Html To Xml" to_port="document"/>
      <connect from_op="Html To Xml" from_port="document" to_op="Write Document" to_port="document"/>
      <connect from_op="Write Document" from_port="file" to_op="Read XML" to_port="file"/>
      <connect from_op="Read XML" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
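
To try the same XPath idea outside RM on a couple of files, here is a minimal sketch in Python using the standard library's ElementTree. It assumes the input is already well-formed XML and that the `ix` prefix is bound to the same `urn:x-prefix:ix` URI used in the namespaces list of the process above; the function name `audit_fee_values` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs, copied from the "namespaces" list in the process above.
NS = {"html": "http://www.w3.org/1999/xhtml", "ix": "urn:x-prefix:ix"}

def audit_fee_values(xml_text, name="AuditFeesExpenses"):
    """Return (unit, text) pairs for every ix:nonFraction with the given name."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    nodes = root.findall(f".//ix:nonFraction[@name='{name}']", NS)
    return [(n.get("unitRef"), (n.text or "").strip()) for n in nodes]

# Hypothetical, already-converted sample document.
sample = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="urn:x-prefix:ix">
  <body>
    <ix:nonFraction name="AuditFeesExpenses" unitRef="USD" decimals="0">14,825</ix:nonFraction>
    <ix:nonFraction name="OtherFees" unitRef="USD" decimals="0">1,200</ix:nonFraction>
  </body>
</html>"""

print(audit_fee_values(sample))
```

Only elements that pass the `@name` predicate are returned, so the value is picked up only when the tag is present.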

Limegreenman900 · May 2016

Perfect, I had a short glance at Solr and it seems to fit my needs!

Thanks for your code proposal, but I think it would take too long to convert every document into an XML file before processing it.

MartinLiebig · May 2016

By the way, maybe this is also a use case for Apache Drill. It has a JDBC connector and might work. It would be AMAZING to see this working.

~Martin
