Letter count in sequence

BDk · November 2022

Hi, I'm quite new to the software. I would like to count the letters in an arbitrary sequence (e.g. GGGAATCGTCA), for example how many times 'A' occurs in it, and put the count into a new column. Is there an operator that can be used for this? Thank you in advance!

MarcoBarradas · December 2022

Hi @BDk

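You can use a Process Documents operator, split the text into tokens inside it with Tokenize, and set the vector creation to Term Occurrences so that the occurrences of each token are counted: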

<?xml version="1.0" encoding="UTF-8"?><process version="9.10.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.011" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="-1"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="9.4.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
        <parameter key="text" value="GGGAATCGTCA"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="regular expression"/>
            <parameter key="characters" value=".:"/>
            <parameter key="expression" value="|"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

BDk · December 2022

Do I need an extension for this 'Process Documents' operator? I have an educational version of the software and could not find the operator.

BDk · December 2022

OK, found the extension, sorry. It works fine for one row, thanks Marco. Can it be applied to many rows?
I have a table with 1000+ rows, each containing a letter sequence like the one I posted above. I would like to count the letters in each row separately. With the posted solution it either works for a single row or, if I enter all 1000+ sequences via 'Create Document', it counts the letters of all rows together.

MarcoBarradas · December 2022

Hi @BDk

You'll need to use Process Documents, Process Documents from Data, or Process Documents from Files, depending on how your data was originally collected.
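
For a table that is already loaded as an ExampleSet, the Process Documents from Data variant might look roughly like the sketch below. It is not a tested process: the repository path and the column name "sequence" are hypothetical placeholders, the operator classes and port names are written from memory of the Text Processing extension and should be checked against your installation, and omitted parameters are assumed to fall back to their defaults. The idea is that each row is treated as its own document, so the term occurrences are reported per row rather than for all rows together.

```xml
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.011">
  <operator activated="true" class="process" compatibility="9.10.011" expanded="true" name="Process">
    <process expanded="true">
      <!-- Hypothetical repository entry: replace with your own table of sequences -->
      <operator activated="true" class="retrieve" compatibility="9.10.011" expanded="true" height="68" name="Retrieve" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/data/sequences"/>
      </operator>
      <!-- Assumed column name "sequence"; it is converted to type text so it can be read as a document -->
      <operator activated="true" class="nominal_to_text" compatibility="9.10.011" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="sequence"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="9.4.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <!-- Keep the original sequence next to the count columns -->
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <process expanded="true">
          <!-- Same tokenizer mode and expression as in the single-document process above -->
          <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="regular expression"/>
            <parameter key="expression" value="|"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    </process>
  </operator>
</process>
```

If the sequences live in separate files instead of a table, Process Documents from Files would be the operator to try, with the same Tokenize step inside.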

Altair RapidMiner Community
