Sentence analysis - keeping origin of sentence?

xt_cray Member Posts: 20 Contributor I
edited December 2018 in Help

Hello there,

 

After successfully breaking documents into sentences, I stumbled over a problem which didn't seem like one at first.

When extracting the sentences with the linguistic sentence tokenizer, I need to de-pivot the result so that the sentences actually end up in rows. So far so good. However, I can't keep the text itself (which would be possible thanks to the option in the "Process Documents" operator), because the De-Pivot operator then stops with a type mismatch (I obviously haven't completely understood that operator). Nor can I keep the title of the document, which is also an attribute but gets lost during tokenizing. So is there a way to keep either the text attribute (which contains the complete text) or, perhaps even more suitable, the document title attribute, which is there when retrieving but gets lost during processing?

Maybe this is useful for understanding what I'm doing. My process looks like this:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve (2)" width="90" x="45" y="34">
<parameter key="repository_entry" value="../Data/HighQualityTestTexts"/>
</operator>
<operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34"/>
<operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="136">
<parameter key="attribute" value="Sentences"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
<parameter key="attribute_name" value="id"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="313" y="34">
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="false" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Aggregate Token Length (2)" width="90" x="715" y="34">
<parameter key="aggregation" value="count"/>
</operator>
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="45" y="34">
<parameter key="minimum_text_block_length" value="3"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="179" y="34">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remove_useless_attributes" compatibility="7.6.001" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="313" y="136"/>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="447" y="34">
<parameter key="attribute" value="id"/>
<parameter key="attributes" value="Title|Language|Description|Keywords|Robots|Id"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="de_pivot" compatibility="7.6.001" expanded="true" height="82" name="De-Pivot" width="90" x="581" y="34">
<list key="attribute_name">
<parameter key="ProzentualerAnteil" value=".*"/>
</list>
<parameter key="index_attribute" value="Sentences"/>
<parameter key="create_nominal_index" value="true"/>
<parameter key="keep_missings" value="true"/>
</operator>
<connect from_op="Retrieve (2)" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data (3)" to_port="example set"/>
<connect from_op="Process Documents from Data (3)" from_port="example set" to_op="Remove Useless Attributes" to_port="example set input"/>
<connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
<connect from_op="Select Attributes (3)" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
<connect from_op="De-Pivot" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Thanks in advance.

 

Oliver

Best Answer

  • kayman Member Posts: 662 Unicorn
    Solution Accepted

    Hi @xt_cray,

    What about this? Loop over the examples, store the attributes you want to add as macros, generate your sentences, and stitch everything together again.

    The example below shows this using your current data. I had to make up the source data, so I may have missed some details, but I believe it can point you in the right direction.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="source data" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="Anzahl_Saetze_Text" value="81"/>
    <parameter key="ConfluenceSpaceName" value="&quot;Cognos&quot;"/>
    <parameter key="ConfluenceTitle" value="&quot;BI - Installation - Cognos&quot;"/>
    <parameter key="att_1.0" value="1"/>
    <parameter key="SourceText" value="&quot;just some dummy strings, these will go to your word processor&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="loop_examples" compatibility="7.6.001" expanded="true" height="103" name="Loop Examples" width="90" x="179" y="34">
    <parameter key="iteration_macro" value="curr_ex"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="45" y="34">
    <parameter key="first_example" value="%{curr_ex}"/>
    <parameter key="last_example" value="%{curr_ex}"/>
    <description align="center" color="transparent" colored="false" width="126">get current example (using the iteration macro value)</description>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="34">
    <parameter key="macro" value="Anzahl_Saetze_Text"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="Anzahl_Saetze_Text"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros">
    <parameter key="ConfluenceSpaceName" value="ConfluenceSpaceName"/>
    <parameter key="ConfluenceTitle" value="ConfluenceTitle"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">store the current examples main attributes in memory</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="sentence maker" width="90" x="313" y="34">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="7.6.001" expanded="true" height="68" name="Read CSV" width="90" x="380" y="34">
    <parameter key="csv_file" value="C:\currentresult.csv"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    <description align="center" color="transparent" colored="false" width="126">This is your current process, so the result is as you have them now.</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
    <list key="function_descriptions">
    <parameter key="ConfluenceSpaceName" value="%{ConfluenceSpaceName}"/>
    <parameter key="Anzahl_Saetze_Text" value="%{Anzahl_Saetze_Text}"/>
    <parameter key="ConfluenceTitle" value="%{ConfluenceTitle}"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Add the stored variables as new attributes</description>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">your current process</description>
    </operator>
    <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="sentence maker" to_port="in 1"/>
    <connect from_op="sentence maker" from_port="out 1" to_port="output 1"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="sink_example set" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="860" y="119">Watch the out port, do not attach to the exa unless you need your current example</description>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="82" name="Append" width="90" x="313" y="34"/>
    <connect from_op="source data" from_port="output" to_op="Loop Examples" to_port="example set"/>
    <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
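
The loop-and-stitch idea can also be sketched outside RapidMiner. The following is plain Python, not a RapidMiner process, and only mirrors the logic: for each document, remember the attributes to keep, split the text into sentences, attach the remembered values to every sentence row, and collect all rows. The naive regex splitter merely stands in for the linguistic sentence tokenizer, and the second document ("Demo" / "Other page") is made up for illustration.

```python
import re


def split_sentences(text):
    # Naive stand-in for a linguistic sentence tokenizer:
    # split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentences_with_origin(documents, keep=("ConfluenceSpaceName", "ConfluenceTitle")):
    rows = []
    for doc in documents:                        # like looping over the examples
        stored = {k: doc[k] for k in keep}       # like storing attributes as macros
        for sentence in split_sentences(doc["SourceText"]):
            rows.append({"text": sentence, **stored})  # re-attach stored values
    return rows                                  # like appending all partial results


documents = [
    {
        "ConfluenceSpaceName": "Cognos",
        "ConfluenceTitle": "BI - Installation - Cognos",
        "SourceText": "First sentence. Second sentence.",
    },
    {
        # Hypothetical second document, only to show that origins stay separate.
        "ConfluenceSpaceName": "Demo",
        "ConfluenceTitle": "Other page",
        "SourceText": "Only one sentence here.",
    },
]

rows = sentences_with_origin(documents)
for row in rows:
    print(row)
```

Every sentence row carries the title and space name of the document it came from, which is the "origin" asked for in the question.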

     

Answers

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Try the "keep text" parameter of the "Process Documents" operator.


    Andrew

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    And also the "Add meta information" option, which will store the filename of the original document (if your document was loaded from a file).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • xt_cray Member Posts: 20 Contributor I

    First, thank you for the hints. Unfortunately, neither works. Well, of course they do, but as soon as I use the De-Pivot operator, it shows me nothing anymore...

    What I get without De-Pivot, when using "keep text", "add meta information" and "Aggregate Token Length", is:

    Row No.  id  text           token_length  Sentence1  Sentence2  Sentence3  ...
    1        1   completetext1  307           0          0          0
    2        2   completetext2  353           0          0          0.047
    3        3   completetext3  408           0          0.033      0
    ...

     

    What I would like to have is (after de-Pivoting):

    Row No.  Sentence   Completetext   TextTitle  token_length
    1        Sentence1  Completetext1  Title1     307
    2        Sentence2  Completetext1  Title1     353
    3        Sentence3  Completetext2  Title2     408
    ...

     

    However, De-Pivot only works when "add meta information" and "keep text" as well as the "Aggregate Token Length" operator are turned off. If I use them, the de-pivoting fails by showing nothing. I also noticed that my previous attributes seem to vanish due to the Extract Information operator within the Process Documents from Data operator. (In my repository, the text content actually contains HTML tags. Or should the repository rather be completely cleaned of them, so that I don't have to use the Extract Information operator?)

    I guess there may be some more operators involved, but I actually don't know how to proceed here.
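
To make the desired wide-to-long reshaping concrete, here is a minimal sketch in plain Python. It is not the RapidMiner De-Pivot operator, only an illustration of the target logic: the columns named in `id_columns` are copied onto every generated row, and each `Sentence…` column becomes a row of its own. The function name `de_pivot`, the `value` column and the prefix convention are assumptions made for this example.

```python
def de_pivot(rows, id_columns, value_prefix="Sentence"):
    """Turn one row per document into one row per (document, sentence column)."""
    long_rows = []
    for row in rows:
        kept = {c: row[c] for c in id_columns}       # columns repeated on every row
        for name, value in row.items():
            if name not in id_columns and name.startswith(value_prefix):
                long_rows.append({**kept, "Sentence": name, "value": value})
    return long_rows


# Wide table as shown in the post above.
wide = [
    {"id": 1, "text": "completetext1", "token_length": 307,
     "Sentence1": 0, "Sentence2": 0, "Sentence3": 0},
    {"id": 2, "text": "completetext2", "token_length": 353,
     "Sentence1": 0, "Sentence2": 0, "Sentence3": 0.047},
    {"id": 3, "text": "completetext3", "token_length": 408,
     "Sentence1": 0, "Sentence2": 0.033, "Sentence3": 0},
]

long_rows = de_pivot(wide, id_columns=("id", "text", "token_length"))
for row in long_rows:
    print(row)
```

The point of the sketch is that the text and id columns are treated as identifiers rather than as values to be stacked, so their differing types never collide with the numeric sentence columns.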

     

    Oliver

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    OK, I threw together this quick process for you to see how it should be put together. I think this is what you want: it takes a document and parses it into sentences, creating a separate document from each sentence. Then there is some cleanup which is needed to transpose it and give it the correct structure (add an id, make the data type text, etc.), and then it tokenizes those sentences into words and generates the token (word) count for each sentence. The resulting dataset simply has an id, the original sentence text, and the number of tokens. If you want the counts of the individual tokens, just check the "create word vector" parameter in the last operator (although then you will probably want to do some additional pre-processing like case transformations, etc.).

    You should be able to easily adapt this to your input data.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
    <parameter key="text" value="This is a sample document. It has three sentences in it. That's for testing purposes only and is just some example data."/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="mode" value="linguistic sentences"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="Sentence Cleanup" width="90" x="380" y="34">
    <process expanded="true">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.6.001" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="sentence"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="sentence"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="514" y="136"/>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="648" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att_1"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <connect from_port="in 1" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="85"/>
    <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="85"/>
    <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_op="Extract Token Number" to_port="document"/>
    <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_op="Sentence Cleanup" to_port="in 1"/>
    <connect from_op="Sentence Cleanup" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
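
For readers who want to see the same pipeline as code, here is a rough sketch in plain Python rather than RapidMiner: split the text into sentences, give each sentence an id, and count its tokens. The regex sentence splitter and the letters-only token pattern are simplifying assumptions and will not match the Text Processing extension in every case. The column name `token_number` is likewise chosen only for this example.

```python
import re


def sentence_token_counts(text):
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    table = []
    for i, sentence in enumerate(sentences, start=1):
        # Tokens are runs of letters, so "That's" yields "That" and "s".
        tokens = re.findall(r"[^\W\d_]+", sentence)
        table.append({"id": i, "sentence": sentence, "token_number": len(tokens)})
    return table


sample = (
    "This is a sample document. It has three sentences in it. "
    "That's for testing purposes only and is just some example data."
)

table = sentence_token_counts(sample)
for row in table:
    print(row["id"], row["token_number"], row["sentence"])
```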

     

    Brian T.
  • xt_cray Member Posts: 20 Contributor I

    OK, thank you. That provides some clues.

    I played around with it, and now I will try to figure out how to keep my attributes from the first Process Documents operator, because I have several documents in the repository and need to keep track of their origin, so that I can see which sentence comes from which text.

    So I will try and if I can't get it working, ask again ;-)

     

    Oliver

  • xt_cray Member Posts: 20 Contributor I

    After playing around with the starter process I was given, I figured out how to get the number of sentences in the document etc. Everything is fine up to the point where the transpose comes into play. It does what it should and creates a row for each sentence, displaying the word count of that sentence, and I'm quite happy with that.

    What I couldn't figure out yet is how to add an attribute to the example set after the transpose, one that I already have (or rather "had"). The thing is that in the first Process Documents operator I also generated the number of sentences of the document. I narrowed my dataset down to one article/text only, because I think this should be enough to start with. Complexity can come later, when I have understood this problem...

    So there is no title that changes for certain rows, just a number which I would like to have in a new column (like one you can create with the "Generate Attributes" operator) after the text has been transposed, which of course makes it lose the attributes from the step before.

    I tried to use "Generate Empty Attribute" after the word count for each sentence (Process Documents operator), but then failed to read in the value for each row. Generating as well as selecting an attribute also failed: I managed to get the attributes "around" the transpose step, but then it was either not possible to merge them into the example set, or the value was added only once (which is quite logical, because there's only one entry before the sentences are created).

    So how can I get an attribute column that is filled from the first to the last row with the value of the attribute already generated in the first Process Documents step?

     

    Oliver

  • kayman Member Posts: 662 Unicorn

    Would you have a sample of what you have now and what you would like to get? Just quick and dirty in text format is OK, so it becomes a bit more visible.

  • xt_cray Member Posts: 20 Contributor I

    Sure - thanks for your reply.

    So what I have at the moment in CSV is:

     

    att_1.0;WortanzahlSatz;text;id
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0

    The attribute "WortanzahlSatz" is the number of words from each sentence in the corresponding row, "text" is actually the corresponding sentence.

     

    The most ideal result in the end would be (also in a - handmade - CSV):

     

    att_1.0;WortanzahlSatz;text;id;Anzahl_Saetze_Text;ConfluenceSpaceName;ConfluenceTitle
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0;81.0;Cognos;BI - Installation - Cognos
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0;81.0;Cognos;BI - Installation - Cognos
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0;81.0;Cognos;BI - Installation - Cognos

    So, I already have the additional attributes "Anzahl_Saetze_Text", "ConfluenceSpaceName" and "ConfluenceTitle" at the beginning (they get lost during transposing). "Anzahl_Saetze_Text" is the number of sentences in the text/document I read from the repository (currently only one text, so computation won't take too long), "ConfluenceSpaceName" is the origin "folder" (actually the space of the wiki I got this text from) and "ConfluenceTitle" is the header of this document/text.

    However I would already be happy if I can get the number of sentences in the text/document - in case it's not possible to get all additional three attributes in the dataset.
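
As a sketch of the target transformation only (plain Python with the standard `csv` module, not a RapidMiner process), the following appends the three document-level values to every sentence row of a semicolon-separated table. The two input rows are copied from the sample above. The function name `add_document_attributes` is made up for this example, and the values in `document_attributes` are taken from the hand-made target CSV.

```python
import csv
import io

current_csv = """att_1.0;WortanzahlSatz;text;id
1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
"""

# Document-level values that exist once per text and must be repeated per sentence.
document_attributes = {
    "Anzahl_Saetze_Text": "81.0",
    "ConfluenceSpaceName": "Cognos",
    "ConfluenceTitle": "BI - Installation - Cognos",
}


def add_document_attributes(csv_text, attributes):
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    fieldnames = list(reader.fieldnames) + list(attributes)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row.update(attributes)  # same values on every sentence row
        writer.writerow(row)
    return out.getvalue()


extended_csv = add_document_attributes(current_csv, document_attributes)
print(extended_csv)
```

With several documents, the same function would be called once per document with that document's own values, and the results concatenated.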

    The process looks like this at the moment (with the Multiply operator I try to "move" the attributes I'd like to keep around the transposing step):

    [Image: temp.png (breaking_into_sentences)]

    So that's the current status. I found a loop operator, and since the rows have to be filled, maybe that could be a possibility. Yet I'm still trying to figure out how to get it working properly...

     

    Oliver

  • xt_cray Member Posts: 20 Contributor I

    Sure - thanks for replying.

    So what I get currently (I put it here in form of a CSV):

    att_1.0;WortanzahlSatz;text;id
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).

    What I would like to have ideally (also in form of a "handmade" CSV as an extension to the current result):

    att_1.0;WortanzahlSatz;text;id;Anzahl_Saetze_Text;ConfluenceSpaceName;ConfluenceTitle
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0;81.0;Cognos;BI - Installation - Cognos
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0;81.0;Cognos;BI - Installation - Cognos
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0;81.0;Cognos;BI - Installation - Cognos
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0;81.0;Cognos;BI - Installation - Cognos
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0;81.0;Cognos;BI - Installation - Cognos

    The meaning of the additional attributes (which I already have in the beginning): "Anzahl_Saetze_Text" is the number of sentences in the whole text which comes from the repository (at the moment only one document to keep computation time down), "ConfluenceSpaceName" is the origin "folder" of the text (actually a space in the wiki I got the text extracted from) and "ConfluenceTitle" is the header/title of the document.

    However I would already be happy if I can get the number of sentences in the whole text as another column with values in the final dataset - as said, all attributes would be ideal. :-)

    So the current process looks like this:

    rm_ps.png

    The thing is, of course, that the attributes which vanish after transposing have only one value (row) before the splitting into sentences happens. So I think something like a loop must be used to fill the rest of the rows. I found such an operator, but have not yet been able to figure out how to use it properly - but maybe I'm also on the wrong track here...

    Any hints much appreciated.

     

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    Sure - thanks for replying.

    So this is what I currently get (I put it here in the form of a CSV):

     

    att_1.0;WortanzahlSatz;text;id
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).
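    To make the meaning of "WortanzahlSatz" concrete, here is a minimal sketch outside RapidMiner. It assumes plain whitespace splitting as a stand-in for the "linguistic tokens" mode, which is only an approximation; the function name `count_words` is hypothetical, and the length limits mirror the `min_chars`/`max_chars` values of the Filter Tokens operator in my process.

```python
def count_words(sentence, min_chars=2, max_chars=999):
    """Count whitespace-separated tokens whose length lies within the limits.

    Whitespace splitting is an assumption; RapidMiner's linguistic
    tokenizer may split some sentences differently.
    """
    tokens = sentence.split()
    return sum(1 for token in tokens if min_chars <= len(token) <= max_chars)


# Two sentences taken from the CSV rows above.
first = "Andernfalls wird man erhebliche Probleme bekommen"
second = ("Alternativ kann man dies auch über die IIS Management Konsole "
          "machen dazu folgende Galerie")

print(count_words(first))   # 6
print(count_words(second))  # 14
```

    For these two rows the counts agree with the values in the "WortanzahlSatz" column; other rows may differ because of how the real tokenizer handles punctuation.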

     

    What I would ideally like to have (also in the form of a "handmade" CSV, as an extension of the current result):

     

    att_1.0;WortanzahlSatz;text;id;Anzahl_Saetze_Text;ConfluenceSpaceName;ConfluenceTitle
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0;81.0;Cognos;BI - Installation - Cognos
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0;81.0;Cognos;BI - Installation - Cognos
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0;81.0;Cognos;BI - Installation - Cognos
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0;81.0;Cognos;BI - Installation - Cognos
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0;81.0;Cognos;BI - Installation - Cognos

    The meaning of the additional attributes (which I already have at the beginning): "Anzahl_Saetze_Text" is the number of sentences in the whole text that comes from the repository (at the moment only one document, to keep computation time down), "ConfluenceSpaceName" is the origin "folder" of the text (actually a space in the wiki I extracted the text from), and "ConfluenceTitle" is the header/title of the document.

     

    However, I would already be happy if I could get the number of sentences in the whole text as another column with values in the final dataset. As said, having all attributes would be ideal. :-)

    So the current process looks like this:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Trainingset" width="90" x="45" y="34">
    <parameter key="repository_entry" value="../Data/SingleArticle_Training"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="Satzinhalt"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="313" y="34">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="9999"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="Inhaltstext" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="45" y="34">
    <parameter key="extract_content" value="true"/>
    <parameter key="minimum_text_block_length" value="3"/>
    <parameter key="override_content_type_information" value="true"/>
    <parameter key="neglegt_span_tags" value="true"/>
    <parameter key="neglect_p_tags" value="true"/>
    <parameter key="neglect_b_tags" value="true"/>
    <parameter key="neglect_i_tags" value="true"/>
    <parameter key="neglect_br_tags" value="true"/>
    <parameter key="ignore_non_html_tags" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="380" y="34">
    <parameter key="mode" value="linguistic sentences"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Anzahl_Saetze_Text" width="90" x="648" y="34">
    <parameter key="metadata_key" value="Anzahl_Saetze_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (6)" width="90" x="380" y="136">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="514" y="136">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Wortzahl_Text" width="90" x="648" y="136">
    <parameter key="metadata_key" value="Wortzahl_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="join_paths" compatibility="7.6.001" expanded="true" height="103" name="Join Paths" width="90" x="849" y="34"/>
    <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
    <connect from_op="Extract Content (2)" from_port="document" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Tokenize (3)" to_port="document"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Tokenize (6)" to_port="document"/>
    <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Anzahl_Saetze_Text" to_port="document"/>
    <connect from_op="Anzahl_Saetze_Text" from_port="document" to_op="Join Paths" to_port="input 1"/>
    <connect from_op="Tokenize (6)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Wortzahl_Text" to_port="document"/>
    <connect from_op="Wortzahl_Text" from_port="document" to_op="Join Paths" to_port="input 2"/>
    <connect from_op="Join Paths" from_port="output" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="187"/>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="id"/>
    <parameter key="attributes" value="Title|Language|Description|Keywords|Robots|Id"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (4)" width="90" x="581" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (4)" width="90" x="112" y="34">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="246" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number (3)" width="90" x="447" y="34">
    <parameter key="metadata_key" value="WortanzahlSatz"/>
    <parameter key="condition" value="all"/>
    <parameter key="case_sensitive" value="false"/>
    <parameter key="invert_condition" value="false"/>
    </operator>
    <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
    <connect from_op="Tokenize (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_op="Extract Token Number (3)" to_port="document"/>
    <connect from_op="Extract Token Number (3)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="Sentence Cleanup" width="90" x="581" y="136">
    <process expanded="true">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (4)" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="text"/>
    <parameter key="attributes" value="text"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.6.001" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (5)" width="90" x="313" y="34">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="regular"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="sentence"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (4)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="sentence"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="715" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (5)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="attn"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <connect from_port="in 1" to_op="Select Attributes (4)" to_port="example set input"/>
    <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Set Role (5)" to_port="example set input"/>
    <connect from_op="Set Role (5)" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Nominal to Text (4)" to_port="example set input"/>
    <connect from_op="Nominal to Text (4)" from_port="example set output" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Select Attributes (5)" to_port="example set input"/>
    <connect from_op="Select Attributes (5)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (9)" width="90" x="581" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="Anzahl_Sätze_Text"/>
    <parameter key="attributes" value="id|Anzahl_Saetze_Text|ConfluenceTitle"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" breakpoints="after" class="join" compatibility="7.6.001" expanded="true" height="82" name="Join" width="90" x="849" y="34">
    <parameter key="remove_double_attributes" value="true"/>
    <parameter key="join_type" value="inner"/>
    <parameter key="use_id_attribute_as_key" value="true"/>
    <list key="key_attributes"/>
    <parameter key="keep_both_join_attributes" value="false"/>
    </operator>
    </process>

    I use the Multiply operator to "keep" the attributes and pass them around the transpose step - I think that is the "magic" point where the generation of the rows has to happen...

     

    The thing is, of course, that the attributes which vanish after transposing have only one value (row) before the splitting into sentences happens. So I think something like a loop must be used to fill the rest of the rows. I found such an operator, but have not yet been able to figure out how to use it properly - but maybe I'm also on the wrong track here...
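    To illustrate what I mean by "filling the rest of the rows", here is a small sketch in plain Python rather than RapidMiner. It is only an illustration of the intended result, not a RapidMiner solution: the `doc_id` key and the function name `attach_document_attributes` are hypothetical, and the sample values are taken from the CSV above. The idea is that no explicit loop over rows should be needed if every sentence row carries a key pointing back to its document, because the one-row document attributes can then be looked up and copied per sentence.

```python
# Sentence rows as they look after splitting (values from the CSV above).
# "doc_id" is a hypothetical key that links a sentence to its document.
sentence_rows = [
    {"doc_id": 1, "id": 11.0, "WortanzahlSatz": 6.0,
     "text": "Andernfalls wird man erhebliche Probleme bekommen"},
    {"doc_id": 1, "id": 15.0, "WortanzahlSatz": 35.0,
     "text": "Ab dem Windows Server 2012 ist .NET 3.5 ..."},
    {"doc_id": 1, "id": 16.0, "WortanzahlSatz": 14.0,
     "text": "Alternativ kann man dies auch über die IIS Management Konsole ..."},
]

# Document-level attributes: exactly one entry per document.
document_attributes = {
    1: {
        "Anzahl_Saetze_Text": 81.0,
        "ConfluenceSpaceName": "Cognos",
        "ConfluenceTitle": "BI - Installation - Cognos",
    },
}


def attach_document_attributes(rows, attributes_by_doc):
    """Copy each document's attributes onto every sentence row of that document."""
    return [{**row, **attributes_by_doc[row["doc_id"]]} for row in rows]


result = attach_document_attributes(sentence_rows, document_attributes)
for row in result:
    print(row["id"], row["WortanzahlSatz"], row["Anzahl_Saetze_Text"],
          row["ConfluenceSpaceName"], row["ConfluenceTitle"])
```

    This is the same effect a key-based join would give: the single document row is repeated once per sentence that shares its key.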

    Any hints much appreciated.

     

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    So the current process looks like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Trainingset" width="90" x="45" y="34">
    <parameter key="repository_entry" value="../Data/SingleArticle_Training"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="Satzinhalt"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="313" y="34">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="9999"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="Inhaltstext" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="45" y="34">
    <parameter key="extract_content" value="true"/>
    <parameter key="minimum_text_block_length" value="3"/>
    <parameter key="override_content_type_information" value="true"/>
    <parameter key="neglegt_span_tags" value="true"/>
    <parameter key="neglect_p_tags" value="true"/>
    <parameter key="neglect_b_tags" value="true"/>
    <parameter key="neglect_i_tags" value="true"/>
    <parameter key="neglect_br_tags" value="true"/>
    <parameter key="ignore_non_html_tags" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="380" y="34">
    <parameter key="mode" value="linguistic sentences"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Anzahl_Saetze_Text" width="90" x="648" y="34">
    <parameter key="metadata_key" value="Anzahl_Saetze_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (6)" width="90" x="380" y="136">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="514" y="136">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Wortzahl_Text" width="90" x="648" y="136">
    <parameter key="metadata_key" value="Wortzahl_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="join_paths" compatibility="7.6.001" expanded="true" height="103" name="Join Paths" width="90" x="849" y="34"/>
    <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
    <connect from_op="Extract Content (2)" from_port="document" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Tokenize (3)" to_port="document"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Tokenize (6)" to_port="document"/>
    <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Anzahl_Saetze_Text" to_port="document"/>
    <connect from_op="Anzahl_Saetze_Text" from_port="document" to_op="Join Paths" to_port="input 1"/>
    <connect from_op="Tokenize (6)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Wortzahl_Text" to_port="document"/>
    <connect from_op="Wortzahl_Text" from_port="document" to_op="Join Paths" to_port="input 2"/>
    <connect from_op="Join Paths" from_port="output" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="187"/>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="id"/>
    <parameter key="attributes" value="Title|Language|Description|Keywords|Robots|Id"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (4)" width="90" x="581" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (4)" width="90" x="112" y="34">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="246" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number (3)" width="90" x="447" y="34">
    <parameter key="metadata_key" value="WortanzahlSatz"/>
    <parameter key="condition" value="all"/>
    <parameter key="case_sensitive" value="false"/>
    <parameter key="invert_condition" value="false"/>
    </operator>
    <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
    <connect from_op="Tokenize (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_op="Extract Token Number (3)" to_port="document"/>
    <connect from_op="Extract Token Number (3)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="Sentence Cleanup" width="90" x="581" y="136">
    <process expanded="true">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (4)" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="text"/>
    <parameter key="attributes" value="text"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.6.001" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (5)" width="90" x="313" y="34">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="regular"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="sentence"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (4)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="sentence"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="715" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (5)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="attn"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <connect from_port="in 1" to_op="Select Attributes (4)" to_port="example set input"/>
    <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Set Role (5)" to_port="example set input"/>
    <connect from_op="Set Role (5)" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Nominal to Text (4)" to_port="example set input"/>
    <connect from_op="Nominal to Text (4)" from_port="example set output" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Select Attributes (5)" to_port="example set input"/>
    <connect from_op="Select Attributes (5)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (9)" width="90" x="581" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="Anzahl_Sätze_Text"/>
    <parameter key="attributes" value="id|Anzahl_Saetze_Text|ConfluenceTitle"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" breakpoints="after" class="join" compatibility="7.6.001" expanded="true" height="82" name="Join" width="90" x="849" y="34">
    <parameter key="remove_double_attributes" value="true"/>
    <parameter key="join_type" value="inner"/>
    <parameter key="use_id_attribute_as_key" value="true"/>
    <list key="key_attributes"/>
    <parameter key="keep_both_join_attributes" value="false"/>
    </operator>
    </process>

    I use the Multiply operator to "keep" the attributes and route them past the Transpose step. I think that is the "magic" point where the additional rows have to be generated...

    The problem, of course, is that the attributes which vanish after transposing have only one value (one row) before the text is split into sentences. So I think something like a loop is needed to fill in the remaining rows. I found such an operator but have not yet figured out how to use it properly. Maybe I'm on the wrong track here, though...
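
    For illustration only, the "fill the rest of the rows" idea can be written without a loop operator: when there is exactly one document row, pairing it with every sentence row is a cross join. The sketch below is plain Python, not RapidMiner, and the two sentence rows and the field values are taken from the sample data in this thread. Whether RapidMiner's Cartesian Product operator behaves the same way in this process is an assumption that has not been checked here.

```python
from itertools import product

# Sentence rows as they might look after the transpose step
# (two rows copied from the sample output in this thread).
sentences = [
    {"id": 11, "text": "Andernfalls wird man erhebliche Probleme bekommen"},
    {"id": 22, "text": "Bei der Anlage einer neuen Instanz MSSQL Server "
                       "sind die Database Engine Services ausreichend"},
]

# The single document-level row that exists before the split.
document = [
    {"Anzahl_Saetze_Text": 81, "ConfluenceTitle": "BI - Installation - Cognos"},
]

# Cross join: each sentence row is merged with the one document row,
# so the document-level values are repeated on every sentence row.
rows = [{**sentence, **doc} for sentence, doc in product(sentences, document)]

for row in rows:
    print(row)
```

    This only works as long as there is a single document. With several documents, each sentence row needs a key that points back to its document.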

    Any hints much appreciated.

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    For some reason I can't add a reply here. It vanishes every time I try to post it, so I'll try splitting it into smaller posts.

    First, thanks for replying.

    This is what I currently get (shown here as CSV):

    att_1.0;WortanzahlSatz;text;id
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).

    What I would ideally like to have (again as a "handmade" CSV that extends the current result):

    att_1.0;WortanzahlSatz;text;id;Anzahl_Saetze_Text;ConfluenceSpaceName;ConfluenceTitle
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0;81.0;Cognos;BI - Installation - Cognos
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0;81.0;Cognos;BI - Installation - Cognos
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0;81.0;Cognos;BI - Installation - Cognos
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0;81.0;Cognos;BI - Installation - Cognos
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0;81.0;Cognos;BI - Installation - Cognos
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0;81.0;Cognos;BI - Installation - Cognos
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0;81.0;Cognos;BI - Installation - Cognos

    The additional attributes, all of which I already have at the start of the process, mean the following: "Anzahl_Saetze_Text" is the number of sentences in the whole text that comes from the repository (at the moment only one document, to keep computation time down), "ConfluenceSpaceName" is the source "folder" of the text (actually the wiki space the text was extracted from), and "ConfluenceTitle" is the header/title of the document.
    I would already be happy if I could get the number of sentences in the whole text as another column in the final dataset, but as I said, having all of the attributes would be ideal. :-)
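
    To make the target table concrete, here is a minimal sketch in plain Python (not RapidMiner) of how each sentence row can keep a reference to its document, so that the document-level attributes can be repeated per sentence. Several things are assumptions for illustration: the regex splitter is only a naive stand-in for the linguistic sentence tokenizer, the `doc_id` key, the helper names, the short sample texts, and the second document ("Demo" / "Beispielseite") are made up, and only the attribute names and the first document's space and title come from the data above.

```python
import csv
import io
import re

# Hypothetical input: one row per document, with its metadata and full text.
documents = [
    {"doc_id": 1, "ConfluenceSpaceName": "Cognos",
     "ConfluenceTitle": "BI - Installation - Cognos",
     "text": "Erster Satz. Zweiter Satz mit mehr Worten."},
    {"doc_id": 2, "ConfluenceSpaceName": "Demo",
     "ConfluenceTitle": "Beispielseite",
     "text": "Nur ein Satz."},
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_words(sentence, min_chars=2):
    """Count word tokens with at least min_chars characters."""
    return sum(1 for w in re.findall(r"\w+", sentence) if len(w) >= min_chars)


rows = []
for doc in documents:
    sentences = split_sentences(doc["text"])
    for sentence in sentences:
        # Every sentence row carries the key and metadata of its document.
        rows.append({
            "doc_id": doc["doc_id"],
            "WortanzahlSatz": count_words(sentence),
            "text": sentence,
            "Anzahl_Saetze_Text": len(sentences),
            "ConfluenceSpaceName": doc["ConfluenceSpaceName"],
            "ConfluenceTitle": doc["ConfluenceTitle"],
        })

# Write the result as semicolon-separated CSV, like the samples above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter=";")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

    The point of the sketch is the `doc_id` column: once every sentence row has it, any document-level attribute can be attached by a join on that key instead of being pushed through the transpose.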

    So the current process looks like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Trainingset" width="90" x="45" y="34">
    <parameter key="repository_entry" value="../Data/SingleArticle_Training"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="179" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="Satzinhalt"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (3)" width="90" x="313" y="34">
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="9999"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="Inhaltstext" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="45" y="34">
    <parameter key="extract_content" value="true"/>
    <parameter key="minimum_text_block_length" value="3"/>
    <parameter key="override_content_type_information" value="true"/>
    <parameter key="neglegt_span_tags" value="true"/>
    <parameter key="neglect_p_tags" value="true"/>
    <parameter key="neglect_b_tags" value="true"/>
    <parameter key="neglect_i_tags" value="true"/>
    <parameter key="neglect_br_tags" value="true"/>
    <parameter key="ignore_non_html_tags" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="380" y="34">
    <parameter key="mode" value="linguistic sentences"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Anzahl_Saetze_Text" width="90" x="648" y="34">
    <parameter key="metadata_key" value="Anzahl_Saetze_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (6)" width="90" x="380" y="136">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="514" y="136">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.5.000" expanded="true" height="68" name="Wortzahl_Text" width="90" x="648" y="136">
    <parameter key="metadata_key" value="Wortzahl_Text"/>
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="join_paths" compatibility="7.6.001" expanded="true" height="103" name="Join Paths" width="90" x="849" y="34"/>
    <connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
    <connect from_op="Extract Content (2)" from_port="document" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Tokenize (3)" to_port="document"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Tokenize (6)" to_port="document"/>
    <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Anzahl_Saetze_Text" to_port="document"/>
    <connect from_op="Anzahl_Saetze_Text" from_port="document" to_op="Join Paths" to_port="input 1"/>
    <connect from_op="Tokenize (6)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Wortzahl_Text" to_port="document"/>
    <connect from_op="Wortzahl_Text" from_port="document" to_op="Join Paths" to_port="input 2"/>
    <connect from_op="Join Paths" from_port="output" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="187"/>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value="id"/>
    <parameter key="attributes" value="Title|Language|Description|Keywords|Robots|Id"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (4)" width="90" x="581" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="select_attributes_and_weights" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (4)" width="90" x="112" y="34">
    <parameter key="mode" value="linguistic tokens"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="German"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="246" y="34">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number (3)" width="90" x="447" y="34">
    <parameter key="metadata_key" value="WortanzahlSatz"/>
    <parameter key="condition" value="all"/>
    <parameter key="case_sensitive" value="false"/>
    <parameter key="invert_condition" value="false"/>
    </operator>
    <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
    <connect from_op="Tokenize (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_op="Extract Token Number (3)" to_port="document"/>
    <connect from_op="Extract Token Number (3)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="Sentence Cleanup" width="90" x="581" y="136">
    <process expanded="true">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (4)" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="text"/>
    <parameter key="attributes" value="text"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.6.001" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (5)" width="90" x="313" y="34">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="regular"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="sentence"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text (4)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="sentence"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="715" y="34">
    <parameter key="create_nominal_ids" value="false"/>
    <parameter key="offset" value="0"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (5)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="attn"/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <connect from_port="in 1" to_op="Select Attributes (4)" to_port="example set input"/>
    <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Set Role (5)" to_port="example set input"/>
    <connect from_op="Set Role (5)" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Nominal to Text (4)" to_port="example set input"/>
    <connect from_op="Nominal to Text (4)" from_port="example set output" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Select Attributes (5)" to_port="example set input"/>
    <connect from_op="Select Attributes (5)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (9)" width="90" x="581" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="Anzahl_Sätze_Text"/>
    <parameter key="attributes" value="id|Anzahl_Saetze_Text|ConfluenceTitle"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <operator activated="true" breakpoints="after" class="join" compatibility="7.6.001" expanded="true" height="82" name="Join" width="90" x="849" y="34">
    <parameter key="remove_double_attributes" value="true"/>
    <parameter key="join_type" value="inner"/>
    <parameter key="use_id_attribute_as_key" value="true"/>
    <list key="key_attributes"/>
    <parameter key="keep_both_join_attributes" value="false"/>
    </operator>
    </process>

    I use the Multiply operator to "keep" the attributes and pass them around the transpose step. I think that is then the "magic" point where the rows have to be generated...

    The thing is, of course, that the attributes which vanish after transposing have only one value (row) before the text is split into sentences. So I think something like a loop is needed to fill the remaining rows. I found such an operator, but have not yet figured out how to use it properly. Maybe I'm also on the wrong track here...

    Any hints much appreciated.

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    I'll give it another try, this time split up into several posts...

    My current result (as CSV):

     

    att_1.0;WortanzahlSatz;text;id
    1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0
    1.0;35.0;Ab dem Windows Server 2012 ist .NET 3.5 welches auch .NET 2.0 enthält welches für BI wie auch für den CCR relevant ist nicht mehr im Standardinstallationsumfang enthalten sondern liegt nur auf der Installations-CD vor ;15.0
    1.0;14.0;Alternativ kann man dies auch über die IIS Management Konsole machen dazu folgende Galerie ;16.0
    1.0;19.0;Anlage der virtuellen Applikation 'cgi-bin' Nun wird mittels Rechtsklick auf den angelegten Ordner 'ibmcognos' eine neue virtuelle Applikation angelegt ;17.0
    1.0;21.0;Anlage virtuelles Verzeichnis 'ibmcognos' Man öffnet die IIS Mangement Konsole und legt nun zunächst das virtuelle Verzeichnis für den Webzugriff an ;18.0
    1.0;27.0;Anpassung der URL bei Verwendung von ISAPI Wird anstatt der CGI-Konfiguration ISAPI verwendet muß die default Einstellung in der Cognos Configuration bei den Umgebungsparametern noch geändert werden ;19.0
    1.0;16.0;Auf dem Applikationsserver werden nun zunächst die Rollen 'Applikationsserver' und 'Webserver' installiert hier werden neben der ;20.0
    1.0;19.0;Auf einem bestehenden Server oder bei einer Migration von einem alten System muß unbedingt die Collation übernommen werden bspw. ;21.0
    1.0;14.0;Bei der Anlage einer neuen Instanz MSSQL Server sind die Database Engine Services ausreichend ;22.0

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).
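
A minimal sketch, in plain Python rather than RapidMiner, of how such a semicolon-separated export can be read and the word count of a row checked. The header and the first data row are copied from the CSV above. Splitting on whitespace is an assumption made for illustration; the tokenizer and token filter in the actual process may count differently for other rows.

```python
import csv
import io

# Header and first data row copied from the CSV above.
raw = (
    "att_1.0;WortanzahlSatz;text;id\n"
    "1.0;6.0;Andernfalls wird man erhebliche Probleme bekommen ;11.0\n"
)

# The export uses ';' as the column separator.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

for row in rows:
    stored_count = int(float(row["WortanzahlSatz"]))  # stored as "6.0"
    words = row["text"].split()  # whitespace split (illustrative assumption)
    print(row["id"], stored_count, len(words))
```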

  • xt_crayxt_cray Member Posts: 20 Contributor I

    Another try, this time with two CSV files attached.

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).

    What I would ideally like to have is shown in idealresult.csv.

    The meaning of the additional attributes (which I already have at the beginning): "Anzahl_Saetze_Text" is the number of sentences in the whole text that comes from the repository (only one document at the moment, to keep computation time down), "ConfluenceSpaceName" is the "folder" the text originates from (actually a space in the wiki I extracted the text from), and "ConfluenceTitle" is the header/title of the document.
    However, I would already be happy if I could get the number of sentences in the whole text as another column with values in the final dataset. As said, having all attributes would be ideal. :-)

     

    I use a Multiply operator to try to "keep" the attributes and pass them around the transpose step. I think that is then the "magic" point where the rows have to be generated... The thing is, of course, that the attributes which vanish after transposing have only one value (row) before the text is split into sentences. So I think something like a loop is needed to fill the remaining rows. I found such an operator, but have not yet figured out how to use it properly. Maybe I'm also on the wrong track here...

     

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    Another try, with two CSVs and my process attached as a PNG (maybe that's why it's not stored...).

    The attribute "WortanzahlSatz" is actually the number of words in the corresponding sentence (each row).

    What I would ideally like to have is shown in the CSV "idealresult.csv".

    The meaning of the additional attributes (which I already have at the beginning): "Anzahl_Saetze_Text" is the number of sentences in the whole text that comes from the repository (only one document at the moment, to keep computation time down), "ConfluenceSpaceName" is the "folder" the text originates from (actually a space in the wiki I extracted the text from), and "ConfluenceTitle" is the header/title of the document.
    However, I would already be happy if I could get the number of sentences in the whole text as another column with values in the final dataset. As said, having all attributes would be ideal. :-)

    I use the Multiply operator to "keep" the attributes and pass them around the transpose step. I think that is then the "magic" point where the rows have to be generated...

    The thing is, of course, that the attributes which vanish after transposing have only one value (row) before the text is split into sentences. So I think something like a loop is needed to fill the remaining rows. I found such an operator, but have not yet figured out how to use it properly. Maybe I'm also on the wrong track here...
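
To make the "fill the rest of the rows" idea concrete, here is a minimal sketch in plain Python, not a RapidMiner process: the document-level attributes live in a single row, and a copy of them is attached to every sentence row. The attribute names come from the CSVs; the attribute values and the helper name `attach_document_attributes` are invented for illustration.

```python
def attach_document_attributes(document, sentences):
    """Copy the single-row document attributes onto every sentence row."""
    return [{**sentence, **document} for sentence in sentences]


# One row of document-level attributes (values are invented placeholders).
document = {
    "Anzahl_Saetze_Text": 2,
    "ConfluenceSpaceName": "ExampleSpace",
    "ConfluenceTitle": "Example title",
}

# One row per sentence, as it looks after splitting and transposing.
sentences = [
    {"id": 11, "WortanzahlSatz": 6,
     "text": "Andernfalls wird man erhebliche Probleme bekommen"},
    {"id": 22, "WortanzahlSatz": 14,
     "text": "Bei der Anlage einer neuen Instanz MSSQL Server sind die "
             "Database Engine Services ausreichend"},
]

rows = attach_document_attributes(document, sentences)
for row in rows:
    print(row["id"], row["ConfluenceTitle"], row["Anzahl_Saetze_Text"])
```

With more than one document, both tables would need a shared document key to match on. Matching on the per-sentence id alone would pair each document row with at most one sentence.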

     

    Oliver

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @xt_cray - you were having trouble posting because you kept hitting the spam filter  :)

     

    Scott

  • xt_crayxt_cray Member Posts: 20 Contributor I

    @sgenzer wrote:

    hello @xt_cray - you were having trouble posting because you kept hitting the spam filter  :)

     

    Scott


    Hello @sgenzer,

    thank you for the information, but why is that? Because of the content? Are there certain things one should not post, like the CSV I included?

     

    Oliver

  • xt_crayxt_cray Member Posts: 20 Contributor I

    hi @kayman,

     

    thank you very much! That's what I was looking for. One thing I'd like explained, though: why does the "out" line give me the table I want, while the example output line reduces it again to one row with a word count of just "1"?

     

    Oliver

  • kaymankayman Member Posts: 662 Unicorn

    harder to explain than to understand :-)

     

    The out port basically delivers the result of the inner process, whereas the example port passes on the original input set that you feed in to start the inner process, in this case the iteration.

     

    Something like that...
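
As a rough analogy only, here is a toy Python stand-in for a looping operator with two outputs. It is not RapidMiner's API: `run_loop` and `split_into_sentences` are hypothetical names, the sample text is invented, and the naive split on "." is just for illustration. It shows why one output still has a single row (the untouched input) while the other has one row per sentence (the collected inner results).

```python
def run_loop(example_set, inner_process):
    """Toy loop: returns (pass-through input, collected inner results)."""
    out = []
    for example in example_set:
        out.extend(inner_process(example))  # gather what the inner step yields
    return example_set, out


def split_into_sentences(example):
    """Naive sentence split that keeps the document title on every row."""
    parts = [p.strip() for p in example["text"].split(".")]
    return [
        {"ConfluenceTitle": example["ConfluenceTitle"], "sentence": p}
        for p in parts
        if p
    ]


# A single input row, i.e. one document (invented sample values).
documents = [
    {"ConfluenceTitle": "Example title",
     "text": "First sentence. Second sentence."},
]

passed_through, out = run_loop(documents, split_into_sentences)
print(len(passed_through), len(out))
```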

  • xt_crayxt_cray Member Posts: 20 Contributor I

    Ok - well, that may take some time to understand, but thanks for the explanation. Maybe continued work with the software will give me more insight over time.

     

    Oliver

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @xt_cray - no idea why those messages hit the spam filter but I just released them.  Now you see lots of duplicates etc... Sorry about that.  If you ever have problems like that again, just PM me.

     

    Scott
