"Import into Hive database contains newline characters."

JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
edited May 2019 in Help

Hi guys,

 

When trying to use the Read Database operator within a Radoop Nest, I get the error:
"The nominal attribute 'ABCDE' has a value that contains a newline. Please replace or remove newlines."

 

What is the best way to deal with this? I have tried changing the file format from the default to Avro and others to see if it will get past it, but still no joy.

Is it a bug? 

 

 


Best Answer

  • mborbely Member Posts: 14 Contributor II
    Solution Accepted

    Hi JEdward,

     

    Please note that with this operator, and generally in Radoop, this happens regardless of the selected storage format: if you want to store an example set that hasn't yet been materialized in Hive, all of its records are first written to HDFS in text format and a temporary table is created on top of them. This temporary table is then used to populate the final table with the requested storage format. Therefore changing the storage format won't solve the problem.
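
    Roughly, the flow looks like the following HiveQL (a simplified sketch of the idea, not Radoop's exact statements; the table and column names are made up):

    -- 1. The records are pushed to HDFS as delimited text and a staging table is created on top of them.
    CREATE TABLE tmp_radoop_staging (id INT, comments STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
      STORED AS TEXTFILE;

    -- 2. Only then is the final table, in the requested storage format, created and populated from the staging table.
    CREATE TABLE my_final_table STORED AS ORC AS
      SELECT * FROM tmp_radoop_staging;

    A newline inside a value therefore breaks the text-based staging step no matter which format you pick for the final table.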

     

    Radoop throws this error as a preventive measure, since older Hive versions didn't handle newlines in the data properly. Fortunately, this changed in newer versions (https://issues.apache.org/jira/browse/HIVE-11785), but Radoop has not yet been adapted to these changes. Therefore, even if you use Hive 2.0+, your only option at the moment is to replace these newline characters manually, e.g. in the SQL query of the Read Database operator.
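
    For example, in a MySQL-flavoured source query the newline and carriage-return characters can be stripped like this (the table and column names here are hypothetical):

    SELECT
      id,
      -- remove carriage returns, then turn newlines into spaces
      REPLACE(REPLACE(comments, CHAR(13), ''), CHAR(10), ' ') AS comments
    FROM mySchema.myTable;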

     

    Cheers,

    Máté

Answers

  • asimon Administrator, Moderator, Employee, Community Manager, RapidMiner Certified Master, Member Posts: 8 RM Engineering

    Hi,

    By default the TEXTFILE format is used, which is sensitive to newlines in Radoop. The recommended approach is to switch to the PARQUET format. Did it work for you?

    Attila

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Darn, hopefully it will be something incorporated in the near future. 

     

    I'm actually feeding in my list of database tables as attributes and then using Read Database within a Loop Attributes to import a whole bunch of tables automatically and in parallel. I'll need to try and get creative with my query writing (see the sketch after the process XML below).

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="jdbc_connectors:read_database" compatibility="9.0.002" expanded="true" height="68" name="Read Database" width="90" x="45" y="85">
    <parameter key="connection" value="myDatabase"/>
    <parameter key="query" value="SHOW TABLES"/>
    <enumeration key="parameters"/>
    </operator>
    <operator activated="true" class="concurrency:loop_values" compatibility="9.0.002" expanded="true" height="82" name="Loop Values" width="90" x="179" y="85">
    <parameter key="attribute" value="Tables_in_mySchema"/>
    <process expanded="true">
    <operator activated="true" class="jdbc_connectors:read_database" compatibility="9.0.002" expanded="true" height="68" name="Read Database (2)" width="90" x="112" y="136">
    <parameter key="connection" value="myDatabase"/>
    <parameter key="query" value="SELECT TABLE_NAME, TABLE_ROWS&#10;FROM&#10; information_schema.tables&#10; WHERE&#10; table_name=&quot;%{loop_value}&quot;&#10;AND&#10; table_schema=&quot;mySchema&quot;;"/>
    <enumeration key="parameters"/>
    </operator>
    <connect from_op="Read Database (2)" from_port="output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="9.0.002" expanded="true" height="82" name="Append" width="90" x="313" y="85"/>
    <operator activated="true" class="sort" compatibility="9.0.002" expanded="true" height="82" name="Sort" width="90" x="447" y="85">
    <parameter key="attribute_name" value="TABLE_ROWS"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="9.0.002" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="187">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="TABLE_ROWS.le.1000000"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Only select tables less than 1 million records in InfoSchema. This prevents the Radoop bug for Read Database from overloading the memory of executing machine.</description>
    </operator>
    <operator activated="true" class="transpose" compatibility="9.0.002" expanded="true" height="82" name="Transpose" width="90" x="581" y="187"/>
    <operator activated="true" class="rename_by_example_values" compatibility="9.0.002" expanded="true" height="82" name="Rename by Example Values" width="90" x="581" y="85"/>
    <operator activated="true" class="radoop:radoop_nest" compatibility="9.0.001" expanded="true" height="82" name="Radoop Nest" width="90" x="715" y="85">
    <parameter key="connection" value="myCluster"/>
    <parameter key="change_sample_size" value="true"/>
    <parameter key="sample_size" value="100"/>
    <parameter key="hive_file_format" value="ORC"/>
    <enumeration key="tables_to_reload"/>
    <process expanded="true">
    <operator activated="true" class="radoop:loop_attributes_radoop" compatibility="9.0.001" expanded="true" height="82" name="Loop Attributes (Radoop)" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="TABLE_NAME"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="enable_parallel_execution" value="false"/>
    <process expanded="true">
    <operator activated="true" class="radoop:read_db" compatibility="9.0.001" expanded="true" height="68" name="Read Database (5)" width="90" x="179" y="85">
    <parameter key="connection" value="myDatabase"/>
    <parameter key="define_query" value="table name"/>
    <parameter key="table_name" value="%{loop_attribute}"/>
    <enumeration key="parameters"/>
    <parameter key="temporary_table" value="false"/>
    <parameter key="saved_table_name" value="%{loop_attribute}"/>
    </operator>
    <connect from_op="Read Database (5)" from_port="output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Loop Attributes (Radoop)" to_port="input 1"/>
    <connect from_op="Loop Attributes (Radoop)" from_port="output 1" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Read Database" from_port="output" to_op="Loop Values" to_port="input 1"/>
    <connect from_op="Loop Values" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/>
    <connect from_op="Rename by Example Values" from_port="example set output" to_op="Radoop Nest" to_port="input 1"/>
    <connect from_op="Radoop Nest" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     
