
Execute Python breaks column if text has commas

MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
edited March 2019 in Help
Hi, I need some help. I'm doing some crawling with Python (I already tried with RapidMiner but didn't get what I wanted in an easy way).
The last column of the DataFrame returns a big chunk of text that describes the product. For some reason, when Execute Python creates the dataset it creates new rows and erases the data that was sent in the DataFrame. I tried writing the info to a file from inside Execute Python, and the outcome is a file with 1 row and 5 columns, as expected.
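One workaround worth trying before returning the DataFrame from `rm_main` is to normalize the long text column so it contains no embedded line breaks. This is a minimal sketch under the assumption that hidden newline/carriage-return characters in the scraped text are what splits the example; the helper name `clean_text_columns` is hypothetical, not part of any RapidMiner API:

```python
import pandas as pd

def clean_text_columns(df):
    # Hypothetical helper: collapse embedded newlines, carriage returns
    # and tabs in every string column into single spaces, so long text
    # survives the hand-off from Execute Python back to an ExampleSet.
    for col in df.select_dtypes(include="object").columns:
        df[col] = (df[col]
                   .str.replace(r"[\r\n\t]+", " ", regex=True)
                   .str.strip())
    return df

# Illustrative row whose description contains embedded line breaks
df = pd.DataFrame({"id": ["1059665339"],
                   "descripcion": ["PlayStation 4 Pro,\n1 TB,\r\nconsole"]})
df = clean_text_columns(df)
```

Calling this as the last step of `rm_main` (i.e. `return clean_text_columns(productos)`) would keep the output to one row, assuming the line breaks are indeed the trigger.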

Here is the process I'm using.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>url</key>
        <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&amp;skuId=1059665339</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="82" name="Execute Python" width="90" x="179" y="34">
        <parameter key="script" value="import requests&#10;from bs4 import BeautifulSoup&#10;import pandas as pd&#10;&#10;def rm_main():&#10;    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}&#10;    columnas=['id','precio_n','precio_d','nombre','descripcion']&#10;    productos=pd.DataFrame(columns=columnas)   &#10;    session = requests.Session()&#10;    url='%{url}'&#10;    session.post(url,headers=headers)&#10;    content=session.get(url)&#10;    soup = BeautifulSoup(content.text,'html.parser')&#10;    precio_normal=soup.find(&quot;input&quot;,id=&quot;listPrice&quot;)&#10;    tipo=soup.find(&quot;a&quot;,class_=&quot;actual&quot;)&#10;    llave=soup.find(&quot;input&quot;,id=&quot;productId&quot;)&#10;    #productId&#10;    #gtmPrice&#10;    #productDisplayName&#10;    precio_descuento=soup.find(&quot;input&quot;,id=&quot;gtmPrice&quot;)&#10;    producto=soup.find(&quot;input&quot;,id=&quot;productDisplayName&quot;)&#10;    descripcion=soup.find(&quot;div&quot;,id=&quot;intro&quot;).find('p').get_text()&#10;    descripcion=descripcion.replace(',', '')&#10;    descripcion=descripcion.replace('', '')&#10;    #print(descripcion)&#10;    fila=[llave['value'],&#10;                          precio_normal['value'],&#10;                          precio_descuento['value'],&#10;                          producto['value'],&#10;                          descripcion&#10;                          ]&#10;    productos.loc[len(productos)]=fila&#10;    return productos"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
        <list key="function_descriptions">
          <parameter key="Fecha" value="date_now()"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="date_to_nominal" compatibility="9.1.000" expanded="true" height="82" name="Date to Nominal" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Fecha"/>
        <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="keep_old_attribute" value="false"/>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/>
      <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Answers

  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Great!!! It works, and yes, it seems to be a bug.
    I'll need to make some changes, since sometimes the crawl may not have that attribute and the number of rows may be dynamic, but your workaround works like a charm.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @MarcoBarradas can you pls be more specific about the bug? I'd like to push it internally but need more detail. I'm not a Python coder... :wink:

    Scott
  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi @sgenzer, the bug is that RapidMiner changes the DataFrame when it converts it to an ExampleSet. This happens when one of the attributes contains a lot of text. In my example the DataFrame has a shape of 1 example with 5 attributes, but once Execute Python ends it returns 3 examples with 5 attributes, and it only returned information for the last attribute, the one that had a lot of text.
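    To help narrow the report down, here is a small diagnostic sketch (the helper name `find_multiline_cells` is made up for illustration) that flags DataFrame cells containing embedded line breaks, which appear to be what splits one example into several when the DataFrame crosses back into RapidMiner:

    ```python
    import pandas as pd

    def find_multiline_cells(df):
        # Hypothetical diagnostic: list (row, column) pairs whose string
        # value contains an embedded newline or carriage return.
        hits = []
        for col in df.select_dtypes(include="object").columns:
            for idx, val in df[col].items():
                if isinstance(val, str) and ("\n" in val or "\r" in val):
                    hits.append((idx, col))
        return hits

    df = pd.DataFrame({"descripcion": ["line one\nline two"], "id": ["1"]})
    print(find_multiline_cells(df))  # → [(0, 'descripcion')]
    ```

    If this prints a non-empty list for a DataFrame that then arrives in RapidMiner with extra examples, that would support the line-break hypothesis.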