Extracting wrapped cells from a CSV file

Thomas_Ott · August 2018

Ok, here's one that's stumping me. I'm attaching a CSV file that has some Instragram comments. The file is generated from a python script that I'll eventually embed into RM but right now I'm just trying to get this to work. My goal is to use these comments for some text processing later on.

My problem is correctly parsing the entire comment from each cell. For some strange reason, I can't seem to do it correctly. The text seems to be wrapped in the cell. Any ideas on how to do this?

lionelderkrikor · August 2018

Hi @Thomas_Ott,

Here, with the Nominal to Text operator between the Read CSV and Process Document from Data /Tokenize operators,

a TF_IDF dataset is well generated :

"...The text seems to be wrapped in the cell..."

What problem do you encounter precisely ?

Regards,

Lionel

Thomas_Ott · August 2018

The problem is when I use the import wizard for Read CSV, the preview doesn’t look right.

lionelderkrikor · August 2018

Hi again @Thomas_Ott,

Indeed, RapidMiner has difficulties to read correctly your file.

So, once again, I propose a Python script as a possible solution (I used the read_csv function and filtered

the "carriage returns" which cause some problems in RapidMiner) :

The process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="112" y="85">
        <parameter key="script" value="import pandas as pd&#10;import re&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;path = 'C:/Users/Lionel/Documents/Formations_DataScience/Rapidminer/Tests_Rapidminer/Text_Mining_Thomas_Ott/out2.csv'&#10;&#10;&#10;&#10;def rm_main():&#10;&#10;  data = pd.read_csv(path)&#10;  data.iloc[0] = data.iloc[0].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)&#10;  data['0'] = data['0'].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)&#10;  data = data.replace(r'\n','', regex=True)&#10;  data = data.replace(r'\r','', regex=True)&#10;   &#10;    # connect 2 output ports to see the results&#10;  return data"/>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

To execute this process, don't forget to set the path (where your file is stored) in the Python script.

I hope it helps,

Regards,

Lionel

Thomas_Ott · August 2018

@lionelderkrikor I see, you used iloc[0] in a dataframe. So RM loads in one row at a time, hence you can use iloc[0] instead of looping. Correct?

I wonder if the trick is to write the data out correctly in the first place and avoid all this mess. Do you happen to know of a 'remove carriage return option' when writing out the dataframe?

Before you ask, why not skip exporting the CSV and just pass it through the RM process, I get that but the requirement is to spit out the CSV.

Thanks BTW!

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import json
from pprint import pprint
import csv

df = pd.read_json('https://www.instagram.com/explore/tags/keto/?__a=1')

text_export = []
rawtext = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
for i in rawtext:
    rawtext = i['node']['edge_media_to_caption']['edges']
    #print(rawtext)
    for j in rawtext:
        final_text = j['node']['text']
        df['final_text'] = j['node']['text']
        text_export.append(final_text)
        #print(df['final_text'])

text_out = pd.DataFrame(text_export)
text_out.to_csv('out2.csv')

Thomas_Ott · August 2018

@lionelderkrikor ok, I solved it. The data was incredibly messy. I just striped all white spaces using \s and then just replaced with a whitespace. Kinda stupid but it works.

text_out.replace(r'\s',' ', regex=True, inplace=True)

Now I get everything I want in output file AND I can productionalize things.

Thanks for your help.

SGolbert · August 2018

Hi @Thomas_Ott,

good that you solved it. I've had similar problems when reading CSV generated with R or Python before.

As a general note, I have found that saving your external tables as XML can be advantageous, as the Read XML operator is way more forgiving than the Read CSV one. Then you can retain special characters.

Regards,

Sebastian

Thomas_Ott · August 2018

@SGolbert I haven't thought of doing that. A handy tip indeed.

sgenzer · August 2018

another trick I use in situations like these is to convert the text to URL-encoded text which makes it MUCH easier to strip out unwanted characters, then convert back.

Screen Shot 2018-08-16 at 9.38.54 AM.png

Scott

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Extracting wrapped cells from a CSV file

Answers