YouTube scraper - Python operator help needed
I am trying to use a Python scraper to collect URLs from a YouTube channel. The Python script works when I run it from the terminal, but not when I use it inside the Execute Python operator. My Python is not the strongest, so I would appreciate any pointers on where I am going wrong.
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.10.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.10.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="https://www.youtube.com/channel/UCrla0VcTLEhGb03wsosZ6SQ"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">This is generated data. In prod I read the data from a DB.</description>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
        <parameter key="script" value="import pandas&#10;&#10;def rm_main(data):&#10;    path = &quot;clarifai.csv&quot;&#10;    # store example data in a file&#10;    data.to_csv(path)&#10;    myfile = open(path,'r')&#10;    # return the file&#10;    return(myfile)&#10;"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
        <parameter key="use_macros" value="false"/>
        <description align="center" color="transparent" colored="false" width="126">Save example data in a csv file and hand the file over to the next Python operator</description>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python (2)" origin="GENERATED_TUTORIAL" width="90" x="313" y="34">
        <parameter key="script" value="import pandas as pd&#10;from requests_html import HTMLSession&#10;&#10;def rm_main(myfile):&#10;&#9;print('I received the following data set:')&#10;&#9;data = pd.read_csv(myfile, header=None, names=list(('id', 'url')))&#10;&#9;urls = data['url'].tolist()&#10;&#9;for url in urls:&#10;&#9;&#9;session = HTMLSession()&#10;&#9;&#9;add = url&#10;&#9;&#9;response = session.get(add)&#10;&#9;&#9;response.html.render(sleep=1, keep_page = True, scrolldown = 2)&#10;&#9;&#9;for links in response.html.find('a#video-title'):&#10;&#9;&#9;&#9;link = next(iter(links.absolute_links))&#10;&#9;&#9;&#9;print(link)&#10;"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
        <parameter key="use_macros" value="false"/>
        <description align="center" color="transparent" colored="true" width="126">Read data from YouTube and return link output</description>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_op="Execute Python (2)" to_port="input 1"/>
      <connect from_op="Execute Python (2)" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Best Answer
MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
Hi @Robi_Me:
I tried your code and it seems to work on my end after I installed the requests-html package (https://anaconda.org/conda-forge/requests-html) into my Anaconda environment.
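If the script runs from your terminal but not inside the operator, it can help to check which interpreter the operator actually uses and whether that interpreter can see the packages the scraper imports. Below is a minimal diagnostic sketch. The helper name `check_packages` is made up for illustration, and it assumes an Execute Python operator with no inputs connected, so `rm_main` takes no arguments.

```python
import importlib.util
import sys


def check_packages(package_names):
    """Map each top-level package name to True if this interpreter can import it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


def rm_main():
    # Hypothetical diagnostic entry point: assumes no input ports are connected.
    # The interpreter path shows which Python environment the operator runs in.
    print("Interpreter:", sys.executable)
    # These are the two packages imported by the scripts in the question.
    for name, available in check_packages(["pandas", "requests_html"]).items():
        print(name, "found" if available else "MISSING")
```

If `requests_html` is reported as missing, the operator is probably using a different environment from the one in your terminal. Installing the package into that environment (typically `conda install -c conda-forge requests-html` for the conda-forge build linked above) should bring the two in line.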