Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
Mining source code files
confusedMonMon
Member Posts: 14 Learner III
Hi there,
I'm new to the mining world and what I'm looking for is mining source code files, i.e files written in programming languages. I thought since source codes are textual data then I can find some text mining tool to mine them, and picked RapidMiner as it is one of the most famous text mining tools. Unfortunately, it couldn't read such files. Am I missing something here? do you have any advice on how to mine such files?
Many thanks
I'm new to the mining world and what I'm looking for is mining source code files, i.e files written in programming languages. I thought since source codes are textual data then I can find some text mining tool to mine them, and picked RapidMiner as it is one of the most famous text mining tools. Unfortunately, it couldn't read such files. Am I missing something here? do you have any advice on how to mine such files?
Many thanks
0
Best Answers
-
yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data ScientistHi @confusedMonMon,
if you have source codes files, saying .sql, .c, .py files, you would need to read document operator from text processing extension.<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="open_file" compatibility="9.2.001" expanded="true" height="68" name="Open File" width="90" x="112" y="34"> <parameter key="resource_type" value="URL"/> <parameter key="url" value="https://raw.githubusercontent.com/Marcnuth/AnomalyDetection/master/anomaly_detection/anomaly_detect_vec.py"/> </operator> <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="34"> <parameter key="extract_text_only" value="true"/> <parameter key="use_file_extension_as_type" value="true"/> <parameter key="content_type" value="txt"/> <parameter key="encoding" value="SYSTEM"/> </operator> <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/> <connect from_op="Read Document" from_port="output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
7 -
SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicornhere is an example of the use of the Text Mining extension with the source code of two Python scripts:<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
<parameter key="text" value="# Self Organizing Map # Importing the libraries import numpy as np import matplotlib.pyplot as plt import pandas as pd # Importing the dataset dataset = pd.read_csv('Credit_Card_Applications.csv') X = dataset.iloc[:, 1:-1].values y = dataset.iloc[:, -1].values # Feature Scaling from sklearn.preprocessing import MinMaxScaler sc = MinMaxScaler(feature_range = (0, 1)) X = sc.fit_transform(X) # Training the SOM from minisom import MiniSom som = MiniSom(x = 10, y = 10, input_len = 14, sigma = 1.0, learning_rate = 0.5) som.random_weights_init(X) som.train_random(data = X, num_iteration = 200) # Visualizing the results from pylab import bone, pcolor, colorbar, plot, show bone() pcolor(som.distance_map().T) colorbar() markers = ['o', 's'] colors = ['r', 'g'] for i, x in enumerate(X): w = som.winner(x) plot(w[0] + 0.5, w[1] + 0.5, markers[y[i]], markeredgecolor = colors[y[i]], markerfacecolor = 'None', markersize = 10, markeredgewidth = 2) show() # Finding the frauds mappings = som.win_map(X) frauds = np.concatenate((mappings[(8,1)], mappings[(6,8)]), axis = 0) frauds = sc.inverse_transform(frauds)"/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (2)" width="90" x="112" y="187">
<parameter key="text" value="# Recurrent Neural Network # Part 1 - Data Preprocessing # Importing the libraries import numpy as np import matplotlib.pyplot as plt import pandas as pd # Importing the training set dataset_train = pd.read_csv('Google_Stock_Price_Train.csv') training_set = dataset_train.iloc[:, 1:2].values # Feature Scaling from sklearn.preprocessing import MinMaxScaler sc = MinMaxScaler(feature_range = (0, 1)) training_set_scaled = sc.fit_transform(training_set) # Creating a data structure with 60 timesteps and 1 output X_train = [] y_train = [] for i in range(60, 1258): X_train.append(training_set_scaled[i-60:i, 0]) y_train.append(training_set_scaled[i, 0]) X_train, y_train = np.array(X_train), np.array(y_train) # Reshaping X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1)) # The last "1" corresponds to the number of dataset, for example to include # another related stock in the predictions # Part 2 - Building the RNN # Importing the Keras libraries and packages from keras.models import Sequential from keras.layers import Dense from keras.layers import LSTM from keras.layers import Dropout # Initialising the RNN regressor = Sequential() # Adding the first LSTM layer and some Dropout regularisation regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1))) regressor.add(Dropout(0.2)) # Adding a second LSTM layer and some Dropout regularisation regressor.add(LSTM(units = 50, return_sequences = True)) regressor.add(Dropout(0.2)) # Adding a third LSTM layer and some Dropout regularisation regressor.add(LSTM(units = 50, return_sequences = True)) regressor.add(Dropout(0.2)) # Adding a fourth LSTM layer and some Dropout regularisation regressor.add(LSTM(units = 50)) regressor.add(Dropout(0.2)) # Adding the output layer regressor.add(Dense(units = 1)) # Compiling the RNN regressor.compile(optimizer = 'adam', loss = 'mean_squared_error') # Regression problem # Fitting the RNN to the Training set regressor.fit(X_train, y_train, epochs = 100, batch_size = 32) # Part 3 - Making the predictions and visualising the results # Getting the real stock price of 2017 dataset_test = pd.read_csv('Google_Stock_Price_Test.csv') real_stock_price = dataset_test.iloc[:, 1:2].values # Getting the predicted stock price of 2017 dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0) inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values inputs = inputs.reshape(-1,1) inputs = sc.transform(inputs) X_test = [] for i in range(60, 80): X_test.append(inputs[i-60:i, 0]) X_test = np.array(X_test) X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1)) predicted_stock_price = regressor.predict(X_test) predicted_stock_price = sc.inverse_transform(predicted_stock_price) # undo normalization # Visualising the results plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price') plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price') plt.title('Google Stock Price Prediction') plt.xlabel('Time') plt.ylabel('Google Stock Price') plt.legend() plt.show() "/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
</operator>
<operator activated="true" class="collect" compatibility="9.2.001" expanded="true" height="103" name="Collect" width="90" x="246" y="34">
<parameter key="unfold" value="false"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.2.001" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="34">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="text:remove_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Remove Document Parts" width="90" x="380" y="34">
<parameter key="deletion_regex" value="#(.*?)\n"/>
</operator>
<connect from_port="single" to_op="Remove Document Parts" to_port="document"/>
<connect from_op="Remove Document Parts" from_port="document" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">remove comments</description>
</operator>
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="648" y="34">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="34">
<parameter key="max_length" value="2"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">tokenize, do n-grams and count frequencies<br/></description>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Collect" to_port="input 1"/>
<connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 2"/>
<connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>You would have to work quite a bit in the subprocess of the Process Documents operator to get something useful.Regards,Sebastian5
Answers
Ingo