Text Mining classification problem with two data sets

mschmidkon Member Posts: 2 Contributor I
edited July 2019 in Help
I have an issue with text mining and keyword-based classification across two data sets. The goal is to classify products according to their textual descriptions.

I've got two data sets. The first contains a unique identifier (a number representing a product) and four columns of text describing that product (short/long text description etc.). The second contains two columns: the first is text describing a label for classification, and the second is a classification code. The goal is to classify the products from data set 1 according to the second data set. To do that, identical word occurrences have to be identified, and the classification code whose label shares the most words with the product should be chosen. The process should take one product from the first data set and compare it against all labels from the second data set in order to find the best-fitting label.
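To make the intended matching concrete, here is a minimal Python sketch of the idea outside RapidMiner. It is only an illustration under assumptions: the function names (`tokenize`, `best_label`), the product id and the sample texts are hypothetical, and it skips stop-word filtering, stemming and n-grams.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_label(product_texts, labels):
    """Return (code, score) for the label sharing the most words with the product.

    product_texts: the description strings of one product (the four text columns)
    labels: (label_text, classification_code) pairs from the second data set
    """
    # Count every token across all description columns of the product.
    product_counts = Counter()
    for text in product_texts:
        product_counts.update(tokenize(text))

    best_code, best_score = None, 0
    for label_text, code in labels:
        # Score = how often the label's distinct words occur in the product text.
        score = sum(product_counts[t] for t in set(tokenize(label_text)))
        if score > best_score:
            best_code, best_score = code, score
    return best_code, best_score

# Hypothetical sample data, not taken from the real files.
products = {1001: ["steel hex bolt", "Hex bolt made of stainless steel, M8", "", ""]}
labels = [("steel bolt screw", "A10"), ("plastic pipe", "B20")]

result = {pid: best_label(texts, labels) for pid, texts in products.items()}
```

A product that shares no word with any label gets `(None, 0)`, so unmatched products stay visible instead of being forced into a class.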
I created a RapidMiner process which reads the two CSV files separately and converts the input with 'Process Documents from Data', including Tokenizing, Filter Stopwords, Stem and Generate n-Grams. The result set contains the occurrences of the tokenized words. Now I want to compare the result sets of the two data sets (they don't have the same number of attributes in the same order, but some attributes are identical) with the goal of finding 'similar' words and classifying the product. Does anybody know how to compare these two data sets with an operator from RapidMiner and how to classify these products?
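What I mean by comparing two result sets with different attributes could look roughly like the Python sketch below. It is not a RapidMiner operator, and cosine similarity is just one possible measure that I am assuming here. Each row is treated as a sparse word-count dictionary, so words missing from one side simply count as zero. The vectors and codes are made-up examples.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse word-count dicts; absent words count as 0."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is not similar to anything
    return dot / (norm_a * norm_b)

# Hypothetical word counts for one product and two labels.
product_vec = {"steel": 2, "bolt": 2, "hex": 2}
label_vecs = {
    "A10": {"steel": 1, "bolt": 1, "screw": 1},
    "B20": {"plastic": 1, "pipe": 1},
}

# Score the product against every label and keep the most similar code.
scores = {code: cosine(product_vec, vec) for code, vec in label_vecs.items()}
best_code = max(scores, key=scores.get)
```

Because only the shared words enter the dot product, the two tables never have to be brought into the same column order first.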

Thank you very much!

