
Data preparation for RapidMiner

South2wood Member Posts: 5 Contributor II
edited November 2018 in Help
Hello,

I am currently working on my diploma thesis. The task is to develop a metric that describes the competition in search engine rankings for a specific keyword. For example, an online marketer wants to optimize his website for the keyword “car sale” or “buy car”, but his budget is limited, so he has to choose between the two. The model I am going to develop should tell him which of the two keywords is easier to optimize for.

So we pulled Google search data for 10,000 keywords. My first problem: all data should be in one big table where one row is a unique example, right? At the moment I have three CSV files.

File 1: Keyword set and some attributes (10,000 rows)
One row looks like this:
[keyword] [keyword_attribute_1] … [keyword_attribute_n]

File 2: Search result data. For each keyword we pulled the top 30 Google search results (10,000 keywords × 30 ranking positions = 300,000 rows).
One row looks like this:
[keyword] [Position in Google Search Results] [attribute_1] …[for example: url] … [attribute_n]
… the problem: for each keyword I have 30 rows (one per ranking position):
[keyword] [1] [pos1_attribute_1] … [pos1_attribute_n]
[keyword] [2] [pos2_attribute_1] … [pos2_attribute_n]
[keyword] [3] [pos3_attribute_1] … [pos3_attribute_n]

[keyword] [30] [pos30_attribute_1] … [pos30_attribute_n]
… and then the next keyword with its 30 positions.

File 3: Hosting domain data
From the URLs in File 2 we took the host and pulled some data about the domain, e.g. PageRank, age, …
One row looks like this:
[Host URL] [host_attribute_1] … [host_attribute_n]

What is the right thing to do now? The keywords are the unique IDs of my example set. So should I merge the three files into one big table with 10,000 rows (one for each keyword), where each row holds all the attributes? The columns would look like this:
[keyword]
[keyword_attribute_1] … [keyword_attribute_n]
[pos1_attribute_1] … [pos1_attribute_n]
[pos1_host_data]
[pos2_attribute_1] … [pos2_attribute_n]
[pos2_host_data]
[pos3_attribute_1] … [pos3_attribute_n]
[pos3_host_data]

[pos30_attribute_1] … [pos30_attribute_n]
[pos30_host_data]

This would result in a table with 10,000 rows and approximately 400 columns. To me this is huge, but I am new to data mining, so maybe it has to be this way. Could someone give me some advice on how to do the merging?
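
To make the target structure concrete, here is roughly what I imagine in pandas terms (the file and column names are just placeholders for my real ones):

```python
import pandas as pd

# Placeholder file and column names -- the real CSV headers differ.
keywords = pd.read_csv("keywords.csv")        # File 1: one row per keyword
results = pd.read_csv("search_results.csv")   # File 2: one row per keyword x position
hosts = pd.read_csv("hosts.csv")              # File 3: one row per host

# Attach the host attributes to every search result via the host URL.
results = results.merge(hosts, on="host_url", how="left")

# Pivot the 300,000-row result table so each keyword gets one row and
# every result attribute becomes a pos<N>_<attribute> column.
wide = results.pivot(index="keyword", columns="position")
wide.columns = [f"pos{pos}_{attr}" for attr, pos in wide.columns]

# Join the keyword attributes: 10,000 rows, roughly 400 columns.
merged = keywords.merge(wide.reset_index(), on="keyword", how="left")
```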

Best regards,
Matthias

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Matthias,

    How to represent the data depends strongly on what exactly you are planning to do. It is correct that you need one row per example in a data set. But what is one example? Without knowing the details of your problem, I see two possibilities. The first is the one you proposed, i.e. all data for one keyword in one row. 400 attributes is not huge (but obviously not small either), and you can reduce the number with a Feature Selection, or by weighting the attributes with the operators in the Modeling/Attribute Weighting group and cutting off below a certain level of importance.

    Another representation might be to store the keyword properties together with each search result, along with the ranking of the result. That way you get more examples with fewer columns; whether that is reasonable depends on your further processing.

    You have certainly already found the Join operator, which will be helpful for joining the data together. Additionally, you should consider importing the data from the CSV files into the repository. That way subsequent access will be faster, and the repository also stores metadata such as attribute roles, types, etc. Of course you can also store intermediate results in the repository, e.g. your preprocessed and joined data sets.
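
    For illustration only, with assumed file and column names (outside RapidMiner), the second representation boils down to two joins; the Join operator plays the role of merge() here:

    ```python
    import pandas as pd

    # Assumed file and column names, purely for illustration.
    keywords = pd.read_csv("keywords.csv")       # 10,000 rows
    results = pd.read_csv("search_results.csv")  # 300,000 rows (keyword x position)
    hosts = pd.read_csv("hosts.csv")             # one row per host

    # One example per (keyword, position): each search result keeps its
    # ranking position and gains the keyword and host attributes as columns.
    long_table = (results
                  .merge(keywords, on="keyword", how="left")
                  .merge(hosts, on="host_url", how="left"))
    ```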

    Cheers, Marius
  • South2wood Member Posts: 5 Contributor II
    Thank you for your reply.

    My problem is the following: I selected a couple of attributes to describe the value of a keyword, for example cost per click / search volume: if this ratio is high, the value of the keyword is high. Now I have to find a function hidden inside my data set that assigns a number between 0 and 1 (type real) to each keyword based on the keyword's attributes. This number should tell whether the value of a keyword is high or low compared to the other keywords.
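
    Just to make this concrete (the column names are only placeholders, and this is certainly not the model I am after), the simplest such function I can think of is the ratio itself, min-max scaled into [0, 1]:

    ```python
    import pandas as pd

    # Placeholder column names: "cpc" = cost per click, "search_volume".
    keywords = pd.read_csv("keywords.csv")

    ratio = keywords["cpc"] / keywords["search_volume"]

    # Min-max scaling maps the ratio onto [0, 1]:
    # 1 = most valuable keyword in this set, 0 = least valuable.
    keywords["value_score"] = (ratio - ratio.min()) / (ratio.max() - ratio.min())
    ```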

    Which algorithm should I use? I have read that I have to separate my data into training, test and validation data. So should I use a clustering algorithm to calculate the similarity of my training examples? Then I would apply the algorithm to the test and validation sets.

    But how do I get an algorithm that can also be translated into a programming language and used to calculate the metric on unknown examples?

    Questions over questions. Maybe someone could point me in the right direction.

    Thank you.

    Best regards,
    Matthias
     
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    The choice of algorithm is not easy and often requires testing several approaches. It sounds like what you want is some kind of feature weighting algorithm. You could try one of the various weighting operators and have a look at whether their output is valuable to you. Also, many learning algorithms output attribute weights, e.g. the SVM.

    Once you have some kind of (weighting) model, you have to evaluate it. For that, as you correctly stated, you separate your data into a training set and a test set (the test set is also known as the validation set). In RapidMiner you can use the X-Validation operator for this, or the Split Validation.
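
    Purely as an illustration of the idea outside RapidMiner (a sketch with scikit-learn, an assumed joined table and an assumed numeric label column): hold out a test set, fit a linear SVM, and read its absolute coefficients as attribute weights.

    ```python
    import pandas as pd
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVR

    # Assumed setup: "joined.csv" is the merged example set, all attributes
    # are numeric, and "label" is whatever target you can define per keyword.
    data = pd.read_csv("joined.csv")
    X = data.drop(columns=["keyword", "label"])
    y = data["label"]

    # Split Validation: one fixed train/test split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = SVR(kernel="linear").fit(X_train, y_train)
    print("test R^2:", model.score(X_test, y_test))

    # Attribute weights from the linear SVM's coefficients.
    weights = pd.Series(abs(model.coef_[0]), index=X.columns).sort_values(ascending=False)
    print(weights.head(10))

    # X-Validation: 10-fold cross-validation on the whole data set.
    print("CV R^2:", cross_val_score(SVR(kernel="linear"), X, y, cv=10).mean())
    ```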

    If one of the weighting operators works, you already have an algorithm and a program :) You can embed RapidMiner processes into your own Java application. If, however, your task is to invent and implement an algorithm of your own, you will have to write your own operators (assuming that you want to stick to RapidMiner). Then again, you have your own code :)

    Best regards,
    Marius