Suggest for this project

ElenaVet · May 2020

Good morning everyone,

I have a dataset structured as follows:

- ts: the date on which the news was published;

- body: the text of the news;

- stock: ticker of the stock to which the news refers (e.g. TWTR: Twitter);

- positive: integer >= 0. A count of the "positive" words, from a financial point of view, found in the news;

- negative: integer >= 0. A count of the "negative" words, from a financial point of view, found in the news.

In particular, I have to carry out:

1) Exploratory data analysis

2) Data analysis techniques, namely:

◼ Association rules

◼ Clustering = Perform multiple analysis sessions with one or more algorithms (e.g., KMeans, DBSCAN) + Evaluate the various quality indexes (e.g., SSE).
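As a rough illustration of the SSE index mentioned above, the sketch below computes the sum of squared errors for a given cluster assignment in plain Python. The function name `sse` and the toy points are made up for this example; it is not how RapidMiner computes its performance measures, only the usual definition (squared distance of each point to the mean of its cluster, summed over all points).

```python
from collections import defaultdict


def sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            (x - c) ** 2 for p in members for x, c in zip(p, centroid)
        )
    return total


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
two_clusters = sse(points, [0, 0, 1, 1])
one_cluster = sse(points, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

A lower SSE for the same number of clusters indicates tighter clusters. SSE always tends to fall as the number of clusters grows, so it is normally compared across several runs rather than read in isolation.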

Do you have any suggestions on where to start and how I should proceed?

Thanks so much!!!

lionelderkrikor · May 2020

Hi @ElenaVet,

1/ You can begin by watching some videos on the RapidMiner Academy:
- about clustering:
https://academy.rapidminer.com/catalog?query=clustering

- about association rules:
https://academy.rapidminer.com/catalog?query=association%20rules

2/ Moreover, RapidMiner includes process templates for association rules (AR) and clustering:

Image: https://us.v-cdn.net/6030995/uploads/editor/yr/3ac3ybcm50oz.png

3/ More generally, you can find many resources by using the search box in the top-right corner of RapidMiner Studio:

Image: https://us.v-cdn.net/6030995/uploads/editor/lk/2km0qb0lvapu.png

Image: https://us.v-cdn.net/6030995/uploads/editor/lm/m5gm1uusdxdg.png

Hope this helps,

Regards,

Lionel

ElenaVet · May 2020

Thank you, @lionelderkrikor!
Your answer is inspirational! Do you think, however, that preprocessing of the textual data is necessary? How would you recommend getting started with it?

lionelderkrikor · May 2020

Hi @ElenaVet,

If I understand correctly, your "body" attribute is a text attribute, so yes, you have to preprocess it
by tokenizing etc. inside a Process Document subprocess to create a "word vector".
To learn how to perform this preprocessing step, you can watch videos on the RapidMiner Academy by searching for "text mining", or
you can look for resources directly inside RapidMiner Studio with the top-right search box, as you did for "clustering" and "association rules".
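To make the idea of a "word vector" concrete, here is a minimal sketch in plain Python of what tokenizing and vectorizing amounts to. It is only an illustration, not the RapidMiner operator itself: the helper names `tokenize` and `word_vectors` and the two sample sentences are invented for the example, and it uses simple term counts with no stop-word removal or stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Vocabulary = every distinct token seen in any document, sorted.
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for terms that do not occur in a document.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical sample texts standing in for the "body" attribute.
docs = ["Twitter shares rise", "Shares fall as Twitter warns"]
vocab, vecs = word_vectors(docs)
print(vocab)
print(vecs)
```

Each document becomes a row of numbers over a shared vocabulary, which is the form that clustering and association rule algorithms can work with.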

Regards,

Lionel

ElenaVet · May 2020

@lionelderkrikor
Thanks a lot! I also noticed that some items aren't in English (for example German, Italian, Spanish and others). How can I select only the English news?

lionelderkrikor · May 2020

Hi @ElenaVet,

You can use the "Text Vectorization" operator:
- select your text attribute (in your case "body", if I understand correctly)
- enable "add language" in the parameters of this operator
- the operator will generate an attribute called "language" whose values depend on the language of each news item: english, italian, spanish etc.
- then use a Filter Examples operator to keep only the examples with language = english
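The last step above is a plain row filter on the generated attribute. As a sketch of the same logic outside RapidMiner, assuming the rows already carry a "language" value like the one the operator produces (the function name `filter_by_language` and the sample rows are made up for illustration):

```python
def filter_by_language(rows, language="english"):
    """Keep only the rows whose 'language' value matches (case-insensitive)."""
    return [
        row for row in rows
        if row.get("language", "").lower() == language.lower()
    ]


# Hypothetical rows; in practice "language" would come from the detection step.
news = [
    {"stock": "TWTR", "body": "Twitter shares rise", "language": "english"},
    {"stock": "TWTR", "body": "Le azioni Twitter salgono", "language": "italian"},
    {"stock": "TWTR", "body": "Twitter warns on revenue", "language": "English"},
]
english_news = filter_by_language(news)
print(len(english_news))
```

Rows with a missing "language" value are dropped here; whether that is the right choice depends on how much of the dataset goes undetected.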

Regards,

Lionel

ElenaVet · May 2020

@lionelderkrikor
Unfortunately, Filter Examples doesn't recognize the language attribute. Do I need a Multiply operator? Or do I have to write a new CSV with the new attribute and then work on that?
Thanks

lionelderkrikor · May 2020

@ElenaVet,

If the name of the language attribute does not appear in the list, you have to type it in manually:

Image: https://us.v-cdn.net/6030995/uploads/editor/90/ibc7xpal7tb6.png

Regards,

Lionel
