Suggestions for this project

ElenaVet Member Posts: 9 Learner I
edited May 20 in Help
Good morning everyone,
I have a dataset composed as follows:
- ts: the date on which the news was published;
- body: the text of the news;
- stock: ticker of the stock to which the news refers (e.g. TWTR: Twitter);
- positive: integer >= 0. A count of the "positive" words, from a financial point of view, found in the news;
- negative: integer >= 0. A count of the "negative" words, from a financial point of view, found in the news.
In particular I have to carry out:
1) Exploratory data analysis
2) Data analysis techniques, namely:
◼ Association rules
◼ Clustering = Perform multiple analysis sessions with one or more algorithms (e.g., KMeans, DBSCAN) and evaluate the various quality indexes (e.g., SSE).
Do you have any suggestions on where to start and how I should proceed?
Thanks so much!!!

Answers

  • ElenaVet Member Posts: 9 Learner I
    Thank you, @lionelderkrikor
    Your answer is inspirational! Do you think, however, that pre-processing of the textual data is necessary? How would you suggest getting started with it?
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,073   Unicorn
    Hi @ElenaVet,

    If I understand correctly, your "body" attribute is a text attribute, so yes, you have to pre-process it
    by tokenizing, etc. inside a Process Documents subprocess to create a "word vector".
    To learn this pre-processing step, you can watch the videos on the RapidMiner Academy by searching for "text mining", or
    you can search for resources directly inside RapidMiner Studio with the top-right search box, as you did for "clustering" and "association rules".

    Regards,

    Lionel 
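
For readers who want to see what a "word vector" looks like outside RapidMiner, here is a minimal Python sketch of the idea described above: tokenize each text, then count how often each term occurs. The helper names (`tokenize`, `word_vectors`) and the two sample headlines are made up for illustration. This is not how the RapidMiner operators are implemented, and it skips steps such as stopword removal and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into purely alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_vectors(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical sample data standing in for the "body" attribute.
news = [
    "Twitter shares rise after strong earnings",
    "Twitter shares fall as earnings disappoint",
]
vocabulary, vectors = word_vectors(news)
for term, column in zip(vocabulary, zip(*vectors)):
    print(term, column)
```

Each row of `vectors` corresponds to one news item and each column to one term, which is the shape that clustering or association-rule algorithms expect as input.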
     
  • ElenaVet Member Posts: 9 Learner I
    @lionelderkrikor
    Thanks a lot! I also noticed that some items aren't in English (for example German, Italian, Spanish and others). How can I select only the English news?
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,073   Unicorn
    Hi @ElenaVet,

    You can use the "Text Vectorization" operator:
     - select your text attribute (in your case "body", if I understand correctly)
     - enable "add language" in the parameters of this operator
     - the operator will generate an attribute called "language" whose values depend on the language of each news item: english, italian, spanish, etc.
     - then use a Filter Examples operator to keep only the examples with language = english

    Regards,

    Lionel
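
The same two-step idea (add a "language" attribute, then filter on it) can be sketched in Python. The stopword-overlap heuristic below is an assumption chosen only to keep the example self-contained. It is not the method the Text Vectorization operator uses, and the tiny stopword lists, the `guess_language` helper, and the sample rows are all hypothetical.

```python
import re

# Deliberately tiny, illustrative stopword lists (an assumption, not a real resource).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is"},
    "italian": {"il", "la", "di", "che", "e", "per"},
    "spanish": {"el", "los", "de", "que", "y", "para"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or "unknown"."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Hypothetical rows mirroring the "body" and "stock" attributes.
rows = [
    {"stock": "TWTR", "body": "The shares of Twitter rose in the morning"},
    {"stock": "ENI", "body": "Le azioni di Eni sono salite per la seconda volta"},
    {"stock": "SAP", "body": "Die Aktie ist nicht gestiegen"},
]

# Step 1: generate the "language" attribute. Step 2: filter on it.
for row in rows:
    row["language"] = guess_language(row["body"])
english_news = [row for row in rows if row["language"] == "english"]
print(english_news)
```

For real data a dedicated language-detection library would be more reliable than this heuristic, but the filter step stays the same.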
  • ElenaVet Member Posts: 9 Learner I
    @lionelderkrikor
    Unfortunately, Filter Examples doesn't recognize the language attribute. Is a Multiply operator necessary? Or do I have to write a new CSV with the new attribute and then work on that?
    Thanks
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,073   Unicorn
    @ElenaVet,

    If the name of the language attribute does not appear in the Filter Examples parameters, you have to enter it manually.



    Regards,

    Lionel