Is what I'd like to do even possible with RapidMiner?

Hal_Segal Member Posts: 2 Newbie
I'm new to machine learning, and I wonder if what I'd like to do is even possible with RapidMiner. I'd welcome suggestions!
1. My data set will be about 30 years of monthly economic data for Canada and will contain about 30 variables for each time period -- things like gross national product, size of the workforce, the money supply, the interest rate, etc.
2. It would be an unsupervised task since we want the algorithm to determine the relationship between the variables.
3. I understand that RNN with LSTM is the state-of-the-art for this type of problem.
4. After the model is up and running, I'd like to test different values of certain independent variables for future time periods. For example, if the government sets the interest rate at x% and sets public spending at $y for the coming 12 months, we want the model to predict all the other variables.
5.  How would I proceed to set up this type of task? Are there any particular techniques I should know about to accomplish this?
Thanks for your suggestions!

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello @Hal_Segal

    Indeed, this is possible, although 30 years of economic data might seem like a lot.

    To recommend specific techniques, I would need to see part of the data. I don't care about the values themselves but about the shape: what your predictive label is, what kinds of categorical, numerical or date attributes you have, and so on. Is that feasible?

    All the best,

    Rod.
  • Hal_Segal Member Posts: 2 Newbie
    Rod,
    Thanks for your reply! Let me describe the data set for Canada, and if you'd like to see the actual data, please send me your email address.
    Description of the Excel spreadsheet I'm preparing: there are currently 293 rows, one per month of economic data, starting in Jan 1997 and running until May 2021.
    Below are the titles of the 33 columns.
    I look forward to your thoughts!
    Thanks,
    Hal Segal

    1. Period column - starts in Jan 1997. This is the first period available from statcan.

    2. GDP - Gross Domestic Product. Table number 36100434 (3790031). This is in 2012 constant dollars, seasonally adjusted, for the entire Canadian economy. All statistics are in Canadian dollars.

    3. Number of people employed

    4. Number of people with part-time employment

    5. Number of people unemployed

    6. Canada CPI - Consumer Price Index

    7. US CPI 

    8. Canada PPI -  Producer Price Index

    9. US PPI

    10. Canada Consumer Confidence

    11. US Consumer Confidence

    12. Canada Business Climate Indicator

    13. US Business Confidence

    14. Canada Stock Market

    15. US Stock Market

    16. Retail Sales

    17. Consumer Spending

    18. Producer Spending

    19. Personal Savings

    20. Consumer Credit

    21. Consumer Disposable Income

    22. Households Debt to GDP

    23. Households Debt to Income

    24. Building Permits

    25. Imports

    26. Exports

    27. Canada Inflation Rate

    28. US Inflation Rate

    29. Government Spending

    30. Leading Economic Index

    31. QE bond purchases

    32. Foreign Direct Investment

    33. Tourist Arrivals



  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello @Hal_Segal

    1. My data set will be about 30 years of monthly economic data for Canada and will contain about 30 variables for each time period -- things like gross national product, size of the workforce, the money supply, the interest rate, etc.
    According to what you sent, almost all of it is numeric. Rather than an unsupervised task, what you are really looking for is correlations between several columns.
    2. It would be an unsupervised task since we want the algorithm to determine the relationship between the variables.
    You'll need to run several unsupervised algorithms, many times each, to tune their parameters. I'd start with X-Means to get an idea of how the data behaves, and attach a decision tree whose purpose is to explain why the clusters are composed the way they are.
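    To get a feel for what a clustering step does, here is a minimal sketch in plain Python. It is ordinary k-means with a fixed k, not X-Means (which also searches for a suitable k), and it is not RapidMiner's implementation. The function name, the toy rows, and the choice of the first k rows as starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(rows, k, iterations=20):
    """Plain k-means on a list of numeric rows; returns (centroids, labels)."""
    # Assumption: start from the first k rows so the result is deterministic.
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy, made-up rows (already on a comparable scale): two obvious groups.
rows = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9], [1.2, 0.8], [4.8, 5.0]]
centroids, labels = kmeans(rows, k=2)
print(labels)
```

    With real economic columns you would probably want to normalize first, since k-means relies on distances and a column measured in billions of dollars would otherwise dominate one measured in percent.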
    3. I understand that RNN with LSTM is the state-of-the-art for this type of problem.
    I don't think so; recurrent neural networks are more for processing video, audio or time compositions. If you are planning to work with time-based data, RapidMiner has a whole lot of other things that can help you: time series.

    BTW, neural networks are usually supervised.
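    The time series route usually starts by turning a series into windowed examples, so that past values become the inputs and a later value becomes the label. Here is a rough sketch of that idea in plain Python. The function name, the window size, and the numbers are made up for illustration, and this is not a RapidMiner operator.

```python
def make_windows(series, window_size, horizon=1):
    """Turn a series into (past values, future value) training pairs."""
    examples = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        past = series[start:start + window_size]
        # The label is the value `horizon` steps after the window ends.
        label = series[start + window_size + horizon - 1]
        examples.append((past, label))
    return examples

# Made-up monthly values, not real statistics.
gdp = [100, 101, 103, 102, 105, 107]
for past, label in make_windows(gdp, window_size=3):
    print(past, "->", label)
```

    Once the data is in this shape, any supervised learner can be trained on it, which is why the windowing step matters more than the choice of model at the start.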
    4. After the model is up and running, I'd like to test different values of certain independent variables for future time periods. For example, if the government sets the interest rate at x% and sets public spending at $y for the coming 12 months, we want the model to predict all the other variables.
    When you want to set one number and get a correlated number back, it's normally a regression problem. Remember when I said to use a decision tree to explain why the clusters behave the way they do? That decision tree will tell you which parameters you should use in this regression.
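    As a rough illustration of "set a number, get a correlated number", here is a one-variable least squares fit in plain Python. The data, the column names, and the scenario value are invented, and a real model would use more than one input; treat it as a sketch of the idea only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up numbers: an interest rate (%) and some index that moves with it.
interest_rate = [1.0, 2.0, 3.0, 4.0]
building_permits = [9.0, 7.0, 5.0, 3.0]
slope, intercept = fit_line(interest_rate, building_permits)

# "What if the rate is set to 2.5%?" (hypothetical scenario value)
scenario_rate = 2.5
prediction = slope * scenario_rate + intercept
print(prediction)
```

    Each variable you want to predict would get its own fit like this one, with the scenario values plugged in afterwards.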

    5.  How would I proceed to set up this type of task? Are there any particular techniques I should know about to accomplish this?
    This is how I would proceed:
    • Create a new repository to structure your study. I'll give a specific one to you, but feel free to ignore it and do yours.
    • Import your raw data into RapidMiner Studio.
    • In one process, run the basic checks: mean, median, and standard deviation.
    • In another process, compute a correlation matrix, so you can see possible influences.
    • Create separate processes for a number of clustering algorithms that might reveal patterns. Unsupervised machine learning algorithms will give you garbage if you don't tune them properly, so both the correlation matrix and the basic checks will help you judge whether a given clustering makes sense or not. On another note: many people tend to think that unsupervised algorithms will do the job for them, but that isn't the case. You still need to interpret the results, and a simple way to do that is a decision tree (or another classification algorithm). You will spend some time here until you get a good classification.
    • Use the classification to create linear regressions around your correlated attributes. A regression algorithm infers a mathematical function that describes the relationship between variables. Since you want to be able to vary any parameter, you'll need at least one linear regression per parameter you want to evaluate, which is rather unpleasant, but to my knowledge there is no way to predict two, three or four columns from the other 29 in one go.
    • If you feel adventurous, you can swap the linear regression algorithm for a neural network, a deep learning extension or something along those lines.
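    For readers who want to see what the basic checks and the correlation matrix compute, here is a small sketch in plain Python rather than in RapidMiner. The column names and values are invented stand-ins for the spreadsheet, and `pearson` is a helper written for this example.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up columns standing in for the real data.
columns = {
    "gdp": [100.0, 102.0, 104.0, 106.0],
    "employed": [50.0, 51.0, 52.0, 53.0],
    "unemployed": [8.0, 7.0, 6.0, 5.0],
}

# Basic checks per column: mean, median, sample standard deviation.
for name, values in columns.items():
    print(name, statistics.fmean(values), statistics.median(values),
          statistics.stdev(values))

# Correlation matrix as a nested dict: matrix[a][b] is the correlation of a with b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
for a, row in matrix.items():
    print(a, [round(r, 2) for r in row.values()])
```

    Pairs with a coefficient close to 1 or -1 are the candidates for the regression step; values near 0 suggest there is no linear relationship worth modelling.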
    Now, you'll need a lot of creativity at each step, but to be honest it's not the software's job to give you the answers you are looking for; its job is to guide you toward those answers based on the results it gives you.
    Thanks for your suggestions!

    Sorry for the late reply.
