Is what I'd like to do even possible with RapidMiner?

Hal_Segal · August 2021

I'm new to machine learning, and I wonder if what I'd like to do is even possible with RapidMinder. I'd welcome suggestions!
1. My data set will be about 30 years of monthly economic data for Canada and will contain about 30 variables for each time period -- things like gross national product, size of the workforce, the money supply, the interest rate, etc.
2. It would be an unsupervised task since we want the algorithm to determine the relationship between the variables.
3. I understand that RNN with LSTM is the state-of-the-art for this type of problem.
4. After the model is up and running, I'd like to test different values of certain independent variables for future time periods. For example, if the government sets the interest rate at x% and sets public spending at $y for the coming 12 months, we want the model to predict all the other variables.
5. How would I proceed to set up this type of task? Are there any particular techniques I should know about to accomplish this?
Thanks for your suggestions!

rfuentealba · August 2021

Hello @Hal_Segal

Indeed this is possible to do, however 30 years of economic data might seem to be a lot.

To specify techniques and all that stuff, I would need to see part of the data (I don't care about the data itself but about the shape: what's your predictive label, what kind of categorical, numerical or date data you have, etc...). Is it feasible?

All the best,

Rod.

Hal_Segal · August 2021

Rod,
Thanks for your reply! Let me describe the data set for Canada, and if you'd like to see the actual data, please send me your email address.
Description of the excel spreadsheet I'm preparing: There are currently 293 rows, each for monthly economic data starting in Jan 1997 and going until May 2021.
Below are the titles of the 33 columns.
I look forward to your thoughts!
Thanks,
Hal Segal

Period column - starts in Jan 1997. This is the first period available from statcan.
GDP - Gross Domestic Product. Table number 36100434 (3790031). This is in 2012 constant dollars, seasonally adjusted, for the entire Canadian economy. All statistics are in Canadian dollars.
Number of people employed
Number of people with part-time employment
Number of people unemployed
Canada CPI - Consumer Price Index
US CPI
Canada PPI - Producer Price Index
US PPI
Canada Consumer Confidence
US Consumer Confidence
Canada Business Climate Indicator
US Business Confidence
Canada Stock Market
US Stock Market
Retail Sales
Consumer Spending
Producer Spending
Personal Savings
Consumer Credit
Consumer Disposible Income
Households Debt to GDP
Households Debt to Income
Building Permits
Imports
Exports
Canada Inflation Rate
US Inflation Rate
Government Spending
Leading Economic Index
QE bond purchases
Foreign Direct Investment
Tourist Arrivals

rfuentealba · August 2021

Hello @Hal_Segal

1. My data set will be about 30 years of monthly economic data for Canada and will contain about 30 variables for each time period -- things like gross national product, size of the workforce, the money supply, the interest rate, etc.

According to what you sent, almost all of it is numeric. Rather than using unsupervised tasks, you are trying to look for correlations between several columns.

2. It would be an unsupervised task since we want the algorithm to determine the relationship between the variables.

You'll need to run many of them, many times, to adjust the parameters. I'd start with X-Means to have an idea on how it behaves, and attach a decision tree that serves the purpose of explaining why the cluster composition is like that. I'd begin with this.

3. I understand that RNN with LSTM is the state-of-the-art for this type of problem.

I don't think so, recurrent neural networks are more for processing video, audio or time compositions. If you are planning to do timeframe based data, there is a whole lot of other things in RapidMiner that can help you: time series.

BTW, neural networks are supervised.

4. After the model is up and running, I'd like to test different values of certain independent variables for future time periods. For example, if the government sets the interest rate at x% and sets public spending at $y for the coming 12 months, we want the model to predict all the other variables.

When you want to set a number and get a correlated number, it's normally a regression problem. Remember when I said "use a decision tree to explain why the cluster behaves like that?" The decision tree will inform you what parameters you should use in this regression.

5. How would I proceed to set up this type of task? Are there any particular techniques I should know about to accomplish this?

This is how I should proceed:

Create a new repository to structure your study. I'll give a specific one to you, but feel free to ignore it and do yours.
Import your raw data into RapidMiner Studio.
On one process, perform the basic checks: mean, median, average, standard deviations.
On another process, perform a correlation matrix, so you'll see possible influences.
Create different processes for a number of clustering algorithms that might give you patterns. Unsupervised machine learning algorithms will give you trash data if you don't adjust it properly, therefore both the correlation matrix and the basic checks will help you understand if such a clustering algorithm makes sense or not. On another note: many people tend to think that unsupervised algorithms will do the job for you, but that isn't the case: you then need to interpret those things, and a simple way to do that is a decision tree (or another classification algorithm). You will spend some time here until you get a good classification.
Use the classification to create linear regressions around your correlated attributes. A regression algorithm is the inference of a mathematical function that describes the correlation between two variables. Since you want to use any parameter, that means you'll need at least one linear regression per parameter you want to evaluate, which is rather unpleasant, but to my knowledge there aren't ways that will help you predict two, three or four columns from the other 29.
If you want to adventure yourself, you can change the linear regression algorithm to a neural network, a deep learning extension or something on that note.

Now, you'll need a lot of creativity on each step, but tbh it's not the job of the software to give you the answers you are looking for, it is to drive you to get those answers based in the results the software gave you.

Thanks for your suggestions!

Sorry for the late reply.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Is what I'd like to do even possible with RapidMiner?

Answers