Large data set with Time Series

Toaldo · November 2020

Dear all, I am fairly new to RapidMiner. I am working with a large time-series data set covering 2000 to 2019. It has about 200,000 rows and 4 attributes (variable, region, time series, and value). Decision Tree and Forecasting with Windowing are among the approaches on my radar. Still, I am somewhat lost here: what types of analysis could I run on this kind of data set? Thanks in advance for your help! Alexsandro Toaldo

Toaldo · November 2020

Hi Martin,
Thanks for your prompt response.
That is a great question, and I am not sure yet.
As background, I am working with public information about our city (Sao Paulo), which contains about 200,000 records with 4 different attributes. Since this is a time-series data set, I am not sure where to start or what type of analysis I can do. The attached file is a sample of the data set.

Image: https://us.v-cdn.net/6030995/uploads/editor/yq/idavlblmp8uh.jpg

MartinLiebig · November 2020

Hi,

First, you likely want to Pivot this whole table to get something like:

Date, Region, Value Of Taxa de Universalizacão, Value Of ... , Value of ...

That wide layout is closer to the data set you actually want to analyze.
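As a rough illustration of what such a pivot does, here is a minimal plain-Python sketch. It is not the RapidMiner Pivot operator itself; the row layout `(date, region, variable, value)`, the names `rate_x` and `rate_y`, and all numbers are invented for the example.

```python
from collections import defaultdict

# Hypothetical long-format rows: (date, region, variable, value).
# Names and numbers are invented for illustration only.
long_rows = [
    (2000, "A", "rate_x", 1.0),
    (2000, "A", "rate_y", 10.0),
    (2000, "B", "rate_x", 2.0),
    (2000, "B", "rate_y", 20.0),
    (2001, "A", "rate_x", 1.5),
    (2001, "A", "rate_y", 11.0),
    (2001, "B", "rate_x", 2.5),
    (2001, "B", "rate_y", 21.0),
]


def pivot_long_to_wide(rows):
    """Group by (date, region) and spread each variable into its own column."""
    wide = defaultdict(dict)
    for date, region, variable, value in rows:
        wide[(date, region)][f"value_of_{variable}"] = value
    # One output record per (date, region) pair, sorted by that pair.
    # A variable that never occurs for a pair is simply absent from its record.
    return [
        {"date": date, "region": region, **columns}
        for (date, region), columns in sorted(wide.items(), key=lambda kv: kv[0])
    ]


wide_rows = pivot_long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each output record then holds one date and region with one column per variable, which is the shape most forecasting and modeling steps expect.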

In German we have a saying: to saddle the horse from the wrong side. That is somewhat what you are doing here. Usually you start with a problem and formulate a question that you want the data to answer. You are going about it the other way around, which is tough.

Besides forecasting, a general thing to do with this data may be outlier detection: are there values that are unexpected, and why? Maybe this helps.
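One simple way to flag unexpected values is a modified z-score based on the median and the median absolute deviation (MAD). The sketch below is only one possible technique, not necessarily what a RapidMiner outlier operator would do. The sample series is made up, the threshold of 3.5 is a commonly used default, and the check ignores trend and seasonality, so it would need to be applied per region and per variable, ideally on detrended values.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [
        i
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical yearly values for one region and one variable.
series = [10.1, 10.4, 9.8, 10.0, 10.3, 25.0, 10.2, 9.9]
print(mad_outliers(series))  # index 5 (the value 25.0) is flagged
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value distorts them far less.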

Cheers,

Martin

MartinLiebig · November 2020

Hi,

What is your business problem?

Best,

Martin

Toaldo · November 2020

Dear Martin/RapidMiner team:

The attached file is a template containing public data from our city.

Image: https://us.v-cdn.net/6030995/uploads/editor/d6/61wyib21d61u.jpg

The first column, "district", has approximately 2,243 records.
The time series contains data from 1996 to 2019 (~23 years).
Columns C through WE (approximately 600 different attributes) contain various information about our city (indexes, GDP, number of males, number of females, etc.). This is a very large amount of high-quality data.

My initial research approach is the following:

1) Start a pilot project on 3 neighborhoods to identify correlations and possible regressions that explain the number of industrial and commercial companies per neighborhood (large, medium, small);
2) Select 10 to 20 independent variables that explain the selected dependent variable (number of companies);
3) Build a decision tree on 10 selected neighborhoods to explain the increase in companies;
4) Cluster neighborhoods by their potential to increase the number of companies.

So, I have a couple of questions:

1) Many attributes have no values. As this is a large data set, should I leave them empty or replace them with zero?
2) Which operator or type of analysis should I start with, always considering "District" as the label? Every possible answer should be expressed in terms of district and organization size (large, medium, small).

Thanks for your attention!

Best,
A.Toaldo
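On the first question, one effect can be checked without knowing the data: replacing missing values with zero changes summary statistics, because zero is then treated as a real measurement. The sketch below uses an invented attribute, `gdp_index`, with `None` marking the gaps. It only shows the arithmetic and is not a recommendation for this particular data set.

```python
from statistics import mean

# Hypothetical attribute with gaps; None marks a missing value.
gdp_index = [100.0, None, 104.0, None, 108.0]

# Option A: leave gaps as missing and compute on observed values only.
observed = [v for v in gdp_index if v is not None]

# Option B: replace every gap with zero.
zero_filled = [0.0 if v is None else v for v in gdp_index]

mean_observed = mean(observed)
mean_zero_filled = mean(zero_filled)
print(mean_observed, mean_zero_filled)
```

Here the mean drops from 104.0 to 62.4 after zero-filling, so zero is only a safe replacement when a missing entry genuinely means "none", for example a count of companies that do not exist.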
