  • Principal Component Analysis with 3000 variables

    Hi

    My dataset consists of daily Google search volume indices (SVI) for around 3000 firms from 2004 to 2017, so each firm has a time series of around 5000 daily SVI observations. I want to check whether SVI is correlated across these 3000 firms, so I plan to run a principal component analysis (PCA) to look for any commonality in SVI.

    My data is currently in the form of a panel where firm id is the panel variable and daily date is the time variable. Thus, there are around 3000 panels with approximately 5000 daily observations each.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(firm_id date) byte svi
    1 16071  3
    1 16072 62
    1 16073 53
    1 16074 54
    1 16075 65
    1 16076 78
    1 16077 88
    1 16078 76
    1 16079 73
    1 16080 48
    1 16081 56
    1 16082 86
    1 16083 79
    1 16084 85
    1 16085 95
    1 16086 73
    1 16087 69
    1 16088 67
    1 16089 78
    1 16090 89
    end
    format %td date
    label var firm_id "Firm Identifier" 
    label var date "Daily Date" 
    label var svi "Google SVI"

    I know that I have to reshape the data so that each firm becomes a separate variable. By doing this, I will have 3000 variables (one for each firm's SVI observations), plus the daily date variable, and the panel variable (i.e. Firm Identifier) will no longer exist.
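
    The reshape described above could be sketched as follows. This is a minimal illustration on toy data, not the real dataset: the second firm and all values are made up, and it assumes firm_id and date together uniquely identify each observation.

```stata
* Toy long-format panel: two hypothetical firms, three days each
clear
input float(firm_id date) byte svi
1 16071  3
1 16072 62
1 16073 53
2 16071 10
2 16072 20
2 16073 30
end
format %td date

* Check that firm_id and date jointly identify observations
isid firm_id date

* Long -> wide: one row per date, one svi# variable per firm
reshape wide svi, i(date) j(firm_id)

describe, short
list
```

    After the reshape, the data hold one row per date and the variables date, svi1 and svi2. With the real data the same command would create roughly 3000 svi variables, which is within the 32,767-variable maximum of Stata/SE (the limit can be raised with set maxvar if needed).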

    My question is whether Stata can handle a PCA of 3000 variables. I am using Stata/SE 12.1 on a machine with 8 GB of RAM. How can such a large covariance matrix be handled?

  • #2
    Welcome to Statalist.

    This is a question you can answer yourself with experimentation.

    The typical advice with problems like this is to start small and work your way up. Reshape your data and try pca with 100 of the variables, then with 200, and so on, seeing what happens. On my machine, pca on 5000 observations of 800 variables took about a minute. I estimate 3200 variables would take about 20 minutes.
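
    A rough sketch of that start-small experiment, using simulated data in place of the reshaped SVI variables. The variable names, sample size, subset sizes and the choice of components(5) are all assumptions for illustration; substitute your own svi variables and larger steps.

```stata
* Simulated stand-in for the wide data (not real SVI values)
clear all
set seed 12345
set obs 500
generate double common = rnormal()
forvalues j = 1/40 {
    * each series shares a common factor plus idiosyncratic noise
    generate double svi`j' = common + rnormal()
}

* Time pca on progressively larger sets of variables
foreach k in 10 20 40 {
    timer clear 1
    timer on 1
    quietly pca svi1-svi`k', components(5)
    timer off 1
    quietly timer list 1
    display "k = `k' variables: " %6.2f r(t1) " seconds"
}
```

    Plotting or tabulating the timings as k grows gives a basis for extrapolating to the full set of variables. On older releases such as Stata 12, a run with thousands of variables may also need matsize raised above the number of variables first (for example, set matsize 3100), since the correlation matrix is stored as a Stata matrix. A 3000 x 3000 matrix of doubles occupies about 72 MB, which is small relative to 8 GB of RAM.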
