  • How to handle 3-dimensional ("panel") data?

    Hi,

    I have a dataset of individual daily investor trading data. In total, there are about 1 million observations containing about 40,000 distinct investors with on average 25 trades each.
    The data is 3-dimensional in the sense that there is a time variable date (in days), an investorID variable and a stockID variable.

    Let's say I would like to investigate the effect of some exogenous day- and stock-specific signal (like an analyst forecast or a news announcement on that particular stock) on the volume traded by each investor per day per stock.

    Example of the data for 2 investorIDs:


    Code:
    clear
    input float date int stockID double(investorID volume) float signal
    17591 128 1   13 0
    17591 449 1   80 0
    17885  61 1   80 0
    17885 686 1   60 1
    17896 449 1  350 0
    17896 752 1   80 0
    18155 743 1  250 0
    18851 760 1 1000 1
    16502 775 2   50 0
    16628 698 2   50 0
    17021 625 2   13 0
    17021 625 2   37 0
    17554 775 2  100 0
    17793 585 2   50 0
    17793 752 2   50 0
    17805 752 2   50 0
    17815  61 2   50 0
    17815 585 2  100 1
    17815 585 2  100 1
    17821  75 2   50 0
    17821 591 2   50 0
    17821 752 2  100 0
    18522  61 2   50 0
    18913  61 2   50 0
    18913 760 2  200 0
    end
    format %td date


    I tried the following two approaches:

    1.) I collapsed the data by summing the trading volume per day per stock ID. This eliminates the investor ID dimension (as all the volume on one day in one stock is aggregated) and leaves panel data that I can group by stock ID over time.
    This gives me about 500 stock ID groups. If I run
    Code:
    xtset stockID date
    xtreg volume signal CONTROLS, cluster(stockID) fe
    the expected effect of signal on volume does not show up:

    Code:
    Fixed-effects (within) regression               Number of obs      =    753191
    Group variable: stockID                       Number of groups   =       451
    
    R-sq:  within  = 0.0615                         Obs per group: min =        10
           between = 0.3930                                        avg =    1479.0
           overall = 0.1741                                        max =      1878
    
                                                    F(120,489)         =      7.86
    corr(u_i, Xb)  = -0.0281                        Prob > F           =    0.0000
    
                                    (Std. Err. adjusted for 451 clusters in stockID)
    --------------------------------------------------------------------------------
                   |               Robust
            volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
            signal |   -2.11476    2.80901    -0.75   0.452    -7.620337    3.390816
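
    The collapse step of approach 1 can be sketched roughly as follows, using a few rows from the example data above. This is only an illustration: it assumes that signal is constant within a stock-day, so that taking its maximum preserves it, and it leaves out the CONTROLS.

    ```stata
    * A few rows from the example data (two investors trading on the same days)
    clear
    input float date int stockID double(investorID volume) float signal
    17815  61 2  50 0
    17815 585 2 100 1
    17815 585 2 100 1
    17896 449 1 350 0
    17896 752 1  80 0
    end
    format %td date

    * Sum volume over all trades per stock and day.
    * Assumption: signal does not vary within a stock-day, so max() keeps it as is.
    collapse (sum) volume (max) signal, by(stockID date)

    * Each stock-day now appears once, so the panel structure can be declared
    xtset stockID date
    list, sepby(stockID)
    ```

    After the collapse the investorID variable is gone, which is exactly the loss of the investor dimension described above.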


    2.) Actually, I do not want to sum volume across investors. Therefore, I tried to handle the 3 dimensions by collapsing the data by date, investorID and stockID, so that the resulting dataset contains the summed volume per investor, per stock, per day. (Some investors trade a specific stock multiple times per day, which is why I had to do this.)
    Then I run
    Code:
    egen grouping = group(investorID stockID)
    xtset grouping date
    xtreg volume signal CONTROLS, cluster(grouping) fe
    In this case I get the following, with a lower R-squared and, of course, a very high number of groups relative to the number of observations:

    Code:
    Fixed-effects (within) regression               Number of obs      =    854643
    Group variable: grouping                        Number of groups   =    367014
    
    R-sq:  within  = 0.0222                         Obs per group: min =         1
           between = 0.0368                                        avg =       2.3
           overall = 0.0344                                        max =       257
    
                                                    F(117,368003)      =     33.31
    corr(u_i, Xb)  = -0.2855                        Prob > F           =    0.0000
    
                                (Std. Err. adjusted for 367014 clusters in grouping)
    --------------------------------------------------------------------------------
                   |               Robust
            volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
            signal |   15.64934   3.922797     3.99   0.000     7.960774    23.33791
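
    The data preparation for approach 2 can be sketched like this, again on a few rows from the example data. It is a minimal sketch that assumes signal is constant within an investor-stock-day; the variable n_obs is added here only for illustration.

    ```stata
    * A few rows from the example data, including repeated trades on one day
    clear
    input float date int stockID double(investorID volume) float signal
    17021 625 2  13 0
    17021 625 2  37 0
    17815  61 2  50 0
    17815 585 2 100 1
    17815 585 2 100 1
    17885 686 1  60 1
    end
    format %td date

    * Several trades in the same stock on the same day become one row
    collapse (sum) volume (max) signal, by(investorID stockID date)

    * One panel unit per investor-stock pair
    egen grouping = group(investorID stockID)
    xtset grouping date

    * Count observations per panel unit (n_obs is an illustrative helper)
    bysort grouping: gen n_obs = _N
    tabulate n_obs
    ```

    A tabulation like this shows how many investor-stock pairs appear only once. Such pairs have no within-group variation, so they do not help identify the coefficient on signal in a fixed-effects regression.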

    I am no expert on panel data regressions. Is it "common"/acceptable to have such a high number of groups in panel data? Is there a better approach that copes with my issue?

    Any comments are very welcome. Thank you!
    Last edited by Rolf Miller; 20 Feb 2019, 05:56.

  • #2
    Rolf:
    perhaps a different approach would be to reduce your dimensions from 3 to 2 by classifying stocks into industries via a categorical variable and using it as a predictor:
    Code:
    xtset investors date
    xtreg volume signal CONTROLS i.stock, cluster(investors) fe
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Originally posted by Carlo Lazzaro
      Rolf:
      perhaps a different approach would be to reduce your dimensions from 3 to 2 by classifying stocks into industries via a categorical variable and using it as a predictor:
      Code:
      xtset investors date
      xtreg volume signal CONTROLS i.stock, cluster(investors) fe
      Dear Carlo,
      Thanks for your input.
      I think this approach is difficult as
      Code:
      xtset investorID date
      would require that I aggregate positions across stockIDs, which is not possible because the signal variable is stock-specific.



      • #4
        Rolf:
        I see the issue.
        Second try: can't you group stocks that are similar in some respect and create a categorical variable to be included as a predictor on the right-hand side of your regression equation?
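
        A minimal sketch of this idea, under loudly stated assumptions: the variable stockgroup and the cutoff of 700 are purely hypothetical stand-ins for a real classification such as industry, the rows are taken from the example data in #1, and CONTROLS are omitted. Only the panel variable is declared, so that repeated investor-days across different stocks do not cause a problem.

        ```stata
        * Rows from the example data in #1 (no repeated investor-stock-days here)
        clear
        input float date int stockID double(investorID volume) float signal
        17591 128 1   13 0
        17591 449 1   80 0
        17885  61 1   80 0
        17885 686 1   60 1
        17896 449 1  350 0
        17896 752 1   80 0
        18155 743 1  250 0
        18851 760 1 1000 1
        17793 585 2   50 0
        17793 752 2   50 0
        17815  61 2   50 0
        17815 585 2  100 1
        17821 752 2  100 0
        18913 760 2  200 0
        end
        format %td date

        * Hypothetical classification: the cutoff 700 is arbitrary, for illustration only
        gen byte stockgroup = cond(stockID < 700, 1, 2)

        * Declare only the panel variable; no time variable is needed for xtreg, fe
        xtset investorID

        * Investor fixed effects with the stock group as a categorical predictor.
        * On the full data one would likely add vce(cluster investorID);
        * it is left out here because the toy data has only two investors.
        xtreg volume signal i.stockgroup, fe
        ```

        With a real classification in place of stockgroup, signal stays stock-specific while the fixed effects are at the investor level.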
        Kind regards,
        Carlo
        (Stata 19.0)
