
  • Finding autocorrelation for individuals in panel data

    I have panel data and I want to find autocorrelation of variable x for each individual, so I do
    Code:
    gen rho = .                              // rho must exist before -replace- can fill it
    levelsof id, local(levels)
    foreach i of local levels {
        corr L.x x if id == `i'              // correlation of x with its own first lag
        replace rho = r(rho) if id == `i'
    }
    This loop runs very slowly (I have a lot of different values of id). By contrast, when I compute the standard deviation of x for each id with
    Code:
    bysort id: egen std = sd(x)
    it is calculated very quickly. It seems like finding autocorrelation is not much more computationally difficult. Is there any way to do it faster?

    Thanks.
    Last edited by Vasisualiy Lokhankin; 12 Jan 2017, 02:00.

  • #2
    There are several procedures that allow statistics by groups - look at statsby and rolling. They may be more efficient than your loop.
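
A minimal sketch of what the -statsby- route might look like for the loop in #1. It assumes a numeric id, a time variable called time (a hypothetical name), and that Lx, rho and the file rho_by_id are free to use. Whether it is actually faster than the loop would need to be timed, since -statsby- still runs the command once per group.

```stata
* Sketch only: variable names time, Lx and the file rho_by_id are assumptions
xtset id time
gen double Lx = L.x                  // first lag of x, computed within each id

* One correlation per id, written to a separate file; groups where
* -correlate- fails are recorded with a missing rho
statsby rho = r(rho), by(id) saving(rho_by_id, replace): correlate Lx x

* Attach the per-id result to every observation of that id
merge m:1 id using rho_by_id, nogenerate
```

With saving(), the original data should stay in memory, so the merge brings rho back as a variable that is constant within id.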

    • #3
      The problem stems from the fact that Stata will always use all data internally even if you specify an if-condition. I don't know the underlying technical reason, but it is definitely true that commands that use the exact same data (based on the if-condition) take longer in larger datasets (i.e. if you add a bunch of data that does not meet the if-condition). As a result, each iteration of your loop will take super long even if it only needs a tiny section of the data.

      Before we talk about potential solutions, I want to ask why you need this information. Panel serial correlation tests (e.g. -xtqptest-) are more powerful than the combination of time series tests. Moreover, you have to be careful about the multiple hypothesis testing issue when doing many individual tests (if you take p = 0.05 as cut-off, you'd expect 5% of your correlations to be significantly different from zero even if they are all zero).

      • #4
        "I don't know the underlying technical reason, but it is definitely true that commands that use the exact same data (based on the if-condition) take longer in larger datasets (i.e. if you add a bunch of data that does not meet the if-condition)."

        The technical reason is that when an -if- condition is applied, Stata must run through the data set and check each observation to determine which ones fulfill the -if- condition.

        When this slowdown becomes a serious problem, one can work around it by first sorting the data so that all the observations satisfying the -if- condition form a block of consecutive observations, and then applying the corresponding -in- condition. Stata does not need to look at each observation to decide whether it meets an -in- condition, so the process goes more quickly. The time and effort spent coding this workaround, depending on the particular commands, the complexity of the condition, and other aspects of the data, can be appreciable. It is also an opportunity to introduce errors. So I generally recommend patience instead. Nevertheless, some processes really do take unreasonably long when done using -if-, and the benefits of working around it with -in- can outweigh the downsides.
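
Applied to the loop in #1, the workaround could be sketched roughly as follows. This assumes the data are already -xtset- with a numeric id and a time variable called time (a hypothetical name), and that the helper variables Lx, rho, obsno and stop do not exist yet.

```stata
* Sketch only: assumes -xtset id time- has been run; helper names are made up
sort id time
gen double Lx  = L.x                 // lag taken before switching to -in- ranges
gen double rho = .
gen long obsno = _n                  // absolute observation number
by id: gen long stop = obsno[_N]     // last observation number of each id's block

local i = 1
while `i' <= _N {
    local j = stop[`i']              // block for this id runs from i to j
    capture quietly correlate Lx x in `i'/`j'
    if _rc == 0 {
        quietly replace rho = r(rho) in `i'/`j'
    }
    local i = `j' + 1                // jump to the first observation of the next id
}
drop obsno stop
```

The lag is stored in Lx up front so that each -correlate- call only touches its own block. -capture- lets the loop skip ids with too few usable observations, leaving rho missing for them.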

        • #5
          Code:
          . set seed 1
          . clear all
          .
          . *** Calculations on full sample
          . set obs 100000000
          . gen y = 2
          . gen x = 1 if y >= 2
          
          . timeit 1: sum x if y >= 2
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x |100,000,000           1           0          1          1
          
          . timeit 2: sum x in 1/l
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x |100,000,000           1           0          1          1
          
          . timeit 3: sum x
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x |100,000,000           1           0          1          1
          
          . timer list
             1:      5.69 /        1 =       5.6940
             2:      2.01 /        1 =       2.0100
             3:      2.00 /        1 =       1.9970
          .
          . *** Calculations on half the sample
          . drop x y
          . set obs 100000000
          . gen y = rnormal()
          . gen x = 1 if y < 0
          .
          . timeit 11: sum x if y < 0
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x | 49,999,755           1           0          1          1
          
          . timeit 98: sort y
          
          . timeit 12: sum x in 1/`=_N/2'
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x | 49,999,755           1           0          1          1
          .
          . timeit 99: drop if y > 0
          (50,000,245 observations deleted)
          
          . timeit 13: sum x
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     x | 49,999,755           1           0          1          1
          
          . timer list
            11:      5.27 /        1 =       5.2680
            12:      1.02 /        1 =       1.0200
            13:      1.02 /        1 =       1.0170
            98:    225.64 /        1 =     225.6420
            99:      5.78 /        1 =       5.7840
          You appear to be correct. The sum command takes just as long with the in condition as in a reduced sample, but significantly longer when there's an if condition instead.
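
The log above uses -timeit-, which is not an official command. A smaller sketch of the same comparison using only the built-in -timer- commands might look like this; the sample size is arbitrary, and no timings are shown because they depend on the machine.

```stata
* Sketch only: compares -if- with the equivalent -in- range on sorted data
clear all
set seed 1
set obs 1000000
gen y = rnormal()
gen x = 1 if y < 0
sort y                               // observations with y < 0 now come first
quietly count if y < 0
local n = r(N)                       // size of the block that satisfies the condition

timer clear
timer on 1
quietly summarize x if y < 0         // every observation is checked against the condition
timer off 1
timer on 2
quietly summarize x in 1/`n'         // only the first n observations are read
timer off 2
timer list
```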

          • #6
            Dear all,

            I have tried this code, but something seems to be missing. The last line (replace rho = r(rho) if id == `i') displays output but does not save the result to the variable rho:
            88 real changes made, 88 to missing.


                                                     -1      0       1  -1      0       1
            LAG       AC      PAC        Q   Prob>Q  [Autocorrelation]  [Partial Autocor]
            -----------------------------------------------------------------------------
              1   0.8150   0.8388   60.474   0.0000          |------            |------
              2   0.7669   0.3910   114.63   0.0000          |------            |---
              3   0.6960   0.0915   159.77   0.0000          |-----             |
              4   0.6347  -0.0405   197.75   0.0000          |-----             |
              5   0.5877  -0.0081   230.71   0.0000          |----              |
              6   0.5522  -0.0150   260.16   0.0000          |----              |
              7   0.5049  -0.0460   285.08   0.0000          |----              |
              8   0.5052   0.1054   310.35   0.0000          |----              |
              9   0.5163   0.1840   337.07   0.0000          |----              |-
             10   0.4980   0.0835   362.26   0.0000          |---               |
             11   0.4678  -0.0607   384.77   0.0000          |---               |
             12   0.4576   0.0501   406.59   0.0000          |---               |
             13   0.4281  -0.0538   425.95   0.0000          |---               |
             14   0.3573  -0.3203   439.61   0.0000          |--              --|
             15   0.3561   0.0862   453.37   0.0000          |--                |
             16   0.2690  -0.1881   461.32   0.0000          |--               -|
             17   0.2158  -0.1628   466.52   0.0000          |-                -|
             18   0.1727  -0.1656   469.89   0.0000          |-                -|
             19   0.1246  -0.1406   471.67   0.0000          |                 -|
             20   0.0685  -0.1845   472.22   0.0000          |                 -|
             21   0.0926   0.2368   473.23   0.0000          |                  |-
             22   0.0367  -0.0675    473.4   0.0000          |                  |
             23   0.0072  -0.0491    473.4   0.0000          |                  |
             24  -0.0049   0.0403   473.41   0.0000          |                  |
             25  -0.0467  -0.0701   473.68   0.0000          |                  |
             26  -0.0908   0.0348   474.73   0.0000          |                  |
             27  -0.1471  -0.1941   477.54   0.0000         -|                 -|
             28  -0.1881   0.0398   482.21   0.0000         -|                  |
             29  -0.2305   0.1325   489.34   0.0000         -|                  |-
             30  -0.2685  -0.1290   499.19   0.0000        --|                 -|
             31  -0.2750   0.2169    509.7   0.0000        --|                  |-
             32  -0.3306  -0.3822   525.15   0.0000        --|               ---|
             33  -0.3376  -0.1334   541.56   0.0000        --|                 -|
             34  -0.3404  -0.2852   558.55   0.0000        --|                --|
             35  -0.3437   0.2702    576.2   0.0000        --|                  |--
             36  -0.3431  -0.0479   594.13   0.0000        --|                  |
             37  -0.3636   0.0655   614.65   0.0000        --|                  |
             38  -0.3877   0.0343   638.46   0.0000       ---|                  |
             39  -0.3894  -0.3165   662.97   0.0000       ---|                --|
             40  -0.3999  -0.0235   689.36   0.0000       ---|                  |
            (88 real changes made, 88 to missing)
            (note: time series has 5 gaps)


            At the end of the run, Stata stops with the error: insufficient observations r(2001).

            Do you have any idea how I can solve these two problems?

            Best regards, Farid
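
One possible reading of the output above, offered as an assumption: the table looks like -corrgram- output rather than -correlate- output. -corrgram- does not return r(rho), which would explain why every replace sets rho to missing; it returns the autocorrelations as r(ac1), r(ac2), and so on. A sketch under that assumption, with a numeric id and data already -xtset-:

```stata
* Sketch only: assumes -corrgram- is the command inside the loop
gen double rho = .                   // drop any existing rho first
levelsof id, local(levels)
foreach i of local levels {
    * -capture- lets the loop continue past ids with too few
    * observations instead of stopping with r(2001)
    capture quietly corrgram x if id == `i', lags(1)
    if _rc == 0 {
        quietly replace rho = r(ac1) if id == `i'
    }
}
```

Note that the lag-1 autocorrelation reported by -corrgram- is not computed the same way as the correlation between L.x and x from -correlate-, so the two numbers will generally differ somewhat.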
