Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • dropping values of a variable conditional on the consecutive values of other variables

    Hi,

    I am dealing with a dataset that has more than a million observations and lots of variables which I am currently cleaning. The period is 2000-2010. A snapshot of some selected variables are below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int B_sign1 float(kontor_handel year Fstat)
     2212  101 2000  1.625441
    19921 1715 2000  5.087852
    23318 1832 2000  .9362963
    16416  610 2000 196.11215
    15961  102 2000 1.0428013
    20233 1411 2000 128.31055
    19162  119 2000 1.1356107
     2091 1202 2000  86.18385
    20900 1705 2000  75.96806
    22704  910 2000 1.8002303
    20221 1705 2000  75.96806
     7327 2530 2000  201.6177
     9458 1925 2000  85.46902
     1224 1275 2000  30.58243
     3984 2120 2000  132.3299
    18390 1274 2000 2.7553794
      493 1715 2000  5.087852
      548  530 2000 124.24845
    14381 1419 2000 31.884056
    11070 2434 2000 1.0798155
    end
    I have to drop the kontor_handel values for which I do not have Fstat>20 for 3 consecutive years during 2000-2010. For example, if kontor_handel==101 have Fstat>20 for 2008 and 2009 but Fstat<20 for rest of the years, I will drop kontor_handel==101 from the dataset. I am a bit confuse how to do it in a loop. Any help would be appreciated. Here kontor_handel is an office identifier and there are more than 6000 unique kontor_handel.

    Best,
    Zariab

  • #2
    Your example data do not apply here because there is only 2000 in it. I'm creating a fake one as an example instead:
    Code:
    set seed 1712250
    clear
    input kontor_handel
    101
    1715
    end
    expand 11
    bysort kontor_handel: gen year = _n + 1999
    gen Fstat = rpoisson(16)
    Install a package called "rangestat"
    Code:
    ssc install rangestat
    Then, create a binary indicator if Fstat > 20, and then use rangestat to identify which case, and which two following cases:
    Code:
    gen ge20 = (Fstat >= 20) if !missing(Fstat)
    rangestat (max) ge20, interval(year -2 0) by(kontor_handel)
    Results:
    Code:
         +-------------------------------------------+
         | kontor~l   year   Fstat   ge20   ge20_max |
         |-------------------------------------------|
      1. |      101   2000      14      0          0 |
      2. |      101   2001      21      1          1 |
      3. |      101   2002      16      0          1 |
      4. |      101   2003      14      0          1 |
      5. |      101   2004      10      0          0 |
      6. |      101   2005      20      1          1 |
      7. |      101   2006      13      0          1 |
      8. |      101   2007      12      0          1 |
      9. |      101   2008      15      0          0 |
     10. |      101   2009      15      0          0 |
     11. |      101   2010      15      0          0 |
         |-------------------------------------------|
     12. |     1715   2000      12      0          0 |
     13. |     1715   2001      16      0          0 |
     14. |     1715   2002      11      0          0 |
     15. |     1715   2003       8      0          0 |
     16. |     1715   2004      20      1          1 |
     17. |     1715   2005      23      1          1 |
     18. |     1715   2006      19      0          1 |
     19. |     1715   2007      18      0          1 |
     20. |     1715   2008      19      0          0 |
     21. |     1715   2009      13      0          0 |
     22. |     1715   2010      19      0          0 |
         +-------------------------------------------+
    Then, you may use
    Code:
    drop if ge20_max == 1
    to drop the cases.

    Comment


    • #3
      Thanks a lot. I figured it out.

      Comment

      Working...
      X