panel data: BROWSE all year observations using a single observation criterion

Mina Wu

Join Date: Jul 2015

Posts: 79
#1

panel data: BROWSE all year observations using a single observation criterion

08 Sep 2016, 17:20

Hi.

This might be a basic question. For the purpose of a visual inspection I would like to browse all individuals (using id) in panel dataset, that during single year (out of 5 years) have 0 value for variable m, and neighbouring values (m[_n-1], m[_n+1]), some positive value.. so basically, i would like to see set of all year of observations per individual, no matter the observation positive value, as long as one of the observation fulfils the given criterion.

+----------------------------+
| year id m|
|----------------------------|
1. | 2006 6000622 0 |
2. | 2007 6000622 0 |
3. | 2008 6000622 0 |
4. | 2009 6000622 90 |
5. | 2010 6000622 0 |
6. | 2011 6000622 90 |
7. | 2012 6000622 90 |

i tried using bysort year(id): browse if ts=0 but browse command does not work with bysort.

Thanks in advance.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

08 Sep 2016, 17:35

So I think you want this:

Code:

by id (year), sort: egen to_browse = max(m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1]))
browse if to_browse

Comment

Mina Wu

Join Date: Jul 2015
Posts: 79

09 Sep 2016, 03:53

@Clyde, thank you!! This is indeed helpful. however, the command picks up individuals who have zero values and are being "bordered" by positive values, one of which belongs to another individual..

Code:

id                year    m
6047262    2006    0
6047262    2007    0
6047262    2008    0
6047262    2009    95.63
6047262    2010    95.63
6047262    2011    95.63
6047262    2012    95.63
6047262    2013    95.63
6047262    2014    95.63
6152473    2006    0
6152473    2007    95.29
6152473    2008    95.29
6152473    2009    95.29
6152473    2010    95.29
6152473    2011    95.29
6152473    2012    95.29
6152473    2013    95.29
6152473    2014    95.29

this is not what i need.. Any chance to avoid this?
Thanks!!

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

09 Sep 2016, 09:39

In the help file for egen, there's a short note about the use of explicit subscripting:

Explicit subscripting (using _N and _n), which is commonly used with generate, should not be used with egen; see subscripting.

This bites here because the max() egen function evaluates the expression without the use of a by prefix. Here's the heart of _gmax.ado:

Code:

        gen double `x'=`exp' 
        sort `touse' `by' `x'
        by `touse' `by': replace `varlist'=/*
            */ cond(`x'<., `x', `varlist'[_n-1]) if `touse'==1
        by `touse' `by': replace `varlist'=`varlist'[_N]

It's interesting to note that the total() egen function would generate the correct results. Here's an example to illustrate this and also the workaround to correctly perform the desired task.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year m)
6047262 2006     0
6047262 2007     0
6047262 2008     0
6047262 2009 95.63
6047262 2010 95.63
6047262 2011 95.63
6047262 2012 95.63
6047262 2013 95.63
6047262 2014 95.63
6152473 2006     0
6152473 2007 95.29
6152473 2008 95.29
6152473 2009 95.29
6152473 2010 95.29
6152473 2011 95.29
6152473 2012 95.29
6152473 2013 95.29
6152473 2014 95.29
6152474 2012 95.29
6152474 2013     0
6152474 2014 95.29
end

* these two egen calls should generate the same results but do not
isid id year, sort
by id: egen to_browse1 =   max(m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1]))
by id: egen to_browse2 = total(m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1]))

* to remove explicit subscripting from the egen call, use generate
by id: gen test = m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1])
by id: egen to_browse3 = max(test)

list, sepby(id)

and the results

Code:

. list, sepby(id)

     +----------------------------------------------------------------+
     |      id   year       m   to_bro~1   to_bro~2   test   to_bro~3 |
     |----------------------------------------------------------------|
  1. | 6047262   2006       0          0          0      0          0 |
  2. | 6047262   2007       0          0          0      0          0 |
  3. | 6047262   2008       0          0          0      0          0 |
  4. | 6047262   2009   95.63          0          0      0          0 |
  5. | 6047262   2010   95.63          0          0      0          0 |
  6. | 6047262   2011   95.63          0          0      0          0 |
  7. | 6047262   2012   95.63          0          0      0          0 |
  8. | 6047262   2013   95.63          0          0      0          0 |
  9. | 6047262   2014   95.63          0          0      0          0 |
     |----------------------------------------------------------------|
 10. | 6152473   2006       0          1          0      0          0 |
 11. | 6152473   2007   95.29          1          0      0          0 |
 12. | 6152473   2008   95.29          1          0      0          0 |
 13. | 6152473   2009   95.29          1          0      0          0 |
 14. | 6152473   2010   95.29          1          0      0          0 |
 15. | 6152473   2011   95.29          1          0      0          0 |
 16. | 6152473   2012   95.29          1          0      0          0 |
 17. | 6152473   2013   95.29          1          0      0          0 |
 18. | 6152473   2014   95.29          1          0      0          0 |
     |----------------------------------------------------------------|
 19. | 6152474   2012   95.29          1          1      0          1 |
 20. | 6152474   2013       0          1          1      1          1 |
 21. | 6152474   2014   95.29          1          1      0          1 |
     +----------------------------------------------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

09 Sep 2016, 09:51

That is very strange! This works properly in your example:

Code:

by id (year), sort: gen criterion = (m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1])) by id (year), sort: egen to_browse = max(criterion)

It seems as if it should be exactly equivalent to my original code in #2, but it produces different (and, for you, correct) results. I don't understand why.

Added: Crossed with Robert's response which explains the difference between the two. Thanks, Robert. I had forgotten that explicit subscripting cannot be relied upon with -egen-.
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#6

09 Sep 2016, 10:03

Just as a completely irrelevant note, Stata has a "bysort" function, which in my humble opinion is more elegant/convenient to write than either sort... by () and by(), sort. Of course this is fully in the domain of user preference and changes nothing as far as functionality goes (I think so at least).
Comment
Mina Wu

Join Date: Jul 2015

Posts: 79
#7

10 Sep 2016, 03:38

Thanks you all for the feedback! These two codes seem to produce what I need.

by id (year), sort: gen criterion = (m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1])) by id (year), sort: egen to_browse = max(criterion)

by id: egen to_browse2 = total(m == 0 & m[_n-1] > 0 & m[_n+1] > 0 & !missing(m[_n-1], m[_n+1]))
Comment

Mina Wu

Join Date: Jul 2015
Posts: 79

25 Sep 2016, 11:39

I was wondering if I can get the answer on somewhat related question, I have the same dataset described in post #1 (unbalanced panel).

I would like to browse variable x that is zeros or missing, by following criteria(by individual):

X is zero or missing in period 0. X is positive in previous (-1) and later (+1) period. Z and Y are positive in period 0, previous (-1) and later(+1) period
X is zero/missing in period 0 and 1, but is positive in previous (-1) and later period (+2). Z and Y are positive in all periods (-1, 0,1 +1 and +2)
X is zero/missing in period 0, 1, and 2,but is positive in previous (-1) and later (+3) period. Z and Y are positive in all periods (-1, 0,1 +1 and +2 and +3)

Here is the example of the dataset (variables: id, year, Z, Y and X).

[CODE]

Code:

input long id int year float(Z Y X)
6082319 2006  43  40 3.5
6082319 2007  40  37   1
6082319 2008  23  20   0
6082319 2009   8   6   0
6082319 2010  25  19   1
6082319 2013  88  76   1
6082319 2014  78   3   1
6116906 2006  57  40   1
6116906 2007  26  19   1
6116906 2008  41  33   0
6116906 2009  48  41   0
6116906 2010   9   8   0
6116906 2013 134  84   7
6116906 2014  71  22   7
6186688 2006 361 316   1
6186688 2007 149 104   1
6186688 2008 128 111   1
6186688 2009 109  59   .
6186688 2010  98  90   0
6186688 2012  16  20   1
6186688 2013 241  69  15
6186688 2014 230 108  13
6186769 2006  86  13   1
6186769 2007  34   8   0
6186769 2008  54  10   0
6186769 2009  19   1   .
6186769 2010  16 1.5   1
6186769 2013  14   2   1
6186769 2014  28   4 2.5
6287832 2006   7   1   2
6287832 2007  25   4   2
6287832 2008  23   2   1
6287832 2009  19   2   1
6287832 2010  21   2   .
6287832 2013  17   2   2
6287832 2014  19   4   1
6325491 2006  20  13   1
6325491 2007  37  23   0
6325491 2008  40  26   .
6325491 2009  21   8   1
6325491 2013   3   1   0
6358560 2006  83  62   2
6358560 2007 134 107   1
6358560 2008 134 108   1
6358560 2009 113  94   .
6358560 2010  78  70   .
6358560 2013  68  64   0
6358560 2014   1   5   1
6541887 2006 179 156   4
6541887 2007 229 203   4
6541887 2008  55  54   2
6541887 2010  43  41   0
6541887 2013  84  75   0
6541887 2014 247   2   5
6601731 2006  91  53   5
6601731 2007  95  59   7
6601731 2008 103  43   8
6601731 2009  67  14  11
6601731 2010  64  27   6
6601731 2012  37  30   0
6601731 2013  54  37   0
6601731 2014  59  33   1
6681450 2006  49  32   1
6681450 2007  47  30   1
6681450 2008  35  23   1
6681450 2009  33  20   .
6681450 2010  29  19   1
6681450 2013  10   6   1
6770606 2006 125 116   1
6770606 2007 119 106   1
6770606 2008 123 107   0
6770606 2009 133 134   0
6770606 2010 142 129   1
6770606 2012 313 131   1
6770606 2013 559 349   3
6770606 2014 178  26   3
6885748 2006  23   7   2
6885748 2007  30   8   3
6885748 2008  13   4   0
6885748 2009  19   4   0
6885748 2010  28   5   1
6885748 2013  10   3   0
6885748 2014   9   0   1
6991297 2006 224 147   1
6991297 2009  33  23   0
6991297 2010 536 494   0
6991297 2012  64  38   1
6991297 2013  86  57   1
6991297 2014  23   9   1
6996663 2006  31  24   1
6996663 2007  27  21   1
6996663 2008  24  21   1
6996663 2009  23  23   0
6996663 2010   2   2   0
6996663 2013   8   1   1
6996663 2014   7   1   1
6998933 2006  10   4   2
6998933 2007  11   5   .
6998933 2008  10   4   .
6998933 2009   8   1   2

is there a single command to encompass all three criteria? i can always use what was is suggested above, but it is cumbersome to do so and mistake can be easily made.

I also found this post useful in acquiring the answer.

http://www.statalist.org/forums/foru...rvations-panel

however, this post helps encounter for individuals with consecutive values of X=0 or consecutive values of X=. (missing). I also want to inspect the combination of the two criterions, by individual . In my dataset, zero and missing are the same thing..

These values of X, I would like to replace with the average of X in the neighbouring period (which contain positive value of X), conditional that the mentioned criteria holds.

thank you so much in advance!

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

26 Sep 2016, 12:50

You put together a rather complex set of conditions that requires looking up values around the current observation. There's no single-line command that will do this but rangestat (from SSC) can help tremendously in this case. To install it, type in Stata's Command window:

Code:

ssc install rangestat

Looking at your data example, the first thing to notice is that you have gaps in your data. Looking back and forward from the current period will yield missing values. You should declare you data as panel and refer to lag and lead periods using time-series varlists. See help xtset and help tsvarlist.

You can also simplify your task by precomputing what you consider a good value. Since Y and Z are always bunched together in your list of conditions, you can create a single indicator for them:

Code:

gen goodYZ = Y > 0 & Z > 0 & !mi(Y,Z)
gen goodX = X > 0 & !mi(X)

With rangestat, you can sum the number of good values over the desired time window. Here's I believe code that perform all 3 conditions:

Code:

* declare data to be panel
xtset id year

* precompute good values
gen goodYZ = Y > 0 & Z > 0 & !mi(Y,Z)
gen goodX = X > 0 & !mi(X)

* X is zero or missing in period 0. X is positive in previous (-1) and later (+1) period.
* Z and Y are positive in period 0, previous (-1) and later(+1) period
rangestat (sum) goodX goodYZ, interval(year -1 1) by(id)
gen tobrowse1 =  goodX_sum == 2 & goodX == 0 & goodYZ_sum == 3

* X is zero/missing in period 0 and 1, but is positive in previous (-1) and later period (+2).
* Z and Y are positive in all periods (-1, 0,1 +1 and +2)
drop *_sum
rangestat (sum) goodX goodYZ, interval(year -1 2) by(id)
gen tobrowse2 =  goodX_sum == 2 & goodX == 0 & F.goodX == 0 & goodYZ_sum == 4

* X is zero/missing in period 0, 1, and 2,but is positive in previous (-1) and later (+3) period.
* Z and Y are positive in all periods (-1, 0,1 +1 and +2 and +3)
drop *_sum
rangestat (sum) goodX goodYZ, interval(year -1 3) by(id)
rangestat (sum) nX=goodX, interval(year 0 2) by(id)
gen tobrowse3 =  goodX_sum == 2 & nX == 0 & goodYZ_sum == 5

Last edited by Robert Picard; 26 Sep 2016, 13:14.

Comment

Mina Wu

Join Date: Jul 2015

Posts: 79
#10

03 Oct 2016, 16:56

Thank you Robert. This was very helpful!
Comment

Announcement