Counting distinct observation in time range by -inrange-

Melanie Zhang

Join Date: Nov 2016

Posts: 18
#1

Counting distinct observation in time range by -inrange-

25 Sep 2017, 07:47

Dear statalists,

I would like to count distinct observations based on certain criteria for previous three years, and save it as a dataset, but it seems I got some problem with my coding. The example below would help to show what I intend to do.

Dataset

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int Year float(Loc Group) str8 BName long Country float id 2001 1 2 "B" 1 1 2001 1 3 "A" 2 2 2001 1 3 "K" 2 3 2001 2 1 "A" 2 4 2001 2 2 "C" 1 5 2003 1 2 "C" 1 6 2003 1 3 "A" 2 7 2003 2 2 "B" 1 8 2003 2 3 "K" 2 9 2004 1 2 "C" 1 10 2004 1 3 "K" 2 11 2004 2 2 "B" 1 12 2004 2 3 "K" 2 13 2005 1 2 "B" 1 14 2005 1 3 "K" 2 15 2005 2 2 "B" 1 16 end label values Country Country label def Country 1 "Fr", modify label def Country 2 "UK", modify

I would like to count distinct BNames that shows up in same Loc, Group, Country in the prior three year (as the concept of "number of peers"). For example, in 2004, when Loc==1, Group==2 and Country==Fr, the count would be 1, as both id_6 and id_10 show the same BName, C; but for the same criteria in 2005 the count would be 2, as in id_14 the BName is B, which is different from id_6 and id_10. I have tried the code like this

Code:

bysort Year Loc Group Country: egen sum=nvals(BName) if inrange(Year,Year[_n-2],Year[_n]) bysort Year Loc Group Country: gen _flag=_n keep if _flag==1 keep Year Loc Group Country sum

It turns out to count peers in the same year only, while not count peers in previous years. I'm not sure that is the correct way to use -inrange- command, or I should change the entire code. I would much appreciate if anyone can help or provide any suggestions. Many thanks in advanced!

Kind Regards,

Melanie

Last edited by Melanie Zhang; 25 Sep 2017, 07:48. Reason: inrange
Tags: distinct observation, egen, inrange

Nick Cox

Join Date: Mar 2014
Posts: 35709

25 Sep 2017, 08:24

The help for egen includes this warning:

Explicit subscripting (using _N and _n), which is commonly used with generate, should not be used with egen; see subscripting.

The bigger deal here is that your framework

Code:

bysort Year ...: egen sum=nvals(BName)

inevitably puts different years in different groups, but you want to compare years. Also, you say "previous three years" but your code hints that you really want this year and the previous two. Further, you use the user-written program behind nvals() but do not cite its source, egenmore from SSC. This may help. I use rangerun from SSC.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int Year float(Loc Group) str8 BName long Country float id
2001 1 2 "B" 1  1
2001 1 3 "A" 2  2
2001 1 3 "K" 2  3
2001 2 1 "A" 2  4
2001 2 2 "C" 1  5
2003 1 2 "C" 1  6
2003 1 3 "A" 2  7
2003 2 2 "B" 1  8
2003 2 3 "K" 2  9
2004 1 2 "C" 1 10
2004 1 3 "K" 2 11
2004 2 2 "B" 1 12
2004 2 3 "K" 2 13
2005 1 2 "B" 1 14
2005 1 3 "K" 2 15
2005 2 2 "B" 1 16
end
label values Country Country
label def Country 1 "Fr", modify
label def Country 2 "UK", modify

encode BName, gen(BValue) 

sort Country Loc Group Year 


program myprog
   qui tab BValue 
   gen ndistinct = r(r)        
end

rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)

list, sepby(Country Loc Group) 

     +---------------------------------------------------------------+
     | Year   Loc   Group   BName   Country   id   BValue   ndisti~t |
     |---------------------------------------------------------------|
  1. | 2001     1       2       B        Fr    1        B          1 |
  2. | 2003     1       2       C        Fr    6        C          2 |
  3. | 2004     1       2       C        Fr   10        C          1 |
  4. | 2005     1       2       B        Fr   14        B          2 |
     |---------------------------------------------------------------|
  5. | 2001     2       2       C        Fr    5        C          1 |
  6. | 2003     2       2       B        Fr    8        B          2 |
  7. | 2004     2       2       B        Fr   12        B          1 |
  8. | 2005     2       2       B        Fr   16        B          1 |
     |---------------------------------------------------------------|
  9. | 2001     1       3       A        UK    2        A          2 |
 10. | 2001     1       3       K        UK    3        K          2 |
 11. | 2003     1       3       A        UK    7        A          2 |
 12. | 2004     1       3       K        UK   11        K          2 |
 13. | 2005     1       3       K        UK   15        K          2 |
     |---------------------------------------------------------------|
 14. | 2001     2       1       A        UK    4        A          1 |
     |---------------------------------------------------------------|
 15. | 2003     2       3       K        UK    9        K          1 |
 16. | 2004     2       3       K        UK   13        K          1 |
     +---------------------------------------------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30112
#3

25 Sep 2017, 08:31

If I understand your question correctly, you want:

Code:

encode BName, gen(bname) rangestat (count) bname, by(Loc Group Country) interval(Year -3 -1)

Notes:
1. If you don't already have it, you need to get Robert Picard, Nick Cox & Roberto Ferrer's -rangestat- command. -ssc install rangestat-. Do read -help rangestat- to understand how it works so that you can tweak this command if it isn't exactly right.

2. -rangestat- only handles numeric variables, which is why I first -encode- BName.

Added: Crossed with #2. Nick and I have interpreted the question differently. Nick's solution counts the number of distinct values of BName, mine counts the total number of them (whether distinct or not). It is not clear to me after re-reading your post which you want.

Last edited by Clyde Schechter; 25 Sep 2017, 08:34.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35709
#4

25 Sep 2017, 08:38

Clyde Schechter I took the word "distinct" as a flag, although it's a guess.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30112
#5

25 Sep 2017, 08:45

I agree that the word distinct is there. But in reading the detailed explanation leading up to the results, it seemed as if distinctness didn't actually matter.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35709
#6

25 Sep 2017, 09:01

We're guessing differently on "previous" too.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30112
#7

25 Sep 2017, 09:14

Yes, we are. It's not coincidental that I began my response in #3 with "If I understand your question correctly..."
Comment
Melanie Zhang

Join Date: Nov 2016

Posts: 18
#8

25 Sep 2017, 10:14

Originally posted by Nick Cox View Post

The help for egen includes this warning:

The bigger deal here is that your framework

Code:

bysort Year ...: egen sum=nvals(BName)

inevitably puts different years in different groups, but you want to compare years. Also, you say "previous three years" but your code hints that you really want this year and the previous two. Further, you use the user-written program behind nvals() but do not cite its source, egenmore from SSC. This may help. I use rangerun from SSC.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int Year float(Loc Group) str8 BName long Country float id 2001 1 2 "B" 1 1 2001 1 3 "A" 2 2 2001 1 3 "K" 2 3 2001 2 1 "A" 2 4 2001 2 2 "C" 1 5 2003 1 2 "C" 1 6 2003 1 3 "A" 2 7 2003 2 2 "B" 1 8 2003 2 3 "K" 2 9 2004 1 2 "C" 1 10 2004 1 3 "K" 2 11 2004 2 2 "B" 1 12 2004 2 3 "K" 2 13 2005 1 2 "B" 1 14 2005 1 3 "K" 2 15 2005 2 2 "B" 1 16 end label values Country Country label def Country 1 "Fr", modify label def Country 2 "UK", modify encode BName, gen(BValue) sort Country Loc Group Year program myprog qui tab BValue gen ndistinct = r(r) end rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group) list, sepby(Country Loc Group) +---------------------------------------------------------------+ | Year Loc Group BName Country id BValue ndisti~t | |---------------------------------------------------------------| 1. | 2001 1 2 B Fr 1 B 1 | 2. | 2003 1 2 C Fr 6 C 2 | 3. | 2004 1 2 C Fr 10 C 1 | 4. | 2005 1 2 B Fr 14 B 2 | |---------------------------------------------------------------| 5. | 2001 2 2 C Fr 5 C 1 | 6. | 2003 2 2 B Fr 8 B 2 | 7. | 2004 2 2 B Fr 12 B 1 | 8. | 2005 2 2 B Fr 16 B 1 | |---------------------------------------------------------------| 9. | 2001 1 3 A UK 2 A 2 | 10. | 2001 1 3 K UK 3 K 2 | 11. | 2003 1 3 A UK 7 A 2 | 12. | 2004 1 3 K UK 11 K 2 | 13. | 2005 1 3 K UK 15 K 2 | |---------------------------------------------------------------| 14. | 2001 2 1 A UK 4 A 1 | |---------------------------------------------------------------| 15. | 2003 2 3 K UK 9 K 1 | 16. | 2004 2 3 K UK 13 K 1 | +---------------------------------------------------------------+

Hi Nick and Clyde!

Thank you both very much for the fast response and detailed discussion! I'm still learning the codes Nick drafted, and it is really helpful. I just realised the might be some confusions in my descriptions, which I do apologise:

1. In terms of "distinct BName", I might not manage to make a good dataset example, but I was just trying to say that if one BName/BValue turns up more than one times in the 3-year range, it should still count as 1 rather than the times it turns up. I previously counted the "peers" in one year, and learned the counting distinct value code from the FAQ that Nick wrote
https://www.stata.com/support/faqs/d...stinct-values/
So if I just generate _n by sorting Year, a same BValue may count in separate groups, which should not be the case if I waive my time window into three-year.

2. My apologies on the year range description. As my previous idea was to count the peer number, save it as a separate file, and merge it again with major dataset by Year[_n-1], the count by (Year -2 0) is supposed to measures the peers in (Year -3 0). Nick's code actually improve my original thought so I don't have to re-merge the data.

Thank you both again for the discussion! I will post again if I come across any further questions.

Best,

Melanie
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#9

04 Sep 2018, 07:37

First, I would like to say thanks to Clyde and Nick for all your help on the board and for the creating/writing the rangejoin/rangerun/rangestat commands.

Revisiting this post, I have a related question. Using the same example, is it possible to save the unique Bname/Bvalues in the previous 3 years (for my sake it doesn't matter if that means -2 0 or -3 -1).

For example, in year=2003, country="Fr", loc=2, group=2 there are 2 Bnames (B & C). These are obs 5 & 6 from the original data. I would like to save the two Bnames (B & C), maybe in the format

columns = year country loc group Bname
row1 = 2003 Fr 2 2 B
row2 = 2003 Fr 2 2 C

and so on.

Is this possible?

The current output is different. For example, after

Code:

list, sepby(Country Loc Group)

Row 6 states 2 distinct Bname values (ndistinct=2), but there is only one such row (year=2003, loc=2, group=2, country="Fr", Bname="B"). The second distinct obs for Row 6 is in Row 5 (year=2001, loc=2, group=2, country="Fr", Bname="C"). I would like the second distinct obs to be output as (year=2003, loc=2, group=2, country="Fr", Bname="C"). The main difference is the year would change from 2001 to 2003 (but I would like to also retain the 2001 row (observation #5) as that is the 1 distinct obs for (year=2001, loc=2, group=2, country="Fr")). Or maybe there is a simpler way? But I am looking to save the distinct Bnames, not the number of distinct Bnames.

Thanks in advance.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30112
#10

04 Sep 2018, 08:38

First, no credit where credit is not due. I had no role in the development of -rangejoin-, -rangestat-, or -rangerun-, and Nick Cox contributed only to -rangestat- among those. The -rangejoin- and -rangerun- commands, although they build on -rangestat-, are solely authored by Robert Picard. (You may be confused because I did collaborate with Robert Picard on the development of -runby-, which does something rather different, and Nick has collaborated with Robert on a number of other related and unrelated commands.)

If I understand your question correctly, what you want is something like:

Code:

preserve tempfile copy keep Loc Group Country BName Year save `copy' restore rangejoin Year -3 -1 using `copy', by(Loc Group Country)

Finally, when talking about Stata data sets it is best to avoid the terms "row" and "column." That is spreadsheet terminology. Apart from a superficial resemblance of Stata data sets to spreadsheets when viewed in the Browser/Editor, they are very different objects. When you use the terms row and column, you encourage your brain to think about analyzing data in Stata the way you might in a spreadsheet. But those spreadsheet-honed instincts will generally be unhelpful, and sometimes lead to serious problems, when working in Stata. It is best to try to separate the two in your mind as much as possible: you will work better in both domains if you do so. What you have referred to as rows are, in Stata-speak, observations, and the analog of columns is variables.
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#11

04 Sep 2018, 09:29

I knew that Nick and Robert worked on those 3 commands. But since I was writing to both of you, I thought I could thank both he and you. I know you work on different things, I didn't mean to thank you for those commands, just to thank you both in general for what you do. All three of you (Robert, Nick and yourself) are a HUGE help to the rest of us. (There are others as well who I am not thanking here.)

I will try to think NOT in a spreadsheet way. The matrix way just seems intuitive to differentiate between observations (rows) and variables (columns). But I will try to do as you suggest.

Your code works well. Thank you. Small additional request: is there a way to put the first value in the same variable as the others. For example, if we add another obs and now have more than 2 distinct Bnames,

Code:

clear input int Year float(Loc Group) str8 BName long Country float id 2001 1 2 "B" 1 1 2001 1 3 "A" 2 2 2001 1 3 "K" 2 3 2001 2 1 "A" 2 4 2001 2 2 "C" 1 5 2003 1 2 "W" 1 6 2003 1 2 "C" 1 7 2003 1 3 "A" 2 8 2003 2 2 "B" 1 9 2003 2 3 "K" 2 10 2004 1 2 "C" 1 11 2004 1 3 "K" 2 12 2004 2 2 "B" 1 13 2004 2 3 "K" 2 14 2005 1 2 "B" 1 15 2005 1 3 "K" 2 16 2005 2 2 "B" 1 17 end label values Country Country label def Country 1 "Fr", modify label def Country 2 "UK", modify encode BName, gen(BValue) sort Country Loc Group Year program myprog qui tab BValue gen ndistinct = r(r) end rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group) list, sepby(Country Loc Group) preserve tempfile copy keep Loc Group Country BName Year save `copy' restore rangejoin Year -3 -1 using `copy', by(Loc Group Country)

The first unique obs shows in the "BName" variable while the second, third, etc unique BName values show up in "BName_U" variable. Is there a simple way to put all the unique values in the same variable?

In the output for the above code, could we insert an observation where year=2003, loc=1, group=2, country="Fr", and BName_U="W")?

(Note: I don't know if adding a new row (row # 6) into the above code was valuable? I just wanted to make sure it worked if there were more than 2 unique obs.)

One more question, should the timing in the code you wrote be the same as in the above rangerun command? Now rangerun has -2 0 and your rangejoin has -3 -1.

Thanks in advance.

Last edited by Kyle Smith; 04 Sep 2018, 10:06.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30112

#12

04 Sep 2018, 10:58

Yes, the -rangerun- and -rangejoin- should use the same intervals; my mistake.

I'm not completely clear on what you want the results to look like. The -rangejoin- command will find all the appropriate values of BName, some of which may appear more than once (say the same BName occurs in both 2001 and 2002 for matching to a 2003 observation). You don't say it in so many words, but I'm sensing that what you want ultimately is a list of distinct BNames all stacked into single variable. If so, you can do it this way:

Code:

clear*
input int Year float(Loc Group) str8 BName long Country float id
2001 1 2 "B" 1 1
2001 1 3 "A" 2 2
2001 1 3 "K" 2 3
2001 2 1 "A" 2 4
2001 2 2 "C" 1 5
2003 1 2 "W" 1 6
2003 1 2 "C" 1 7
2003 1 3 "A" 2 8
2003 2 2 "B" 1 9
2003 2 3 "K" 2 10
2004 1 2 "C" 1 11
2004 1 3 "K" 2 12
2004 2 2 "B" 1 13
2004 2 3 "K" 2 14
2005 1 2 "B" 1 15
2005 1 3 "K" 2 16
2005 2 2 "B" 1 17
end
label values Country Country
label def Country 1 "Fr", modify
label def Country 2 "UK", modify

encode BName, gen(BValue)

isid id

sort Country Loc Group Year



program myprog
qui tab BValue
gen ndistinct = r(r)
end

rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)


preserve
tempfile copy
keep Loc Group Country BName Year
save `copy'
restore

rangejoin Year -2 0 using `copy', by(Loc Group Country)

by id, sort: gen expander = cond(_n == 1 & !missing(BName_U), 2, 1)
expand expander
by id, sort: replace BName_U = BName if _n == 1
by id BName_U, sort: keep if _n == 1

Comment

Kyle Smith

Join Date: Mar 2017

Posts: 117
#13

04 Sep 2018, 11:20

Clyde,

"You da man." That's exactly what I was looking for. Thanks.
Comment

Announcement