Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting distinct observation in time range by -inrange-

    Dear statalists,

    I would like to count distinct observations based on certain criteria for previous three years, and save it as a dataset, but it seems I got some problem with my coding. The example below would help to show what I intend to do.

    Dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Year float(Loc Group) str8 BName long Country float id
    2001 1 2 "B" 1  1
    2001 1 3 "A" 2  2
    2001 1 3 "K" 2  3
    2001 2 1 "A" 2  4
    2001 2 2 "C" 1  5
    2003 1 2 "C" 1  6
    2003 1 3 "A" 2  7
    2003 2 2 "B" 1  8
    2003 2 3 "K" 2  9
    2004 1 2 "C" 1 10
    2004 1 3 "K" 2 11
    2004 2 2 "B" 1 12
    2004 2 3 "K" 2 13
    2005 1 2 "B" 1 14
    2005 1 3 "K" 2 15
    2005 2 2 "B" 1 16
    end
    label values Country Country
    label def Country 1 "Fr", modify
    label def Country 2 "UK", modify
    I would like to count distinct BNames that shows up in same Loc, Group, Country in the prior three year (as the concept of "number of peers"). For example, in 2004, when Loc==1, Group==2 and Country==Fr, the count would be 1, as both id_6 and id_10 show the same BName, C; but for the same criteria in 2005 the count would be 2, as in id_14 the BName is B, which is different from id_6 and id_10. I have tried the code like this

    Code:
    bysort Year Loc Group Country: egen sum=nvals(BName) if inrange(Year,Year[_n-2],Year[_n])
    bysort Year Loc Group Country: gen _flag=_n
    keep if _flag==1
    keep Year Loc Group Country sum

    It turns out to count peers in the same year only, while not count peers in previous years. I'm not sure that is the correct way to use -inrange- command, or I should change the entire code. I would much appreciate if anyone can help or provide any suggestions. Many thanks in advanced!


    Kind Regards,

    Melanie
    Last edited by Melanie Zhang; 25 Sep 2017, 07:48. Reason: inrange

  • #2
    The help for egen includes this warning:

    Explicit subscripting (using _N and _n), which is commonly used with generate, should not be used with egen; see subscripting.
    The bigger deal here is that your framework
    Code:
    bysort Year ...: egen sum=nvals(BName)
    inevitably puts different years in different groups, but you want to compare years. Also, you say "previous three years" but your code hints that you really want this year and the previous two. Further, you use the user-written program behind nvals() but do not cite its source, egenmore from SSC. This may help. I use rangerun from SSC.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Year float(Loc Group) str8 BName long Country float id
    2001 1 2 "B" 1  1
    2001 1 3 "A" 2  2
    2001 1 3 "K" 2  3
    2001 2 1 "A" 2  4
    2001 2 2 "C" 1  5
    2003 1 2 "C" 1  6
    2003 1 3 "A" 2  7
    2003 2 2 "B" 1  8
    2003 2 3 "K" 2  9
    2004 1 2 "C" 1 10
    2004 1 3 "K" 2 11
    2004 2 2 "B" 1 12
    2004 2 3 "K" 2 13
    2005 1 2 "B" 1 14
    2005 1 3 "K" 2 15
    2005 2 2 "B" 1 16
    end
    label values Country Country
    label def Country 1 "Fr", modify
    label def Country 2 "UK", modify
    
    encode BName, gen(BValue) 
    
    sort Country Loc Group Year 
    
    
    program myprog
       qui tab BValue 
       gen ndistinct = r(r)        
    end
    
    rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)
    
    list, sepby(Country Loc Group) 
    
         +---------------------------------------------------------------+
         | Year   Loc   Group   BName   Country   id   BValue   ndisti~t |
         |---------------------------------------------------------------|
      1. | 2001     1       2       B        Fr    1        B          1 |
      2. | 2003     1       2       C        Fr    6        C          2 |
      3. | 2004     1       2       C        Fr   10        C          1 |
      4. | 2005     1       2       B        Fr   14        B          2 |
         |---------------------------------------------------------------|
      5. | 2001     2       2       C        Fr    5        C          1 |
      6. | 2003     2       2       B        Fr    8        B          2 |
      7. | 2004     2       2       B        Fr   12        B          1 |
      8. | 2005     2       2       B        Fr   16        B          1 |
         |---------------------------------------------------------------|
      9. | 2001     1       3       A        UK    2        A          2 |
     10. | 2001     1       3       K        UK    3        K          2 |
     11. | 2003     1       3       A        UK    7        A          2 |
     12. | 2004     1       3       K        UK   11        K          2 |
     13. | 2005     1       3       K        UK   15        K          2 |
         |---------------------------------------------------------------|
     14. | 2001     2       1       A        UK    4        A          1 |
         |---------------------------------------------------------------|
     15. | 2003     2       3       K        UK    9        K          1 |
     16. | 2004     2       3       K        UK   13        K          1 |
         +---------------------------------------------------------------+

    Comment


    • #3
      If I understand your question correctly, you want:
      Code:
      encode BName, gen(bname)
      
      rangestat (count) bname, by(Loc Group Country) interval(Year -3 -1)
      Notes:
      1. If you don't already have it, you need to get Robert Picard, Nick Cox & Roberto Ferrer's -rangestat- command. -ssc install rangestat-. Do read -help rangestat- to understand how it works so that you can tweak this command if it isn't exactly right.

      2. -rangestat- only handles numeric variables, which is why I first -encode- BName.

      Added: Crossed with #2. Nick and I have interpreted the question differently. Nick's solution counts the number of distinct values of BName, mine counts the total number of them (whether distinct or not). It is not clear to me after re-reading your post which you want.
      Last edited by Clyde Schechter; 25 Sep 2017, 08:34.

      Comment


      • #4
        Clyde Schechter I took the word "distinct" as a flag, although it's a guess.

        Comment


        • #5
          I agree that the word distinct is there. But in reading the detailed explanation leading up to the results, it seemed as if distinctness didn't actually matter.

          Comment


          • #6
            We're guessing differently on "previous" too.

            Comment


            • #7
              Yes, we are. It's not coincidental that I began my response in #3 with "If I understand your question correctly..."

              Comment


              • #8
                Originally posted by Nick Cox View Post
                The help for egen includes this warning:



                The bigger deal here is that your framework
                Code:
                bysort Year ...: egen sum=nvals(BName)
                inevitably puts different years in different groups, but you want to compare years. Also, you say "previous three years" but your code hints that you really want this year and the previous two. Further, you use the user-written program behind nvals() but do not cite its source, egenmore from SSC. This may help. I use rangerun from SSC.
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input int Year float(Loc Group) str8 BName long Country float id
                2001 1 2 "B" 1 1
                2001 1 3 "A" 2 2
                2001 1 3 "K" 2 3
                2001 2 1 "A" 2 4
                2001 2 2 "C" 1 5
                2003 1 2 "C" 1 6
                2003 1 3 "A" 2 7
                2003 2 2 "B" 1 8
                2003 2 3 "K" 2 9
                2004 1 2 "C" 1 10
                2004 1 3 "K" 2 11
                2004 2 2 "B" 1 12
                2004 2 3 "K" 2 13
                2005 1 2 "B" 1 14
                2005 1 3 "K" 2 15
                2005 2 2 "B" 1 16
                end
                label values Country Country
                label def Country 1 "Fr", modify
                label def Country 2 "UK", modify
                
                encode BName, gen(BValue)
                
                sort Country Loc Group Year
                
                
                program myprog
                qui tab BValue
                gen ndistinct = r(r)
                end
                
                rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)
                
                list, sepby(Country Loc Group)
                
                +---------------------------------------------------------------+
                | Year Loc Group BName Country id BValue ndisti~t |
                |---------------------------------------------------------------|
                1. | 2001 1 2 B Fr 1 B 1 |
                2. | 2003 1 2 C Fr 6 C 2 |
                3. | 2004 1 2 C Fr 10 C 1 |
                4. | 2005 1 2 B Fr 14 B 2 |
                |---------------------------------------------------------------|
                5. | 2001 2 2 C Fr 5 C 1 |
                6. | 2003 2 2 B Fr 8 B 2 |
                7. | 2004 2 2 B Fr 12 B 1 |
                8. | 2005 2 2 B Fr 16 B 1 |
                |---------------------------------------------------------------|
                9. | 2001 1 3 A UK 2 A 2 |
                10. | 2001 1 3 K UK 3 K 2 |
                11. | 2003 1 3 A UK 7 A 2 |
                12. | 2004 1 3 K UK 11 K 2 |
                13. | 2005 1 3 K UK 15 K 2 |
                |---------------------------------------------------------------|
                14. | 2001 2 1 A UK 4 A 1 |
                |---------------------------------------------------------------|
                15. | 2003 2 3 K UK 9 K 1 |
                16. | 2004 2 3 K UK 13 K 1 |
                +---------------------------------------------------------------+
                Hi Nick and Clyde!

                Thank you both very much for the fast response and detailed discussion! I'm still learning the codes Nick drafted, and it is really helpful. I just realised the might be some confusions in my descriptions, which I do apologise:

                1. In terms of "distinct BName", I might not manage to make a good dataset example, but I was just trying to say that if one BName/BValue turns up more than one times in the 3-year range, it should still count as 1 rather than the times it turns up. I previously counted the "peers" in one year, and learned the counting distinct value code from the FAQ that Nick wrote
                https://www.stata.com/support/faqs/d...stinct-values/
                So if I just generate _n by sorting Year, a same BValue may count in separate groups, which should not be the case if I waive my time window into three-year.

                2. My apologies on the year range description. As my previous idea was to count the peer number, save it as a separate file, and merge it again with major dataset by Year[_n-1], the count by (Year -2 0) is supposed to measures the peers in (Year -3 0). Nick's code actually improve my original thought so I don't have to re-merge the data.

                Thank you both again for the discussion! I will post again if I come across any further questions.


                Best,

                Melanie

                Comment


                • #9
                  First, I would like to say thanks to Clyde and Nick for all your help on the board and for the creating/writing the rangejoin/rangerun/rangestat commands.

                  Revisiting this post, I have a related question. Using the same example, is it possible to save the unique Bname/Bvalues in the previous 3 years (for my sake it doesn't matter if that means -2 0 or -3 -1).

                  For example, in year=2003, country="Fr", loc=2, group=2 there are 2 Bnames (B & C). These are obs 5 & 6 from the original data. I would like to save the two Bnames (B & C), maybe in the format

                  columns = year country loc group Bname
                  row1 = 2003 Fr 2 2 B
                  row2 = 2003 Fr 2 2 C

                  and so on.

                  Is this possible?

                  The current output is different. For example, after
                  Code:
                   
                   list, sepby(Country Loc Group)
                  Row 6 states 2 distinct Bname values (ndistinct=2), but there is only one such row (year=2003, loc=2, group=2, country="Fr", Bname="B"). The second distinct obs for Row 6 is in Row 5 (year=2001, loc=2, group=2, country="Fr", Bname="C"). I would like the second distinct obs to be output as (year=2003, loc=2, group=2, country="Fr", Bname="C"). The main difference is the year would change from 2001 to 2003 (but I would like to also retain the 2001 row (observation #5) as that is the 1 distinct obs for (year=2001, loc=2, group=2, country="Fr")). Or maybe there is a simpler way? But I am looking to save the distinct Bnames, not the number of distinct Bnames.

                  Thanks in advance.

                  Comment


                  • #10
                    First, no credit where credit is not due. I had no role in the development of -rangejoin-, -rangestat-, or -rangerun-, and Nick Cox contributed only to -rangestat- among those. The -rangejoin- and -rangerun- commands, although they build on -rangestat-, are solely authored by Robert Picard. (You may be confused because I did collaborate with Robert Picard on the development of -runby-, which does something rather different, and Nick has collaborated with Robert on a number of other related and unrelated commands.)

                    If I understand your question correctly, what you want is something like:

                    Code:
                    preserve
                    tempfile copy
                    keep Loc Group Country BName Year
                    save `copy'
                    restore
                    
                    rangejoin Year -3 -1 using `copy', by(Loc Group Country)
                    Finally, when talking about Stata data sets it is best to avoid the terms "row" and "column." That is spreadsheet terminology. Apart from a superficial resemblance of Stata data sets to spreadsheets when viewed in the Browser/Editor, they are very different objects. When you use the terms row and column, you encourage your brain to think about analyzing data in Stata the way you might in a spreadsheet. But those spreadsheet-honed instincts will generally be unhelpful, and sometimes lead to serious problems, when working in Stata. It is best to try to separate the two in your mind as much as possible: you will work better in both domains if you do so. What you have referred to as rows are, in Stata-speak, observations, and the analog of columns is variables.

                    Comment


                    • #11
                      I knew that Nick and Robert worked on those 3 commands. But since I was writing to both of you, I thought I could thank both he and you. I know you work on different things, I didn't mean to thank you for those commands, just to thank you both in general for what you do. All three of you (Robert, Nick and yourself) are a HUGE help to the rest of us. (There are others as well who I am not thanking here.)

                      I will try to think NOT in a spreadsheet way. The matrix way just seems intuitive to differentiate between observations (rows) and variables (columns). But I will try to do as you suggest.


                      Your code works well. Thank you. Small additional request: is there a way to put the first value in the same variable as the others. For example, if we add another obs and now have more than 2 distinct Bnames,

                      Code:
                      clear
                      input int Year float(Loc Group) str8 BName long Country float id
                      2001 1 2 "B" 1 1
                      2001 1 3 "A" 2 2
                      2001 1 3 "K" 2 3
                      2001 2 1 "A" 2 4
                      2001 2 2 "C" 1 5
                      2003 1 2 "W" 1 6
                      2003 1 2 "C" 1 7
                      2003 1 3 "A" 2 8
                      2003 2 2 "B" 1 9
                      2003 2 3 "K" 2 10
                      2004 1 2 "C" 1 11
                      2004 1 3 "K" 2 12
                      2004 2 2 "B" 1 13
                      2004 2 3 "K" 2 14
                      2005 1 2 "B" 1 15
                      2005 1 3 "K" 2 16
                      2005 2 2 "B" 1 17
                      end
                      label values Country Country
                      label def Country 1 "Fr", modify
                      label def Country 2 "UK", modify
                      
                      encode BName, gen(BValue)
                      
                      sort Country Loc Group Year
                      
                      
                      program myprog
                      qui tab BValue
                      gen ndistinct = r(r)
                      end
                      
                      rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)
                      
                      list, sepby(Country Loc Group)
                      
                      preserve
                      tempfile copy
                      keep Loc Group Country BName Year
                      save `copy'
                      restore
                      
                      rangejoin Year -3 -1 using `copy', by(Loc Group Country)
                      The first unique obs shows in the "BName" variable while the second, third, etc unique BName values show up in "BName_U" variable. Is there a simple way to put all the unique values in the same variable?

                      In the output for the above code, could we insert an observation where year=2003, loc=1, group=2, country="Fr", and BName_U="W")?

                      (Note: I don't know if adding a new row (row # 6) into the above code was valuable? I just wanted to make sure it worked if there were more than 2 unique obs.)

                      One more question, should the timing in the code you wrote be the same as in the above rangerun command? Now rangerun has -2 0 and your rangejoin has -3 -1.

                      Thanks in advance.

                      Last edited by Kyle Smith; 04 Sep 2018, 10:06.

                      Comment


                      • #12
                        Yes, the -rangerun- and -rangejoin- should use the same intervals; my mistake.

                        I'm not completely clear on what you want the results to look like. The -rangejoin- command will find all the appropriate values of BName, some of which may appear more than once (say the same BName occurs in both 2001 and 2002 for matching to a 2003 observation). You don't say it in so many words, but I'm sensing that what you want ultimately is a list of distinct BNames all stacked into single variable. If so, you can do it this way:

                        Code:
                        clear*
                        input int Year float(Loc Group) str8 BName long Country float id
                        2001 1 2 "B" 1 1
                        2001 1 3 "A" 2 2
                        2001 1 3 "K" 2 3
                        2001 2 1 "A" 2 4
                        2001 2 2 "C" 1 5
                        2003 1 2 "W" 1 6
                        2003 1 2 "C" 1 7
                        2003 1 3 "A" 2 8
                        2003 2 2 "B" 1 9
                        2003 2 3 "K" 2 10
                        2004 1 2 "C" 1 11
                        2004 1 3 "K" 2 12
                        2004 2 2 "B" 1 13
                        2004 2 3 "K" 2 14
                        2005 1 2 "B" 1 15
                        2005 1 3 "K" 2 16
                        2005 2 2 "B" 1 17
                        end
                        label values Country Country
                        label def Country 1 "Fr", modify
                        label def Country 2 "UK", modify
                        
                        encode BName, gen(BValue)
                        
                        isid id
                        
                        sort Country Loc Group Year
                        
                        
                        
                        program myprog
                        qui tab BValue
                        gen ndistinct = r(r)
                        end
                        
                        rangerun myprog, use(BValue) inter(Year -2 0) by(Country Loc Group)
                        
                        
                        preserve
                        tempfile copy
                        keep Loc Group Country BName Year
                        save `copy'
                        restore
                        
                        rangejoin Year -2 0 using `copy', by(Loc Group Country)
                        
                        by id, sort: gen expander = cond(_n == 1 & !missing(BName_U), 2, 1)
                        expand expander
                        by id, sort: replace BName_U = BName if _n == 1
                        by id BName_U, sort: keep if _n == 1

                        Comment


                        • #13
                          Clyde,

                          "You da man." That's exactly what I was looking for. Thanks.

                          Comment

                          Working...
                          X