
  • Identifying how many times Distinct Observations occur within another variable.

    Hi,

    I am using panel data from 2008-2019 and I am trying to identify the number of participants who have been unemployed at any given time, as well as the length of time they have been unemployed. I have been able to generate the distinct observations relatively simply, but I am struggling with the second part of this hurdle (i.e., linking the distinct observations to the unemployment status, since this isn't always a static response and changes over time).

    I have tried variations of the code below, based on some previous queries on Statalist that tried to identify distinct observations based on a second characteristic, but these efforts haven't been successful.

    Code:
    bysort ws089_ nomem_encr: gen count = _n == 1
    by ws089_: replace count = sum(count)
    by ws089_: replace count = count[_N]
    I have gone through the following column, but don't believe it covers this specific query:
    http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

    Example code below.
    I have been able to generate, relatively simply, a variable that flags each distinct participant (it is 1 for the first observation of each participant and 0 otherwise) using the code:
    Code:
    by nomem_encr, sort: gen nvals = _n == 1
    Dataex output containing 4 variables: nomem_encr, the unique identifier; Year, the year of the data (I only included 3 years); ws089_, the reported unemployment dummy variable (1 = unemployed); and nvals, the distinct observation variable generated above.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double nomem_encr int Year double ws089_ float nvals
    800009 2008 . 1
    800009 2009 . 0
    800009 2010 . 0
    800015 2008 . 1
    800015 2009 . 0
    800015 2010 0 0
    800018 2008 . 1
    800018 2009 . 0
    800018 2010 . 0
    800033 2008 0 1
    800033 2009 0 0
    800033 2010 0 0
    800042 2008 0 1
    800042 2009 0 0
    800042 2010 0 0
    800054 2008 . 1
    800054 2009 . 0
    800054 2010 . 0
    800057 2008 0 1
    800057 2009 0 0
    800057 2010 0 0
    800076 2008 . 1
    800076 2009 0 0
    800076 2010 . 0
    800085 2008 . 1
    800085 2009 . 0
    800085 2010 . 0
    800100 2008 . 1
    800100 2009 . 0
    800100 2010 . 0
    800119 2008 1 1
    800119 2009 1 0
    800119 2010 1 0
    800125 2008 1 1
    800125 2009 0 0
    800125 2010 0 0
    800131 2008 . 1
    800131 2009 . 0
    800131 2010 0 0
    800134 2008 . 1
    800134 2009 . 0
    800134 2010 . 0
    800155 2008 0 1
    800155 2009 . 0
    800155 2010 . 0
    800158 2008 0 1
    800158 2009 . 0
    800158 2010 . 0
    800161 2008 . 1
    800161 2009 . 0
    end
    label values ws089_ cw11d089
    label def cw11d089 0 "no", modify
    label def cw11d089 1 "yes", modify

  • #2
    Thanks for the data example.

    Let's get the terminology straight. In Stata an observation is a row, case or record in the dataset. Your problem is about counting distinct values.

    The code you give is equivalent to that discussed on p.563 of https://www.stata-journal.com/sjpdf....iclenum=dm0042 from which one might write

    Code:
    . egen tag = tag(ws089_ nomem_encr)
    
    . egen distinct = total(tag), by(ws089_)
    
    . 
    . tabdisp ws089_ , c(distinct)
    
    ----------------------
       ws089_ |   distinct
    ----------+-----------
           no |          9
          yes |          2
            . |          0
    ----------------------
    Now

    Code:
    egen tag = tag(ws089_ nomem_encr)
    is equivalent to

    Code:
    bysort ws089_ nomem_encr: gen count = _n == 1
    and indeed the same idea is the core of egen, tag() and has been since 1999, as in https://www.stata.com/products/stb/journals/stb50.pdf. Similarly,
    Code:
       
     by ws089_: replace count = sum(count)  by ws089_: replace count = count[_N]
    is equivalent to
    Code:
    egen distinct = total(tag), by(ws089_)
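
    One way to check the claimed equivalence on your own data is sketched below. The variable names tag2, distinct2 and count2 are hypothetical, chosen only to avoid clashing with variables created earlier. The comparison is restricted to non-missing ws089_ because egen's tag() excludes observations with missing values unless the missing option is specified, whereas the bysort version also tags them; that is why the table above shows 0 for the missing row.

    ```stata
    * Sketch only: tag2, distinct2 and count2 are hypothetical names.
    * egen route
    egen tag2 = tag(ws089_ nomem_encr)
    egen distinct2 = total(tag2), by(ws089_)

    * by: route, as in the original post
    bysort ws089_ nomem_encr: gen count2 = _n == 1
    by ws089_: replace count2 = sum(count2)
    by ws089_: replace count2 = count2[_N]

    * The two results should agree wherever ws089_ is not missing
    assert count2 == distinct2 if !missing(ws089_)
    ```

    If the assert passes silently, the two routes give the same counts for the "no" and "yes" groups.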

    So, I don't understand what the issue is. You've re-discovered or re-invented the method in dm0042, which in turn uses just standard Stata idioms. That's fine. Indeed, although the code in dm0042 is shorter, yours would be faster to execute. Put directly in terms of your data example, only two distinct people have ever been unemployed, namely 800119 and 800125, which is what the code above reports.
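
    If the remaining question is how long each person was unemployed, here is a minimal sketch, assuming that "length" means the total number of years with ws089_ equal to 1 and that years with missing ws089_ count as not unemployed. The names ever_unemp, yrs_unemp and onetag are hypothetical. Note that this does not distinguish consecutive spells from separate ones; that would need more work.

    ```stata
    * Sketch only: ever_unemp, yrs_unemp and onetag are hypothetical names.
    * (ws089_ == 1) evaluates to 0 when ws089_ is missing, so missing
    * years are treated as "not unemployed" here.
    egen ever_unemp = max(ws089_ == 1), by(nomem_encr)
    egen yrs_unemp = total(ws089_ == 1), by(nomem_encr)

    * Tag one observation per person so each person is counted once
    egen onetag = tag(nomem_encr)

    * Number of distinct people ever reporting unemployment
    count if onetag & ever_unemp

    * Distribution of years unemployed among those people
    tabulate yrs_unemp if onetag & ever_unemp
    ```

    On the data example the count should be 2, matching the result above: 800119 reports unemployment in all three years and 800125 in 2008 only.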
