Counting occurrences of binary variable within large data set

Clara Nelson

Join Date: Feb 2023

Posts: 6
#1

Counting occurrences of binary variable within large data set

17 Feb 2023, 10:22

Hi,

I am trying to determine the number of times maternity leave (matleave1) has been taken by individual (pidp) in a large data set and ideally I want a table showing me how many people have taken leave one, twice etc. The matleave1 variable takes the value of 1 if the individual is currently on maternity leave and 0 if they are not.

I have the below code but I know there is an issue as I am getting a larger number of total maternity leaves that what I see as the original variable.

by pidp, sort: summarize matleave1 if matleave1 ==1
sort pidp
bysort pidp: gen n_observations = sum(matleave1 == 1)
tabulate n_observations
Tags: None
Sawaeng Watcharathanakij

Join Date: Nov 2016

Posts: 13
#2

17 Feb 2023, 10:40

My best guess is that you have data in long format (some have multiple rows and others have only one row). You need to summarize the number of maternal leaves of each person by using "bysort pidp: egen newvariable=total(matleave1)". Then drop duplicates to have one row per person. HTH.
Comment
Daniel Shin

Join Date: Mar 2020

Posts: 146
#3

17 Feb 2023, 10:45

I think you need to tabulate by unique individuals and not the entire dataset:

Code:

egen total_leave=sum(matleave1), by(pidp) egen any_leave=max(matleave1), by(pidp) bysort pidp: gen ptseq=_n tab total_leave any_leave if ptseq==1
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35650
#4

17 Feb 2023, 10:48

#2 identifies much if not all of the problem but I think we need to see a data example to be confident of giving a good solution. For example, if the data are monthly then matleave will typically be 1 for some months for each pregnancy, and afterwards, or so I guess. .

Either way, the point is to count leaves, not observations during leaves.

Last edited by Nick Cox; 17 Feb 2023, 11:31.
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17703

17 Feb 2023, 10:50

Clara:
do you mean something along the following lines?

Code:

. use "https://www.stata-press.com/data/r17/nlswork.dta"
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. bysort idcode: egen wanted=sum( union)

. bysort idcode: gen wanted2=wanted if _n==1

. drop wanted

. tab wanted2

    wanted2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      3,070       65.17       65.17
          1 |        692       14.69       79.86
          2 |        328        6.96       86.82
          3 |        178        3.78       90.60
          4 |        143        3.04       93.63
          5 |        100        2.12       95.75
          6 |         58        1.23       96.99
          7 |         52        1.10       98.09
          8 |         32        0.68       98.77
          9 |         20        0.42       99.19
         10 |         19        0.40       99.60
         11 |         10        0.21       99.81
         12 |          9        0.19      100.00
------------+-----------------------------------
      Total |      4,711      100.00

.

Kind regards,
Carlo
(Stata 19.0)

Announcement

Counting occurrences of binary variable within large data set

Comment

Comment

Comment

Comment