Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting occurrences of binary variable within large data set

    Hi,

    I am trying to determine the number of times maternity leave (matleave1) has been taken by individual (pidp) in a large data set and ideally I want a table showing me how many people have taken leave one, twice etc. The matleave1 variable takes the value of 1 if the individual is currently on maternity leave and 0 if they are not.

    I have the below code but I know there is an issue as I am getting a larger number of total maternity leaves that what I see as the original variable.

    by pidp, sort: summarize matleave1 if matleave1 ==1
    sort pidp
    bysort pidp: gen n_observations = sum(matleave1 == 1)
    tabulate n_observations

    Click image for larger version

Name:	n_observations.png
Views:	1
Size:	12.4 KB
ID:	1702199
    Click image for larger version

Name:	matleave1 .png
Views:	1
Size:	8.7 KB
ID:	1702200

  • #2
    My best guess is that you have data in long format (some have multiple rows and others have only one row). You need to summarize the number of maternal leaves of each person by using "bysort pidp: egen newvariable=total(matleave1)". Then drop duplicates to have one row per person. HTH.

    Comment


    • #3
      I think you need to tabulate by unique individuals and not the entire dataset:

      Code:
      egen total_leave=sum(matleave1), by(pidp)
      egen any_leave=max(matleave1), by(pidp)
      bysort pidp: gen ptseq=_n
      tab total_leave any_leave if ptseq==1

      Comment


      • #4
        #2 identifies much if not all of the problem but I think we need to see a data example to be confident of giving a good solution. For example, if the data are monthly then matleave will typically be 1 for some months for each pregnancy, and afterwards, or so I guess. .

        Either way, the point is to count leaves, not observations during leaves.
        Last edited by Nick Cox; 17 Feb 2023, 11:31.

        Comment


        • #5
          Clara:
          do you mean something along the following lines?
          Code:
          . use "https://www.stata-press.com/data/r17/nlswork.dta"
          (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
          
          . bysort idcode: egen wanted=sum( union)
          
          . bysort idcode: gen wanted2=wanted if _n==1
          
          . drop wanted
          
          . tab wanted2
          
              wanted2 |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |      3,070       65.17       65.17
                    1 |        692       14.69       79.86
                    2 |        328        6.96       86.82
                    3 |        178        3.78       90.60
                    4 |        143        3.04       93.63
                    5 |        100        2.12       95.75
                    6 |         58        1.23       96.99
                    7 |         52        1.10       98.09
                    8 |         32        0.68       98.77
                    9 |         20        0.42       99.19
                   10 |         19        0.40       99.60
                   11 |         10        0.21       99.81
                   12 |          9        0.19      100.00
          ------------+-----------------------------------
                Total |      4,711      100.00
          
          .
          Kind regards,
          Carlo
          (Stata 19.0)

          Comment

          Working...
          X