  • Comparing duration of a continuous variable during two time periods with partially paired data

    Hello everyone. I am using a large database of more than 3 million observations and around 60 variables. It contains longitudinal data from the early 1980s to the late 2010s.
    Subjects are not individual people but rather job contracts, each identified by a personal identification number. The data are organized this way because the same person can sometimes hold two or more job contracts at the same time (not frequent, but not unheard of). Each contract has, among others, two date variables: the start date and the end date of the contract.
    I am interested in observing the difference in duration between contracts during two time periods: 2005 to 2010 (period A) and 2010 to 2015 (period B).

    Using those two start and end date variables and the help of a kind user of the forum, I was able to create the duration variable for each job contract, and also two extra variables, duration for the job contract during period A (durA) and duration for the job contract during period B (durB). You can check the original question here: https://www.statalist.org/forums/for...=1685987788624
    That means there are many missing values in those latter two variables, as not all contracts overlap those two periods. For example, a contract in the database might have started in 1990 and ended in 1995, so it would have missing values in both durA and durB (scenario 1). Some contracts have values only for durA (e.g. a contract that began in 2006 and ended in 2009) or only for durB (e.g. a contract that lasted from 2011 to 2013); this is scenario 2. Finally, some contracts have data for both durA and durB (scenario 3): a contract that began in 2000 and lasted until 2016 would have the maximum value for both durA and durB.
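
    To make the three scenarios concrete, here is a minimal sketch that labels each contract by which of the two period variables are non-missing. It assumes the variables are named durA and durB as above and are missing (.) whenever the contract does not overlap the period; the variable name scenario is made up for illustration.

```stata
* Hypothetical helper variable (not in the original data): scenario per contract.
* Start from scenario 2 (exactly one of durA, durB observed) ...
generate byte scenario = 2
* ... then overwrite the two other cases.
* Scenario 1: the contract overlaps neither period.
replace scenario = 1 if missing(durA) & missing(durB)
* Scenario 3: the contract overlaps both periods (the paired part of the data).
replace scenario = 3 if !missing(durA) & !missing(durB)
* Count how many contracts fall into each scenario.
tabulate scenario
```

    The resulting table shows how much of the data is paired (scenario 3) versus observed in a single period (scenario 2).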

    As far as I know, I can't perform a paired t-test on those two variables, as the data are only partially paired: only the contracts in scenario 3 have both values, while many contracts have a value for either durA or durB alone. I've read a few papers on techniques for comparisons with partially paired data, including weighted t-tests (e.g. https://www.tqmp.org/RegularArticles.../p055/p055.pdf, https://digitalcommons.wayne.edu/cgi...&context=jmasm, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9041716/, https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf), but none of them explains how to implement these techniques in Stata (one of the papers mentions a package for R, though).

    Would Welch's t-test work in this case? Or is there a way to perform the analysis using the main duration variable (the one that holds the duration of every contract) while comparing only those two periods, that is, 2005-2010 on the one hand and 2010-2015 on the other?

    Thank you so much for your help.

  • #2
    I'm posting examples of durA and durB (called dcomant and dcomdesp here), so that you can see how many missing values there are:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float dcomant
       .
       .
       .
       .
       .
       .
    1825
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
    1613
     150
      60
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
     586
     422
     367
       .
       .
      35
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
     347
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
    1825
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
    end


    And dcomdesp:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float dcomdesp
       .
      77
      19
       .
       .
       .
    1825
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
     120
       .
       .
    1307
     120
       .
       .
       .
       .
       .
       .
       .
     264
      31
       .
       .
       .
       .
       .
      11
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
      32
     249
      44
     919
       .
       .
       .
       .
       .
       .
       .
     319
      24
      68
      42
       4
      18
      80
       .
       .
       .
       .
    1825
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
    end


    As you can see, there are a lot of missing values, unlike in dcom, which holds the duration of each episode across all periods:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float dcom
      230
       77
       19
      182
      465
      267
    36646
      789
      699
      651
      867
      947
      213
      184
       10
       58
       31
       91
     1732
     1613
     3892
     1368
     1759
      355
     2707
      244
      151
     1075
      291
       58
      264
       31
       94
       67
       69
       77
       60
       11
      183
      182
     1005
      151
        4
      159
      202
      696
       43
      696
      422
      367
      160
      208
       35
       32
      249
       44
    31965
       89
       20
       38
       57
       56
       42
      347
      319
       24
       68
       42
        4
       18
       80
        .
        .
      424
     2224
    37985
       91
      306
       90
      294
     1262
       21
      168
      614
      272
       90
      298
       41
     1198
      151
     2084
      115
      292
      636
        .
      359
      186
    29402
       89
     1064
    end


    The problem is that I want to compare episode durations only in periods A and B, so I don't need all of the values in dcom. But if I use only dcomant and dcomdesp, the data are neither fully paired nor fully unpaired. I don't know how to perform an optimal pooled t-test in Stata, but perhaps I'm looking in the wrong direction and there is a different technique I should be using?
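
    For concreteness, the two built-in options would look roughly like the sketch below. It assumes dcomant and dcomdesp sit in the same dataset in memory (the -dataex- excerpts above were posted separately). Neither command is one of the partially paired methods from the papers; they are shown only to illustrate the dilemma.

```stata
* Paired t-test: only contracts with both values non-missing are used,
* so every contract observed in a single period is left out.
ttest dcomant == dcomdesp

* Welch's t-test: uses every non-missing value of each variable, but treats
* the two columns as independent samples, ignoring that some contracts
* contribute a value to both periods.
ttest dcomant == dcomdesp, unpaired unequal welch
```

    The first command discards the unpaired contracts, and the second ignores the dependence among the paired ones, which is why neither seems to fit partially paired data.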

    Thank you for your help in advance.
