Comparing duration of a continuous variable during two time periods with partially paired data

Eduard López

Join Date: Dec 2014

Posts: 48

05 Jun 2023, 12:17

Hello everyone. I am using a large database of more than 3 million observations and around 60 variables. It contains longitudinal data from the early 1980s to the late 2010s.
The unit of observation is not the individual person but the job contract, each associated with a personal identification number. The data are organized this way because the same person can sometimes hold two or more job contracts at the same time (not frequent, but not unheard of). Each contract has, among other variables, two date variables: the start date and the end date of that contract.
I am interested in the difference in contract duration between two time periods: 2005 to 2010 (period A) and 2010 to 2015 (period B).

Using those two start and end date variables and the help of a kind user of the forum, I was able to create the duration variable for each job contract, and also two extra variables, duration for the job contract during period A (durA) and duration for the job contract during period B (durB). You can check the original question here: https://www.statalist.org/forums/for...=1685987788624
That means there are many missing values in those latter two variables, because not every contract overlaps both periods. There are three scenarios. Scenario 1: the contract overlaps neither period (for example, a contract that started in 1990 and ended in 1995), so both durA and durB are missing. Scenario 2: the contract has a value only for durA (e.g., a contract that began in 2006 and ended in 2009) or only for durB (e.g., a contract that lasted from 2011 to 2013). Scenario 3: the contract has values for both durA and durB (a contract that began in 2000 and lasted until 2016 would have the maximum value for both durA and durB).
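The three scenarios can be flagged directly from the missingness of the two variables. The following is only a minimal sketch on made-up toy rows; the variable `scenario` and the label name `scen` are hypothetical and not part of the original data.

```stata
* Toy data (invented for illustration): one row per contract.
clear
input float(durA durB)
   .    .
   .   77
1825 1825
 150    .
end

* 1 = neither period, 2 = exactly one period, 3 = both periods.
* !missing() evaluates to 1 when a value is present and 0 otherwise.
generate byte scenario = 1 + !missing(durA) + !missing(durB)

label define scen 1 "neither period" 2 "one period only" 3 "both periods"
label values scenario scen

* How many contracts fall into each scenario.
tabulate scenario
```

On the full data set, `tabulate scenario` would show how much of the sample is actually paired (scenario 3) versus unpaired (scenario 2).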

As far as I know, I can't perform a t-test on those two variables, as the data are only partially paired (only the contracts that fit scenario 3 are paired). Many contracts have values for either durA or durB only. I've read a few papers on techniques for making comparisons with partially paired data, including weighted t-tests (e.g., https://www.tqmp.org/RegularArticles.../p055/p055.pdf, https://digitalcommons.wayne.edu/cgi...&context=jmasm, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9041716/, https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf), but none of them explains how to implement these techniques in Stata (one of the papers mentions a package for R, though).

Would Welch's t-test work in this case? Or is there a way to perform the analysis using the main duration variable (the one that includes the duration of all contracts) while comparing durations in only those two periods (that is, the values falling within 2005-2010 on the one hand and 2010-2015 on the other)?
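For reference, this is how the two standard options behave in Stata on partially paired data. It is a sketch on invented toy values, not a recommendation: the paired test silently uses only the scenario 3 rows, while the Welch test uses every nonmissing value but treats the two columns as independent samples, which the scenario 3 rows are not.

```stata
* Toy data (invented for illustration): 4 paired rows, 4 unpaired rows.
clear
input float(durA durB)
   .   77
1825 1825
 150    .
 586  249
   .   19
 422   44
  60    .
 367  919
end

* Paired t-test: rows with durA or durB missing are dropped,
* so only contracts with values in both periods contribute.
ttest durA == durB

* Welch's t-test: two independent samples with unequal variances,
* using all nonmissing values of each variable. The within-contract
* correlation of the paired rows is ignored.
ttest durA == durB, unpaired unequal welch
```

Neither call is one of the weighted or pooled tests for partially paired data described in the papers cited above.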

Thank you so much for your help.

Eduard López


06 Jun 2023, 18:23

I'm posting two examples of durA and durB (called dcomant and dcomdesp here), so that you can see how many missing values there are:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float dcomant
   .
   .
   .
   .
   .
   .
1825
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
1613
 150
  60
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
 586
 422
 367
   .
   .
  35
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
 347
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
1825
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
end

And dcomdesp:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float dcomdesp
   .
  77
  19
   .
   .
   .
1825
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
 120
   .
   .
1307
 120
   .
   .
   .
   .
   .
   .
   .
 264
  31
   .
   .
   .
   .
   .
  11
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
  32
 249
  44
 919
   .
   .
   .
   .
   .
   .
   .
 319
  24
  68
  42
   4
  18
  80
   .
   .
   .
   .
1825
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
   .
end

As you can see, there are a lot of missing values, unlike in dcom, which holds the duration of each episode over all periods:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float dcom
  230
   77
   19
  182
  465
  267
36646
  789
  699
  651
  867
  947
  213
  184
   10
   58
   31
   91
 1732
 1613
 3892
 1368
 1759
  355
 2707
  244
  151
 1075
  291
   58
  264
   31
   94
   67
   69
   77
   60
   11
  183
  182
 1005
  151
    4
  159
  202
  696
   43
  696
  422
  367
  160
  208
   35
   32
  249
   44
31965
   89
   20
   38
   57
   56
   42
  347
  319
   24
   68
   42
    4
   18
   80
    .
    .
  424
 2224
37985
   91
  306
   90
  294
 1262
   21
  168
  614
  272
   90
  298
   41
 1198
  151
 2084
  115
  292
  636
    .
  359
  186
29402
   89
 1064
end

The problem is that I want to compare episode durations only in periods A and B, so I don't need all of the values in dcom. But if I use only dcomant and dcomdesp, the data are neither fully paired nor fully unpaired. I don't know how to perform an optimal pooled t-test in Stata, but perhaps I'm looking in the wrong direction and there is a different technique I should be using?
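One possible direction, sketched here under assumptions and not taken from the cited papers, is to stack the two period variables into long form and regress duration on a period indicator with standard errors clustered on the contract. The toy values are invented, and `id`, `dur`, `period` and `per` are hypothetical names; the sketch uses durA and durB as the variable names.

```stata
* Toy data (invented for illustration): 4 paired rows, 4 unpaired rows.
clear
input float(durA durB)
   .   77
1825 1825
 150    .
 586  249
   .   19
 422   44
  60    .
 367  919
end
generate long id = _n          // hypothetical contract identifier

* Stack durA and durB into one column: up to two rows per contract.
reshape long dur, i(id) j(period) string
drop if missing(dur)

* encode assigns codes alphabetically: "A" -> 1, "B" -> 2.
encode period, generate(per)

* The coefficient on 2.per is mean(period B) minus mean(period A).
* Clustering on id lets the two rows of a scenario 3 contract be
* correlated, while scenario 2 contracts contribute a single row.
regress dur i.per, vce(cluster id)
```

This uses all available observations from both periods in a single model. Whether it suits durations that are truncated at the period boundaries is a separate question that the sketch does not address.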

Thank you for your help in advance.
