How to impute this missing variable?

Doris Rivera

Join Date: Feb 2020

Posts: 172
#1

10 Oct 2021, 07:25

Dear Statalist, I would like some advice on how to correctly impute the end date of a contract, which is missing in around 50% of my dataset. My dataset is longitudinal: each line is one contract for one person, with the start and end dates of the contract (see the example below). My idea is to compute the average contract duration among the contracts with a non-missing end date (by type of contract) and add this average to the start date to fill in the missing end dates. My goal is to know whether a person is working on a certain date of the year. However, I am not sure this is a correct way to impute the date. I have seen that there are more complex methods, such as multiple imputation (MI).

1) Do you think it makes sense to impute the end date of the contract with MI (or is MI only for specific types of variables)?
2) If I use MI, should I account for the structure of the dataset, which is a kind of multilevel?
3) Do the explanatory variables “always” need to be non-missing?
4) The imputed information will be used first in a descriptive analysis and thereafter in a regression. Is it possible to tell Stata that the imputed date cannot be earlier than the start date of the contract?
5) Is it possible to enrich the MI model with variables that do not vary within person (e.g., gender)?
6) Is there a test or procedure to run after the imputation to check how good it is? I mean, something like setting some observed end dates to missing, imputing them, and then comparing the imputations with the real dates? And how would I do it?
7) Do you think that using the average mentioned above might be a good approach for academic research?

Thanks for the help, and sorry for so many questions!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id start end)
1 14575 14695
1 14625 14727
1 14656     .
1 15412 15485
1 15414 15732
1 15661     .
1 15907 15979
1 15911 15983
1 15913 15985
1 16147     .
1 16184 16265
1 16196     .
1 16197 16269
1 16239 16311
1 16245 16318
1 16253 16325
1 16280 16352
1 16364 16440
1 16406 16480
1 16409     .
1 16665     .
1 17444     .
1 17669 17861
1 17860     .
1 17950     .
1 18249 18501
1 18461 18897
1 18765 18908
1 18826 19263
1 19192 19628
1 19557 19993
1 19922 20358
1 20287 20724
1 20653 21089
1 21018 21454
1 21080 21454
1 21383 21819
1 21748 22185
2 14991     .
2 15006     .
2 15856 15958
2 15887     .
2 16257 16353
2 16361 16468
2 16535 16736
2 16667     .
2 16908 17101
2 17035     .
2 17075 17558
2 17487     .
3 13729     .
3 13925     .
3 14290     .
3 14961     .
3 14968 15040
3 14970 15042
3 14973 15225
3 15023 15302
3 15374     .
3 16933 17132
3 17738 17851
3 17920 18175
3 17920 18175
3 17983 18173
3 19973 20226
3 20174     .
3 20489     .
3 20598     .
4 15123     .
4 15382 15487
4 15493 15571
4 15500 15579
4 15508 15611
4 15767     .
4 15821     .
4 15923 16097
4 16026 16279
4 16574 16653
4 16583 16700
4 16938 17044
4 16973 17055
4 16984 17068
4 17549     .
4 17675     .
4 17949     .
4 18006 18260
4 18189     .
5 16205 16368
5 16583 16706
5 16695 17029
5 16956 17242
5 17393     .
5 17395 17649
5 17430 17683
5 17577     .
5 17613 17868
5 17798     .
5 17902 18082
5 18012 18180
5 18131 18567
end
format %td start
format %td end
daniel klein

Join Date: Mar 2014

Posts: 3860
#2

10 Oct 2021, 09:52

Originally posted by Doris Rivera

[...] the end date of a contract, which is missing in around 50% of my dataset.

Why? What do you believe (or know) about the reasons for the end-date being missing?
Doris Rivera

Join Date: Feb 2020

Posts: 172
#3

10 Oct 2021, 12:04

Dear Daniel, thanks for your help. My feeling is that it is random, since it happens for almost all categories of contracts. Basically, the employer is not obliged to report the end date. However, I need to know whether the person is working on a given date, which will then be used in descriptive and regression analyses. The point is that I am not sure whether using averages is a good approach in general to impute missing values, or whether I need to use multiple imputation methods. Your advice would be very helpful.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

10 Oct 2021, 13:04

I would not attempt to impute end dates.

You plan on creating an indicator variable for whether a person is working or not working on a given date. I would consider creating an indicator variable for whether a person is known to be working (within some start/end pair, or on the start date of a pair with a missing end date) or not known to be working on a given date.
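A minimal Stata sketch of such a "known to be working" indicator, assuming the id/start/end layout shown in #1. The reference date (1 February 2000) and the variable names `known` and `known_working` are illustrative assumptions, not part of the original data. Because it uses a local macro, it should be run as one do-file.

```stata
* Sketch only: a few rows in the layout of the -dataex- example in #1
clear
input float(id start end)
1 14575 14695
1 14656     .
2 14991     .
2 15856 15958
end
format %td start end

* Hypothetical reference date: 1 February 2000
local refdate = mdy(2, 1, 2000)

* Contract level: the reference date lies inside an observed start/end pair.
* Missing end dates sort as +infinity in Stata, so they are excluded explicitly.
generate byte known = start <= `refdate' & `refdate' <= end & !missing(end)

* ... or it is the start date of a contract whose end date is missing
replace known = 1 if missing(end) & start == `refdate'

* Person level: known to be working if any contract qualifies
bysort id: egen byte known_working = max(known)

list id start end known known_working, sepby(id)
```

A value of 0 here means "not known to be working", which is not the same as "known not to be working".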
Doris Rivera

Join Date: Feb 2020

Posts: 172
#5

10 Oct 2021, 13:41

Dear William, thanks for your answer. Can you elaborate a little on why you think it is not a good idea to impute the end dates?
If I understood correctly, you advise not filling in the missing values (and using only the information that I have in the dataset). But this would greatly underestimate the time people are working, or even whether they are working on a given date, specifically for those contracts that are not temporary.
daniel klein

Join Date: Mar 2014

Posts: 3860
#6

10 Oct 2021, 13:50

Originally posted by Doris Rivera

My feeling is that it is random, since it happens for almost all categories of contracts. Basically, the employer is not obliged to report the end date.

Whenever missing values occur as a result of a choice of reporting or not reporting, missing values are unlikely to occur (completely) at random. Whether that is a problem depends on the specific research questions.

Originally posted by Doris Rivera

However, I need to know whether the person is working on a given date

Well, there is no way you can know that. None of the imputation techniques mentioned (nor any other that I am aware of) aims at producing "correct" imputed values. All we can hope for is to recover the true associations between variables (with missing values), ideally accounting for the uncertainty introduced by the imputation (i.e., by guessing the underlying values that we have not observed).

It is hard to give more specific advice without knowing more about the research questions. In general, I tend to agree with William in that imputing dates is probably posing more problems than it solves. I might consider creating the intended indicator of whether a person is working or not, including missing values, then impute those missing values based on other covariates.

Edit: Crossed with #5.

Originally posted by Doris Rivera

why do you think it is not a good idea to impute the end dates?

It is not a good idea because it might be pretty hard to find an appropriate model that predicts a date. It might be easier to predict the duration, especially given that the starting date appears to be fully observed. However, the date itself appears to be a means to an end; you want to get at employment status. The latter is a binary variable, which might be easier to model/predict.
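As a small illustration of working with the duration rather than the date, the following sketch derives it from a few rows of the example in #1. The variable name `duration` is an assumption; since the start date is fully observed, any value later obtained for a missing duration maps back to an end date as start + duration.

```stata
* Sketch only: a few rows in the layout of the -dataex- example in #1
clear
input float(id start end)
1 14575 14695
1 14625 14727
1 14656     .
2 15856 15958
2 15887     .
end
format %td start end

* Duration in days; it is missing exactly where the end date is missing
generate duration = end - start

summarize duration
count if missing(duration)
```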

Last edited by daniel klein; 10 Oct 2021, 13:57.
daniel klein

Join Date: Mar 2014

Posts: 3860
#7

10 Oct 2021, 14:03

Originally posted by Doris Rivera

Specifically for those contracts that are not temporary.

Well, if a contract is permanent, then the missing end date might truly be missing. There is no end date in that case. Imputation methods assume that missing values "mask" the "true" but unobserved values. If there are no "true" values, then you should not try to impute them.
Doris Rivera

Join Date: Feb 2020

Posts: 172
#8

11 Oct 2021, 01:02

Dear Daniel, I think things are even harder than I thought. The primary objective is to do a descriptive analysis of whether the person is working on a certain date (let's say in the summer and in the winter). I will then study the characteristics of those jobs. So, as you see, unfortunately for me, I need to know whether a person is working, and for that I assume I need to know the end date. However, you mentioned that it would be easier to impute whether a person is working (a dummy variable). Correct me if I am wrong, but don't you need the end date for that? I mean, to know whether the person is working on, for instance, a given day in summer, I would need to know the duration of a contract that started, let's say, in January, for which I would need an estimate of the end date.
Can you tell me if there is any rule that clarifies which types of variables can be imputed and which cannot? Does it depend on the number of missing values you have in the dataset? (Mine is huge.)
Finally, I assume then that using the average as I mentioned earlier is even worse than using MI in this case, right?
Thanks a lot for your help with this!
daniel klein

Join Date: Mar 2014

Posts: 3860
#9

11 Oct 2021, 02:23

Originally posted by Doris Rivera

The primary objective is to do a descriptive analysis of whether the person is working on a certain date (let's say in the summer and in the winter).

So whether a person is working at a specific date is determined by whether or not the person has an ongoing working contract for that date? And, you can identify whether a contract is permanent? If this is so, then you "know" that persons with permanent contracts are working from the starting date of their contract. Missing values in the end date of permanent contracts are, then, irrelevant.

Originally posted by Doris Rivera

I will then study the characteristics of those jobs.

This step-wise logic is not going to work for you. Wanting to examine the characteristics of persons who work strongly implies that these characteristics differ from those of persons who do not work. Thus, these characteristics will predict (in a correlational sense) whether a person works or not. This is valuable information that should (I dare say must) be acknowledged when you impute missing values.

Originally posted by Doris Rivera

I mean, to know whether the person is working on, for instance, a given day in summer, I would need to know the duration of a contract that started, let's say, in January, for which I would need an estimate of the end date.

It seems to me that the end date of a contract is, for your purposes, essentially the very same as the working status of a person. If this holds true, I see little virtue in going for the end date instead of what you are after directly, namely the working status.

Originally posted by Doris Rivera

Can you tell me if there is any rule that clarifies which types of variables can be imputed and which cannot?

In which sense? Technically, you can probably impute any kind of missing values. Theoretically and generally speaking, missing values should not be imputed if there is no underlying "true" value. When there is an underlying "true" value that just happens to be not observed, you might consider imputing it. How to do that depends on many things; perhaps most importantly, whether you can make a reasonable case for the missing values being missing at random (conditional on the observed values and covariates).

Originally posted by Doris Rivera

Does it depend on the number of missing values you have in the dataset? (Mine is huge.)

Some people would say, you cannot impute missing values when the proportion of missing values is above X. I do not believe that such arbitrary cut-off values or rules of thumb are particularly useful. The question that you should be asking is: What is the best way to approximate the answer to my research question? Sometimes, the honest answer might be that giving up is better than producing potentially misleading results; but I believe these cases are rare.

Originally posted by Doris Rivera

Finally, I assume then that using the average as I mentioned earlier is even worse than using MI in this case, right?

Well, using the mean of one variable is a very simple model. On the topic of single vs multiple imputation: when you impute only one value, you are essentially ignoring any uncertainty that is associated with your guess of that value. Doing so will underestimate the variance and, hence, provide a false sense of accuracy.
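The variance point can be seen in a small simulation. This is only a sketch on made-up data: the normal distribution, the 50% missingness, and the variable names are assumptions chosen for illustration, not features of the dataset in #1.

```stata
* Sketch only: simulated durations, half of them set to missing at random
clear
set seed 1
set obs 1000
generate duration = rnormal(200, 50)
generate dur_obs  = duration
replace  dur_obs  = . if runiform() < 0.5

* Spread of the observed values
summarize dur_obs
scalar sd_obs = r(sd)

* Single (mean) imputation: every missing value gets the observed mean
generate dur_meanimp = cond(missing(dur_obs), r(mean), dur_obs)

* Spread after mean imputation
summarize dur_meanimp
scalar sd_imp = r(sd)

display "SD, observed only: " sd_obs "   SD, mean-imputed: " sd_imp
```

The mean-imputed variable has a smaller standard deviation than the observed values, because every filled-in value sits exactly at the mean and contributes no spread.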
Doris Rivera

Join Date: Feb 2020

Posts: 172
#10

11 Oct 2021, 03:53

Dear Daniel, after reading your answers I am aware of the difficulties of imputing the end date. Let's say, then, that what I can do is impute whether the person is working (even though I do not see the difference between imputing whether the person is working and imputing the end date, since in the dataset you only see people who have a contract and are therefore working). The point is how to know the time they are truly working.
Can you suggest how to proceed in this case with the data structure I showed above?
Is it possible to do any test to check the goodness of fit of the imputation?

Thanks for your help.
daniel klein

Join Date: Mar 2014

Posts: 3860
#11

11 Oct 2021, 04:26

Originally posted by Doris Rivera

even though I do not see the difference between imputing whether the person is working and imputing the end date

That is precisely my point. For your purposes, there does not seem to be a difference. The question is then, what and how to impute. To impute the working status, I would start with a simple logit (or probit) model as a natural choice. Which model do you use to impute a date? What is the measurement level of date? Is it continuous? Is it categorical? How are dates distributed? You need reasonable answers to such questions before imputing dates. I do not have them readily available. As pointed out, a duration might be simpler yet still looks more complicated than a binary variable to me.

Originally posted by Doris Rivera

since in the dataset you only see people who have a contract and are therefore working

I think there is a misunderstanding here. I suggest imputing whether a person is working or not at the specific date you are interested in. As stated above, you cannot first impute status, then go on with the selected cases; you need to incorporate all information in the imputation process. Thus, I suggest defining the date you are interested in first, then impute the missing working status at that date.
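A minimal sketch of imputing the working status at one fixed date with a logit imputation model. Everything about the data here is assumed for illustration: the simulated dataset has one row per person at the chosen date, `working` is 1/0 where the status is known and missing otherwise, and `female` and `age` are hypothetical covariates. A real application would need a much richer imputation model.

```stata
* Sketch only: simulated person-level data at one hypothetical reference date
clear
set seed 12345
set obs 500
generate byte female  = runiform() < 0.5
generate      age     = 20 + floor(40 * runiform())
generate byte working = runiform() < invlogit(-1 + 0.03 * age + 0.3 * female)

* Status unknown at the reference date for roughly 30% of persons
replace working = . if runiform() < 0.3

* Declare the mi structure and the roles of the variables
mi set wide
mi register imputed working
mi register regular female age

* Logit imputation model, 20 imputations
mi impute logit working female age, add(20) rseed(2021)

* Descriptive estimate (share working), combined across imputations
mi estimate: mean working
```

The standard error reported by `mi estimate` combines within- and between-imputation variability, which is how the uncertainty of the imputations enters the result.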

Originally posted by Doris Rivera

The point is how to know the time they are truly working.

As pointed out, there is no voodoo magic in imputation that can give you the certainty that you are looking for. You will have to make reasonable assumptions about the distribution of the true values that are not observed, given the observed values (and other covariates) and then draw values from this distribution.

Originally posted by Doris Rivera

Is it possible to do any test to check the goodness of fit of the imputation?

Not in the sense that you are probably hoping for. This is related to my answer above: you cannot know or test what you have not observed.
daniel klein

Join Date: Mar 2014

Posts: 3860
#12

11 Oct 2021, 06:22

Let me add a couple of brief answers to some of your initial questions.

Originally posted by Doris Rivera

2) If I use MI, should I account for the structure of the dataset, which is a kind of multilevel?

Yes. Even if you go for another imputation method you should probably exploit the fact that a person's employment history is probably a pretty good predictor for their employment at any given point in time.

Originally posted by Doris Rivera

3) Do the explanatory variables “always” need to be non-missing?

No. You can and should impute missing values in all variables simultaneously. You also can and should include non-missing variables as predictors.

Originally posted by Doris Rivera

4) The imputed information will be used first in a descriptive analysis and thereafter in a regression. Is it possible to tell Stata that the imputed date cannot be earlier than the start date of the contract?

As imputing the date might not be a good idea, this question might be more generally understood as: Can I put restrictions on the range of imputed values? The answer is: Yes, but it is not clear whether that might have negative side-effects. Remember: multiple imputation is not about getting the "correct" unobserved values on the unit-level but about getting unbiased associations between variables (and accounting for the uncertainty of the imputations).
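One way to keep imputed values inside a plausible range is predictive mean matching, which only ever fills in values borrowed from observed donors. The sketch below applies `mi impute pmm` to a simulated contract duration, so that start + duration can never fall before the start date. The data, the covariate `temporary`, and the choice of 5 nearest neighbours are assumptions for illustration; `mi impute truncreg` with its `ll()` option is another route. Whether such restrictions are harmless in a given analysis remains an open question, as noted above.

```stata
* Sketch only: simulated contracts with a fully observed start date
clear
set seed 2021
set obs 300
generate start = mdy(1, 1, 2000) + floor(3650 * runiform())
format %td start
generate byte temporary = runiform() < 0.6
generate duration = ceil(30 + 300 * runiform() + 200 * (1 - temporary))

* About half of the durations are unobserved
replace duration = . if runiform() < 0.5

mi set wide
mi register imputed duration
mi register regular start temporary

* PMM borrows observed durations from the 5 nearest donors,
* so every imputed duration is a value that was actually observed (> 0)
mi impute pmm duration temporary, add(5) knn(5) rseed(2021)

* Derived end date in every imputation
mi passive: generate end_imp = start + duration

* Stops with an error if any end date is not after its start date
mi xeq: assert end_imp > start
```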

Originally posted by Doris Rivera

5) Is it possible to enrich the MI model with variables that do not vary within person (e.g., gender)?

Yes. It is also advisable, especially if these variables predict missingness and/or outcome.

Originally posted by Doris Rivera

6) Is there a test or procedure to run after the imputation to check how good it is? I mean, something like setting some observed end dates to missing, imputing them, and then comparing the imputations with the real dates? And how would I do it?

Imputing missing values when you know the true values might sound like a reasonable robustness check. Unfortunately, this will at best tell you how well your imputation model deals with the mechanism that you have chosen to decide which observed values should be treated as missing and, thus, be imputed. It will not tell you anything about how well that model deals with the mechanism that led to the original missing values that are missing for reasons unknown to you.
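For completeness, such a masking exercise could be sketched as follows, on simulated data with hypothetical variable names. Here the analyst masks 30% of the observed durations completely at random, so the comparison only reflects performance under that self-chosen mechanism, which is exactly the limitation described above.

```stata
* Sketch only: simulated, fully observed durations
clear
set seed 99
set obs 400
generate byte temporary = runiform() < 0.6
generate duration = ceil(30 + 300 * runiform() + 200 * (1 - temporary))

* Keep the truth, then mask 30% of the values completely at random
generate true_duration = duration
generate byte masked   = runiform() < 0.3
replace duration = . if masked

mi set wide
mi register imputed duration
mi impute pmm duration temporary, add(5) knn(5) rseed(99)

* Compare the first imputation with the masked truth
* (in wide style, imputation 1 is stored in _1_duration)
generate diff1 = _1_duration - true_duration if masked
summarize diff1
```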

Last edited by daniel klein; 11 Oct 2021, 06:24.
Doris Rivera

Join Date: Feb 2020

Posts: 172
#13

11 Oct 2021, 11:43

Dear Daniel, thanks a lot for all of your answers. You are right, I misunderstood your suggestion of imputing whether the person is working (on a given date). Sorry, but all this is new and difficult for me. If I now understand correctly, what you suggest is to impute whether a person is working (a dichotomous variable), and then use that imputation to look up the contract (observation) in order to study the characteristics of that specific firm (which is my aim), right?
The idea is then to know whether the person is working on two dates of the year (one in summer and the other in winter) for all the years starting from the first contract. So you may find older people from old cohorts with a long working life, and others from more recent cohorts with only a few years or less than a year. My first question is whether I need to run the MI process separately for each date on which I want to know if a person is working, or whether I can instead automate the MI process to do the imputation for the two periods of the year (a day in summer and a day in winter) for all the years (e.g., first, second, third, ...). Notice that the first year of a person in the first cohort is not the same calendar year as the first year of a person in the last cohort.
Can you help me get started with this, or suggest a document to start with? Even though I think I understand the logic behind your suggestion, I do not know how to proceed in Stata (for instance, which would be my dependent variable, or how do I include the longitudinal structure of the dataset?).
Thanks again!
daniel klein

Join Date: Mar 2014

Posts: 3860
#14

11 Oct 2021, 12:16

Originally posted by Doris Rivera

and then use that imputation to look up the contract (observation) in order to study the characteristics of that specific firm (which is my aim), right?

So you are studying firms, not individuals? And, you have a sample of contracts? And, it appears to be somehow important that people (who hold those contracts?) were born in different cohorts? And, for whatever reason, you want to look at (the differences in?) summer and winter? Sorry, this seems quite complex, and it is hard to give specific advice for a complex scenario I have little information about. If you need more specific advice, you probably need to provide more specific information. Start with the specific research question(s) that you are trying to answer. Then describe the dataset that you are going to use to answer the research question(s). Then describe your analytic approach to answer your research question. In all of these descriptions, ignore the missing data. If we get a clear picture of what you want to achieve, and in which way you want to achieve it, we can think about how to best deal with missing values in your specific scenario. By we, I mean me or others (on Statalist or, preferably, on-site/your institution) as this might not be my area of expertise, and it seems to be taking much more time than I am willing or able to invest here.

I would recommend that you start reading through the manual entry [MI]. This FAQ might also help.

Last edited by daniel klein; 11 Oct 2021, 12:19.
Doris Rivera

Join Date: Feb 2020

Posts: 172
#15

11 Oct 2021, 14:58

Dear Daniel, I totally understand. Sorry for the mess. At least I now know what is and is not possible. In brief, the topic is from labor economics, and the dataset contains the working lives of people from different cohorts. I first want to study, in a descriptive analysis, the characteristics of those people and of the firms in which they are working on a specific day in summer and in winter. That is why I need to approximate whether they are working on those days. The research question is to study, in a descriptive analysis, possible differences in the types of firms and people depending on the time of year they are working. Let's say I want to know if a given person is working on a specific day in summer/winter; I need this for all the summers/winters in his/her working life. Imagine I see an individual who enters the dataset in January 1990: I would like to know if he/she is working on a given day in the summer and winter of 1990, 1991, 1992, ... Then I can look at his/her characteristics as well as those of the firm. However, as this dataset is not a panel, I cannot fix a specific date for everyone. What I can do is acknowledge that the first summer for someone from an old cohort (1990) cannot be the first summer for someone from a recent cohort (2010). This data structure can be seen from the very first contract of the 5 individuals I report in #1. Therefore, I can, for instance, build a graph of the percentage of people working in a large firm in their first summer, by cohort (where the first summer is 1990 for people in that cohort but 2010 for people in the 2010 cohort); I can also check the percentage of males or females...
I really hope this is clearer for you and others. I am very sorry for all the mess. Anyway, I think it is clear now that it is not advisable to impute the end date, but also that there might be another solution (even though I do not know how to proceed). I will start with the document you suggest, and I hope it helps me clarify how to proceed.
Thanks again.