Best practices for imputing missing longitudinal data: mi, mipolate, MLMM, or others?

Maria Sundaram

Join Date: Jun 2016

Posts: 45
#1

Best practices for imputing missing longitudinal data: mi, mipolate, MLMM, or others?

30 Aug 2024, 07:53

Dear Statalist,

I have a dataset containing information on SARS-CoV-2 gene concentration in wastewater over time. The data are sampled from the same wastewater treatment plant approximately twice each week from beginning 2022 to end of 2023. I am interested in 'filling in the gaps' for SARS-CoV-2 concentration (variables ln_pmmov_mm_dpcr and ln_n1n2_avg_mm_dpcr) on days when the wastewater was not sampled. I am hoping not to have to tell Stata up front which type of distribution these 'missing' values should come from, as there is substantial variability in the values over time. It would be instead reasonable to ask Stata to make a guess for the missing values based mostly on values close by in time. I am running StataNow/BE 18.5.

At first, I thought it might be nice to use multiple imputation with chained equations. I struggled with this approach: it seemed that the most reasonable way for me to implement this approach was to reshape my long dataset to wide. Since I had >600 observations for one wastewater treatment plant (i.e., one "individual"), a reshape to wide would have resulted in a massively wide dataset with one row. This seemed infeasible (and actually for me, running Stata BE, it is impossible to have that many variables in the dataset). However, I would love to be proven wrong about this, and it's reasonable I would be wrong about it as I am really unfamiliar with the -mi- suite of commands and procedures. I have some halfhearted code below that uses -flong- rather than a wide data structure. (I am sure it is not correct, and only runs when there is only one predictor in the model.)

The -mipolate- command (written by Nick Cox) seemed like a nice way around this "massively wide dataset" issue. However, it seems that -mipolate- requires an xvar predictor of yvar. I have several 'candidate' predictors (e.g., pH of the wastewater sample, flow rate at time of sampling, etc.), but understand that -mipolate- by itself is agnostic to panel structure. Would it be reasonable, then, to tell -mipolate- that the best predictor in my case is the sample collection date? In this case, I would want to use the -idw- option for -mipolate-, as long as I can somehow let -mipolate- know that the data are changing over time (and/or that time is an important factor in deciding what the interpolated value should be).

Finally, I have seen in other circumstances that sometimes people use multiple linear mixed models with random intercepts to solve problems such as these. I would prefer to try to avoid this if possible, because I think that would require me telling Stata that I believe all the missing values come from the same distribution (when I am not certain that they do). However, maybe someone more talented in statistics would be able to confirm whether that seems like a large or a trivial problem.

An example of my dataset is provided below:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int sample_collect_date double(ph_mm flow_rate_mm) float(ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr) 22651 7.67 33.65 19.72869 14.237268 22653 7.55 33.33 19.78993 14.2145 22654 7.69 33.82 19.72 13.95687 22655 7.9 34.15 19.96952 13.528894 22656 7.68 34.56 19.29398 13.24352 22657 7.61 34.4 18.551708 13.530492 22658 7.59 35.31 18.97065 12.838408 22660 7.56 32.83 19.286436 13.168484 22661 7.62 33.42 19.19116 13.094458 22662 7.74 34.69 19.715847 13.27106 22663 7.63 34.36 19.61545 12.855686 22664 7.62 33.84 19.49199 12.944575 22665 7.57 34.04 19.5535 14.668052 22667 7.54 34.13 19.59835 15.28725 22668 7.65 33.83 19.72 13.982515 22669 7.76 35.07 19.625214 13.568893 22671 7.83 35.55 19.58283 14.082418 22672 7.68 35.31 19.508703 13.549563 22674 7.69 34.58 19.48801 12.253338 22675 7.73 35.19 19.58358 12.79608 22676 7.73 35.58 19.719017 12.006645 22677 7.66 35.16 19.67184 12.659946 22678 7.85 35.3 19.634644 12.654342 22679 7.78 34.62 18.995495 15.99823 22681 7.7 34.26 19.49582 13.887273 22682 7.75 35.16 19.35253 12.862556 22683 7.87 35.34 19.612177 13.757033 22684 7.85 35.3 19.308737 12.948771 22685 7.72 34.98 19.44844 13.318457 22686 7.8 34.54 19.26173 12.133932 22688 7.69 33.94 19.381716 12.11571 22689 7.7 34.39 19.436636 12.346183 22690 7.82 35.16 19.55105 12.988695 22691 7.7 34.66 19.455856 11.99387 22692 7.66 34.51 19.613997 12.609937 22693 7.71 34.48 19.606083 13.275462 22695 7.72 33.82 20.38286 13.681842 22696 7.74 35.39 19.69235 12.759498 22697 7.71 34.68 20.02241 11.7966 22698 7.64 33.6 19.587824 12.673573 22699 7.79 34.46 19.885324 13.510506 22700 7.63 34.03 19.342875 13.118195 22702 7.69 33.97 19.707045 12.13221 22703 7.63 35.03 19.8844 11.11543 22704 7.79 35.62 19.67812 11.313498 22705 7.61 35.96 19.644926 11.739583 22706 7.64 35.75 19.87014 12.161912 22707 7.69 35.49 19.73376 12.771386 22709 7.55 37.03 19.654406 14.087673 22710 7.64 39.78 19.522354 13.001777 end format %td sample_collect_date

Here is the awful code I have been using to get multiply imputed values (I'm aware this is 100% not how I should do it, but not sure how to work around the wide dataset issue):

Code:

tsset sample_collect_date tsfill tsset, clear mi set flong mi tsset sample_collect_date cd "H:\myfilelocation" mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr mi imput chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr, add(10) rseed(7834131) preserve forval i = 1/10 { mi extract `i', clear save mi_dataset_`i', replace restore, preserve } forval i = 1/10 { use mi_dataset_`i', clear sort sample_collect_date gen id = _n save mi_dataset_`i', replace }

And here is what I had considered using for -mipolate- if it sounds reasonable to use sample_collect_date as the xvar predictor of yvar:

Code:

mipolate ln_n1n2_avg_mm sample_collect_date, gen(mip_n1n2_avg) idw(3)

I would love any thoughts the forum may be able to share on the best way to deal with my data structure. I'm aware this is sort of a combined coding/stats issue and I recognize that the answer may well be "do more homework on your own"--that would be reasonable, but if anyone has thoughts or opinions to share I would find them deeply useful as a sanity check.
Tags: None
Erik Ruzek

Join Date: Oct 2017

Posts: 429
#2

30 Aug 2024, 08:13

I am having a little trouble understanding your data. Can you help to connect these two sentences in particular?

The data are sampled from the same wastewater treatment plant approximately twice each week from beginning 2022 to end of 2023.

I take this to mean that all your samples come from one wasterwater treatment plant.

Since I had >600 observations for one wastewater treatment plant (i.e., one "individual"), a reshape to wide would have resulted in a massively wide dataset with one row.

Now it sounds as if you have samples from multiple wastewater plants, suggesting a panel or multilevel structure.
Comment
Maria Sundaram

Join Date: Jun 2016

Posts: 45
#3

30 Aug 2024, 09:02

Thanks Erik! I just have samples from one wastewater treatment plant--your first interpretation was correct.

My interpretation of the requirements of -mi- were that I would want one 'row' for each series. (In this case, I only have one series because I only have data from one treatment plant.) So my interpretation was that the data should look something like this:
id pcr_jan1_2022 pcr_jan2_2022 pcr_jan3_2022 pcr_jan4_2022 pcr_jan5_2022

1 13.5 14.0 15.2 . 13.2

Rather than:
date pcr id

01jan2022 13.5 1

02jan2022 14.0 1

03jan2022 15.2 1

04jan2022 . 1

05jan2022 13.2 1

But maybe I was wrong about that.

The documentation for -mi- seems to suggest that if I'm interested in using dates nearby in time to impute missing PCR values, then the data should be wide rather than long. But with almost 700 unique date observations, that's not a feasible way to go for me.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3850
#4

30 Aug 2024, 09:32

Originally posted by Maria Sundaram View Post

[...] on days when the wastewater was not sampled.

Why was the wastewater not sampled on those days? Is it reasonable to assume missing completely at random? If so, you might not have to impute at all.

Also, what is the ultimate goal? Which kind of analyses are you going for?

If you want to stick with mi chained, you might be able to leave the data in long format and include leads and lags of the variables with missing values in the respective conditional models.
2 likes
Comment
Maria Sundaram

Join Date: Jun 2016

Posts: 45
#5

30 Aug 2024, 09:43

Thanks for your reply Daniel!
Q1: Wastewater was not sampled on those days due to low resources. This might be one scenario where data is actually missing completely at random.

Q2: The ultimate goal is to compare SARS-CoV-2 identified in wastewater to percent positives identified in symptomatic individuals in the community. That data also of course isn't complete on every single day. The motivation for imputation/interpolation was to to try to increase the proportion of days where there is available information both for wastewater and for the percent positives for symptomatic individuals in the community. This ideally would help with modeling procedures in the future--less sparse data would have smaller 95% confidence intervals but could also allow for additional covariates in the model beyond what's possible currently.

To your final point, I'd be really interested in sticking with mi chained with data in long format, but I'm not sure of the best/most correct implementation of that, considering that I hope to use PCR values surrounding the missing value in calendar time as the main predictors. Would you be able to point me to a resource that could assist with that?
Comment
Erik Ruzek

Join Date: Oct 2017

Posts: 429
#6

30 Aug 2024, 09:57

That is helpful. If this were my data I would probably use mipolate, creating multple types of interpolated values. Then I would run my analysis on each of the different interpolated values to see if the type of interpolation changes anything about the results and conclusions. All that said, I am only assuming that the use of the date is appropriate for the x variable. It makes some sense if mipolate is using adjacent dates and their associated y values to help it make a prediction about the y value of the missing date. But I do not know enough about the procedure to be confident that is what it is doing. I know more about multiple imputation, but even then, your case is not one I've dealt with personally.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

30 Aug 2024, 10:09

This ideally would help with modeling procedures in the future--less sparse data would have smaller 95% confidence intervals but could also allow for additional covariates in the model beyond what's possible currently.

It would be more correct to say that imputing the missing values would create the illusion of smaller 95% confidence intervals and the illusion of allowing for additional covariates.

Whether single (mipolate) or multiple (mi) imputations are used, the process does not add any new information into the data set. The data created are not real values of the missing variables. In fact, mi by chained equations may well produce imputed values that are not even possible in the real world--yet they are still useful for the purpose of producing less biased coefficient estimates in regression procedures. But you cannot increase your statistical power by using any kind of imputation procedures. If you could, research would become so easy and cheap. For any study just get an N of 2. Then impute thousands or even millions or billions of other values and, voila!, run analyses with 95% CIs too small to print! Yes, the output of a regression done with such a data set will tell you that it has that kind of power, but that is because its calculations of standard errors and other inferential statistics assumes that all the observations are independent (or clusters of them are in the case of multi-level analyses), whereas the imputed observations are mashups that are completely dependent on some or all of the rest of the observations and no correction for that has been applied.

Added: To be clear, I am not arguing against doing imputation. There may be other good reasons for doing so. (Although I strongly suspect, as it seems daniel klein does, that these data are MCAR, which really makes imputation unnecessary.) I'm just saying you should not delude yourself into thinking that augmenting the data in this way will give you any kind of statistical power advantage--it won't.

Last edited by Clyde Schechter; 30 Aug 2024, 10:16.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3850
#8

30 Aug 2024, 10:36

Originally posted by Clyde Schechter View Post

It would be more correct to say that imputing the missing values would create the illusion of smaller 95% confidence intervals and the illusion of allowing for additional covariates.

Whether single (mipolate) or multiple (mi) imputations are used, the process does not add any new information into the data set.
[...]
you cannot increase your statistical power by using any kind of imputation procedures.

While there might be some truth to that statement, I think it is an unjustified generalization and I don't agree with the conclusion.

First, there is a difference between single imputation methods, such as interpolation, and multiple imputation. Single imputation methods generally do not account for the uncertainty associated with the imputed values. Therefore, these methods tend to result in overly liberal tests and too narrow CIs. Multiple imputation, on the other hand, was specifically designed to get the inference right, meaning that tests won't be liberal, and CIs will be close to nominal coverage.

Second, while imputations do not add new information in the sense of sampling additional (independent) units, they do recover (the influence of) unobserved values, provided the MAR assumptions are met. The potential gains in efficiency and power stem from the fact that the non-missing information of all units is included in the analysis modes while the respective information would be lost if those units [edit: the units with missing values on any variable] were deleted in a casewise (listwise) manner.

Last edited by daniel klein; 30 Aug 2024, 10:51.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

30 Aug 2024, 11:06

Multiple imputation, on the other hand, was specifically designed to get the inference right, meaning that tests won't be liberal, and CIs will be close to nominal coverage.

Yes, but this is true when the multiple imputation is carried out on the full data that is used for the analysis.

If I understand #1 & #6 correctly, O.P. wants to build a data set of SARS-CoV2 concentrations in water samples and wants to fill in the missing dates by some imputation scheme. Then that later will be used in conjunction with other data sets to analyze associations with other variables of interest. The imputation, multiple or otherwise, will use only the available SARS-CoV2 concentration measurements and their dates of acquisition, and perhaps some other attributes of the water sample itself (e.g. pH). It will not take into account the data on the other variables in other data sets to be dealt with subsequently. So, I think that such analyses will not fully benefit from the way that multiple imputation calculates the inferential statistics. If she plans to do the multiple imputations in the fully-merged data to be used in the later analyses, then, that is a different procedure, to which my concern would not apply.
Comment
Maria Sundaram

Join Date: Jun 2016

Posts: 45
#10

30 Aug 2024, 11:38

Thanks so much, all. Clyde, I had a sneaking suspicion even as I was writing the reasons for wanting to do multiple imputation that someone would come along and say "no, that's not what that does". I appreciate you setting me straight

Basically the main reason I am interested in this is to avoid having to throw away data collected from the community in the subsequent analysis (the "substantive analysis model" referred to in this publication: https://arxiv.org/pdf/2404.06967). So to be clear, I'm not expecting that this would result in truly smaller 95% CIs or better statistical power in the wastewater data alone. I just want to make use of all the community-level data I have.

For example, if wastewater data has values for January 1, 2, and 5, and I would like to compare those values to infections observed in the community on January 2, 4, and 7, then even though I have 6 data points all in the same week, I only have one data point where wastewater and infections in the community are observed on the same day. There's obviously multiple ways to handle this, including binning/averaging values by week or month, but part of the original motivation for doing the analysis is to make use of the granularity of data that we currently have. So I was thinking that if I could impute wastewater values for January 4 and 7, then I could make use of the daily data observed on January 4 and 7 in the community.

Please do set me straight again if this is an incorrect interpretation of what multiple imputation can do!
Comment
daniel klein

Join Date: Mar 2014

Posts: 3850
#11

30 Aug 2024, 12:09

Originally posted by Clyde Schechter View Post

If she plans to do the multiple imputations in the fully-merged data to be used in the later analyses, then, that is a different procedure, to which my concern would not apply.

That's what I had in mind when characterizing your earlier post an "unjustified generalization".

Originally posted by Clyde Schechter View Post

If I understand #1 & #6 correctly, O.P. wants to build a data set of SARS-CoV2 concentrations in water samples and wants to fill in the missing dates by some imputation scheme. Then that later will be used in conjunction with other data sets to analyze associations with other variables of interest. The imputation, multiple or otherwise, will […] not take into account the data on the other variables in other data sets […] So, I think that such analyses will not fully benefit from the […] multiple imputation […]

Actually, it’s worse. When the correlations between variables across datasets are omitted from the imputation model, the respective point estimates from the substantive models will be biased towards zero. Using such a strategy will not even get the point estimates right, making inference pretty much pointless.

Last edited by daniel klein; 30 Aug 2024, 12:12.
Comment
Maria Sundaram

Join Date: Jun 2016

Posts: 45
#12

30 Aug 2024, 12:31

Thanks Daniel!

In your second point, you are suggesting that I ought to include the community-based infection values in the imputation step--do I have that correctly? If so, I'm a bit confused, as the motivation behind the overall analysis is to identify/quantify the relationship between wastewater-based PCR values and community-based infection. Including community-based infection values at the step of imputation only to then turn around and do it again in the subsequent substantive analysis seems... duplicative to me?

Again, the usual caveat of my lack of expertise applies here. I'm (obviously) not an expert on any of these methods so I really appreciate everyone's insight here!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#13

30 Aug 2024, 12:32

I should comment on interpolation.

The leading application of interpolation is when working with deterministic functions known in advance to be very smooth. Indeed I can still remember using printed tables to interpolate -- to go a little beyond what was tabulated -- for say logarithmic and trigonometric functions. Such use became pointless by say the 1970s when almost anyone who wanted to do this had access to good quality electronic calculators, to say nothing of computers, in which such functions could be calculated as accurately as you wish.

There are statistical applications of interpolation, as for example interpolation within quantile functions which in principle are monotonic.

But interpolation of noisy time series while perfectly computable with commands such as mipolate from SSC, seems unlikely to be a good idea.

It can't estimate irregular fluctuations it doesn't know about.

Inferential properties of any model fit are much in doubt: You don't really have as many degrees of freedom as you think and fits are unlikely to be as good as they may appear.

Gaussian process regression might be a better deal, but I don't know much about it or of Stata applications.
1 like
Comment

daniel klein

Join Date: Mar 2014
Posts: 3850

#14

30 Aug 2024, 13:21

Originally posted by Maria Sundaram View Post

In your second point, you are suggesting that I ought to include the community-based infection values in the imputation step--do I have that correctly?

Yes. Generally, the imputation model must include (at least) all variables used in the substantive models.

Originally posted by Maria Sundaram View Post

If so, I'm a bit confused, as the motivation behind the overall analysis is to identify/quantify the relationship between wastewater-based PCR values and community-based infection. Including community-based infection values at the step of imputation only to then turn around and do it again in the subsequent substantive analysis seems... duplicative to me?

Think of it this way: Suppose we have fully observed data on both water-based PCR and community-based infection. Suppose further, there is a correlation between water-based PCR and community-based infection. Now, suppose, we deleted some values in water-based PCR, either completely at random or depending on the values of community-based infection. If we were now to impute the deleted (missing) values in water-based PCR independently from the observed values of community-based infection, the resulting imputed values in water-based PCR would be uncorrelated with observed values in community-based infection. As a consequence, our substantive analyses would underestimate the true correlation.

Here is a quick (imperfect) illustration

Code:

. version 18

.
. // setup
. clear

. set obs 100
Number of observations (_N) was 0, now 100.

. set seed 42

. generate infection = rnormal()

. generate pcr = infection + rnormal()

.
. // true association
. regress pcr infection

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(1, 98)        =    130.96
       Model |  107.101168         1  107.101168   Prob > F        =    0.0000
    Residual |  80.1482791        98  .817839583   R-squared       =    0.5720
-------------+----------------------------------   Adj R-squared   =    0.5676
       Total |  187.249447        99  1.89140855   Root MSE        =    .90434

------------------------------------------------------------------------------
         pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   infection |   1.051622    .091896    11.44   0.000     .8692574    1.233987
       _cons |   .0526003   .0913436     0.58   0.566     -.128668    .2338686
------------------------------------------------------------------------------

.
. // create MAR
. replace pcr = . if infection < 0 & runiform() < .5
(23 real changes made, 23 to missing)

.
. // linear model is still unbiased in this case (missings depend linearly on X only)
. // this does not seem to be well known
. // anyway, the CIs are naturally wider
. regress pcr infection

      Source |       SS           df       MS      Number of obs   =        77
-------------+----------------------------------   F(1, 75)        =     79.23
       Model |  74.2717152         1  74.2717152   Prob > F        =    0.0000
    Residual |  70.3085528        75  .937447371   R-squared       =    0.5137
-------------+----------------------------------   Adj R-squared   =    0.5072
       Total |  144.580268        76  1.90237195   Root MSE        =    .96822

------------------------------------------------------------------------------
         pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   infection |   1.059513   .1190332     8.90   0.000     .8223871     1.29664
       _cons |   .0269195   .1205366     0.22   0.824    -.2132018    .2670408
------------------------------------------------------------------------------

.
. // impute ignoring association
. mi set flong

. mi register imputed pcr
(23 m=0 obs now marked as incomplete)

. mi impute regress pcr , add(5)

(output omitted)
.
. // estimate is biased towards zero
. mi estimate : regress pcr infection

Multiple-imputation estimates                   Imputations       =          5
Linear regression                               Number of obs     =        100
                                                Average RVI       =     0.7345
                                                Largest FMI       =     0.5136
                                                Complete DF       =         98
DF adjustment:   Small sample                   DF:     min       =      13.89
                                                        avg       =      14.16
                                                        max       =      14.42
Model F test:       Equal FMI                   F(   1,   14.4)   =      22.58
Within VCE type:          OLS                   Prob > F          =     0.0003

------------------------------------------------------------------------------
         pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   infection |   .7750225   .1631108     4.75   0.000     .4261293    1.123916
       _cons |   .3002586   .1636124     1.84   0.088     -.050905    .6514221
------------------------------------------------------------------------------

.
. // clear imputation
. mi extract 0 , clear

.
. // impute; this time account for association
. mi set flong

. mi register imputed pcr
(23 m=0 obs now marked as incomplete)

. mi impute regress pcr = infection , add(5)

(output omitted)

.
. // recover true association
. // CIs are a little too wide because FMI requires approx. 17 imputed datasets and we only have 5
. mi estimate : regress pcr infection

Multiple-imputation estimates                   Imputations       =          5
Linear regression                               Number of obs     =        100
                                                Average RVI       =     0.1487
                                                Largest FMI       =     0.1666
                                                Complete DF       =         98
DF adjustment:   Small sample                   DF:     min       =      54.88
                                                        avg       =      58.14
                                                        max       =      61.40
Model F test:       Equal FMI                   F(   1,   61.4)   =     108.28
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
         pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   infection |   1.083989   .1041729    10.41   0.000     .8757095    1.292268
       _cons |   .0040173   .1049229     0.04   0.970    -.2062632    .2142978
------------------------------------------------------------------------------

Last edited by daniel klein; 30 Aug 2024, 13:28. Reason: added "imperfect" to describe the illustrating example; bivariate regression, no leads or lags of pcr in imputation model, etc.

Comment

Maria Sundaram

Join Date: Jun 2016

Posts: 45
#15

08 Sep 2024, 16:19

(edited to add: Thank you so much Nick for your thoughts on interpolation--they are super helpful!)

Thanks Daniel (and my apology for the late reply on this). I really appreciate this and it will definitely help me improve my MI implementation. Thank you!

I think I may still be implementing the MI situation incorrectly, though, because I wound up with an 'insufficient observations' error after MICE and "mi estimate: xtreg". Perhaps it's the issue that Nick was talking about, where I don't have the precision or the observations that I 'think' I do. But here's a sample of what that code looks like--if I'm going wrong and it's obvious, could you let me know?

Code:

frame change mice_panel gen sample_collect_date = date(SampleCollectDate, "MDY") format sample_collect_date %td tsset sample_collect_date tsfill //this allows us to have other days in "between" the observed days tsset, clear mi set flong mi xtset sample_collect_date drop if sample_collect_date < td(06Jan2022) missings report mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr pctpos_mm mi imput chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr pctpos_mm, add(40) rseed(7834131) mi estimate: xtreg pctpos_mm ln_n1n2_avg_mm_dpcr tests_perpop_mm tests_mm

The 'estimate' statement yields the following error:

insufficient observations
an error occurred when mi estimate executed xtreg on m=1
r(2001);

Looking at the data, it seems that there is indeed only one observation for every 7 days (i.e., the 'mi estimate: xtreg' statement doesn't seem to be making use of imputed values). But maybe this just boils down to another misconception I had about what MI is capable of.

Last edited by Maria Sundaram; 08 Sep 2024, 16:24.
Comment

id	pcr_jan1_2022	pcr_jan2_2022	pcr_jan3_2022	pcr_jan4_2022	pcr_jan5_2022
1	13.5	14.0	15.2	.	13.2

date	pcr	id
01jan2022	13.5	1
02jan2022	14.0	1
03jan2022	15.2	1
04jan2022	.	1
05jan2022	13.2	1

Announcement