  • Best practices for imputing missing longitudinal data: mi, mipolate, MLMM, or others?

    Dear Statalist,

    I have a dataset containing information on SARS-CoV-2 gene concentration in wastewater over time. The data are sampled from the same wastewater treatment plant approximately twice each week, from the beginning of 2022 to the end of 2023. I am interested in 'filling in the gaps' for SARS-CoV-2 concentration (variables ln_pmmov_mm_dpcr and ln_n1n2_avg_mm_dpcr) on days when the wastewater was not sampled. I am hoping not to have to tell Stata up front which type of distribution these 'missing' values should come from, as there is substantial variability in the values over time. It would instead be reasonable to ask Stata to make a guess for the missing values based mostly on values close by in time. I am running StataNow/BE 18.5.

    At first, I thought it might be nice to use multiple imputation with chained equations. I struggled with this approach: it seemed that the most reasonable way to implement it was to reshape my long dataset to wide. Since I had >600 observations for one wastewater treatment plant (i.e., one "individual"), a reshape to wide would have resulted in a massively wide dataset with one row. This seemed infeasible (and, for me, running Stata BE, it is actually impossible to have that many variables in the dataset). However, I would love to be proven wrong about this, and it's quite possible I am wrong, as I am really unfamiliar with the -mi- suite of commands and procedures. I have some halfhearted code below that uses -flong- rather than a wide data structure. (I am sure it is not correct, and it only runs when there is only one predictor in the model.)

    The -mipolate- command (written by Nick Cox) seemed like a nice way around this "massively wide dataset" issue. However, it seems that -mipolate- requires an xvar predictor of yvar. I have several 'candidate' predictors (e.g., pH of the wastewater sample, flow rate at time of sampling, etc.), but I understand that -mipolate- by itself is agnostic to panel structure. Would it be reasonable, then, to tell -mipolate- that the best predictor in my case is the sample collection date? In that case, I would want to use the -idw- option for -mipolate-, as long as I can somehow let -mipolate- know that the data are changing over time (and/or that time is an important factor in deciding what the interpolated value should be).

    Finally, I have seen in other circumstances that sometimes people use multiple linear mixed models with random intercepts to solve problems such as these. I would prefer to try to avoid this if possible, because I think that would require me telling Stata that I believe all the missing values come from the same distribution (when I am not certain that they do). However, maybe someone more talented in statistics would be able to confirm whether that seems like a large or a trivial problem.

    An example of my dataset is provided below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int sample_collect_date double(ph_mm flow_rate_mm) float(ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr)
    22651 7.67 33.65  19.72869 14.237268
    22653 7.55 33.33  19.78993   14.2145
    22654 7.69 33.82     19.72  13.95687
    22655  7.9 34.15  19.96952 13.528894
    22656 7.68 34.56  19.29398  13.24352
    22657 7.61  34.4 18.551708 13.530492
    22658 7.59 35.31  18.97065 12.838408
    22660 7.56 32.83 19.286436 13.168484
    22661 7.62 33.42  19.19116 13.094458
    22662 7.74 34.69 19.715847  13.27106
    22663 7.63 34.36  19.61545 12.855686
    22664 7.62 33.84  19.49199 12.944575
    22665 7.57 34.04   19.5535 14.668052
    22667 7.54 34.13  19.59835  15.28725
    22668 7.65 33.83     19.72 13.982515
    22669 7.76 35.07 19.625214 13.568893
    22671 7.83 35.55  19.58283 14.082418
    22672 7.68 35.31 19.508703 13.549563
    22674 7.69 34.58  19.48801 12.253338
    22675 7.73 35.19  19.58358  12.79608
    22676 7.73 35.58 19.719017 12.006645
    22677 7.66 35.16  19.67184 12.659946
    22678 7.85  35.3 19.634644 12.654342
    22679 7.78 34.62 18.995495  15.99823
    22681  7.7 34.26  19.49582 13.887273
    22682 7.75 35.16  19.35253 12.862556
    22683 7.87 35.34 19.612177 13.757033
    22684 7.85  35.3 19.308737 12.948771
    22685 7.72 34.98  19.44844 13.318457
    22686  7.8 34.54  19.26173 12.133932
    22688 7.69 33.94 19.381716  12.11571
    22689  7.7 34.39 19.436636 12.346183
    22690 7.82 35.16  19.55105 12.988695
    22691  7.7 34.66 19.455856  11.99387
    22692 7.66 34.51 19.613997 12.609937
    22693 7.71 34.48 19.606083 13.275462
    22695 7.72 33.82  20.38286 13.681842
    22696 7.74 35.39  19.69235 12.759498
    22697 7.71 34.68  20.02241   11.7966
    22698 7.64  33.6 19.587824 12.673573
    22699 7.79 34.46 19.885324 13.510506
    22700 7.63 34.03 19.342875 13.118195
    22702 7.69 33.97 19.707045  12.13221
    22703 7.63 35.03   19.8844  11.11543
    22704 7.79 35.62  19.67812 11.313498
    22705 7.61 35.96 19.644926 11.739583
    22706 7.64 35.75  19.87014 12.161912
    22707 7.69 35.49  19.73376 12.771386
    22709 7.55 37.03 19.654406 14.087673
    22710 7.64 39.78 19.522354 13.001777
    end
    format %td sample_collect_date


    Here is the awful code I have been using to get multiply imputed values (I'm aware this is 100% not how I should do it, but I'm not sure how to work around the wide dataset issue):
    Code:
    tsset sample_collect_date
    tsfill                              // add rows for the unsampled days
    tsset, clear
    mi set flong
    mi tsset sample_collect_date
    cd "H:\myfilelocation"
    mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr
    mi impute chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr, add(10) rseed(7834131)
    preserve
    forval i = 1/10 {
        mi extract `i', clear           // pull out imputation `i' as a regular dataset
        save mi_dataset_`i', replace
        restore, preserve               // bring back the full mi dataset for the next pass
    }
    forval i = 1/10 {
        use mi_dataset_`i', clear
        sort sample_collect_date
        gen id = _n
        save mi_dataset_`i', replace
    }

    And here is what I had considered using for -mipolate- if it sounds reasonable to use sample_collect_date as the xvar predictor of yvar:
    Code:
    mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(mip_n1n2_avg) idw(3)
    I would love any thoughts the forum may be able to share on the best way to deal with my data structure. I'm aware this is sort of a combined coding/stats issue and I recognize that the answer may well be "do more homework on your own"--that would be reasonable, but if anyone has thoughts or opinions to share I would find them deeply useful as a sanity check.

  • #2
    I am having a little trouble understanding your data. Can you help to connect these two sentences in particular?
    The data are sampled from the same wastewater treatment plant approximately twice each week from beginning 2022 to end of 2023.
    I take this to mean that all your samples come from one wastewater treatment plant.
    Since I had >600 observations for one wastewater treatment plant (i.e., one "individual"), a reshape to wide would have resulted in a massively wide dataset with one row.
    Now it sounds as if you have samples from multiple wastewater plants, suggesting a panel or multilevel structure.



    • #3
      Thanks Erik! I just have samples from one wastewater treatment plant--your first interpretation was correct.

      My interpretation of the requirements of -mi- was that I would want one 'row' for each series. (In this case, I only have one series because I only have data from one treatment plant.) So my interpretation was that the data should look something like this:
      id   pcr_jan1_2022   pcr_jan2_2022   pcr_jan3_2022   pcr_jan4_2022   pcr_jan5_2022
      1    13.5            14.0            15.2            .               13.2

      Rather than:
      date        pcr    id
      01jan2022   13.5   1
      02jan2022   14.0   1
      03jan2022   15.2   1
      04jan2022   .      1
      05jan2022   13.2   1
      But maybe I was wrong about that.

      The documentation for -mi- seems to suggest that if I'm interested in using dates nearby in time to impute missing PCR values, then the data should be wide rather than long. But with almost 700 unique date observations, that's not a feasible way to go for me.
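
      For reference, here is roughly the reshape I had pictured (untested sketch; with my four measured variables and nearly 700 dates it would need far more variables than StataNow/BE allows):
      Code:
      generate long id = 1
      generate str8 dstr = string(sample_collect_date, "%tdCYND")   // e.g. "20220106"
      keep id dstr ph_mm flow_rate_mm ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr
      reshape wide ph_mm flow_rate_mm ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr, i(id) j(dstr) string
      * result: one row, with one column per variable per date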



      • #4
        Originally posted by Maria Sundaram View Post
        [...] on days when the wastewater was not sampled.
        Why was the wastewater not sampled on those days? Is it reasonable to assume missing completely at random? If so, you might not have to impute at all.

        Also, what is the ultimate goal? Which kind of analyses are you going for?


        If you want to stick with mi chained, you might be able to leave the data in long format and include leads and lags of the variables with missing values in the respective conditional models.
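
        Untested sketch of the kind of setup I have in mind (variable names taken from your dataex; registering the lead/lag copies as incomplete variables themselves is only one crude way of handling them):
        Code:
        tsset sample_collect_date
        tsfill                                       // create rows for the unsampled days
        generate lag_n1n2  = L.ln_n1n2_avg_mm_dpcr   // value from the previous day
        generate lead_n1n2 = F.ln_n1n2_avg_mm_dpcr   // value from the next day
        tsset, clear
        mi set flong
        mi register imputed ln_n1n2_avg_mm_dpcr lag_n1n2 lead_n1n2
        mi impute chained (regress) ln_n1n2_avg_mm_dpcr lag_n1n2 lead_n1n2, add(10) rseed(7834131)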



        • #5
          Thanks for your reply Daniel!
          • Q1: Wastewater was not sampled on those days due to low resources. This might be one scenario where data is actually missing completely at random.
          • Q2: The ultimate goal is to compare SARS-CoV-2 identified in wastewater to percent positives identified in symptomatic individuals in the community. That data also, of course, isn't complete on every single day. The motivation for imputation/interpolation was to try to increase the proportion of days where there is available information both for wastewater and for the percent positives for symptomatic individuals in the community. This ideally would help with modeling procedures in the future--less sparse data would have smaller 95% confidence intervals but could also allow for additional covariates in the model beyond what's possible currently.
          To your final point, I'd be really interested in sticking with mi chained with data in long format, but I'm not sure of the best/most correct implementation of that, considering that I hope to use PCR values surrounding the missing value in calendar time as the main predictors. Would you be able to point me to a resource that could assist with that?



          • #6
            That is helpful. If this were my data I would probably use mipolate, creating multiple types of interpolated values (roughly as sketched below). Then I would run my analysis on each of the different interpolated values to see whether the type of interpolation changes anything about the results and conclusions. All that said, I am only assuming that the use of the date is appropriate for the x variable. It makes some sense if mipolate is using adjacent dates and their associated y values to help it make a prediction about the y value of the missing date. But I do not know enough about the procedure to be confident that is what it is doing. I know more about multiple imputation, but even then, your case is not one I've dealt with personally.
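
            For concreteness, something along these lines is what I mean (untested; this assumes mipolate from SSC is installed, that the date is a sensible x variable, and that -tsfill- has been run so the unsampled days exist as observations; the new variable names are just placeholders):
            Code:
            tsset sample_collect_date
            tsfill
            mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(n1n2_linear)        // default linear interpolation
            mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(n1n2_idw3) idw(3)   // inverse-distance weighting
            mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(n1n2_pchip) pchip   // piecewise cubic Hermite
            * then run the substantive analysis once per interpolated version and compare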



            • #7
              This ideally would help with modeling procedures in the future--less sparse data would have smaller 95% confidence intervals but could also allow for additional covariates in the model beyond what's possible currently.
              It would be more correct to say that imputing the missing values would create the illusion of smaller 95% confidence intervals and the illusion of allowing for additional covariates.

              Whether single (mipolate) or multiple (mi) imputations are used, the process does not add any new information into the data set. The data created are not real values of the missing variables. In fact, mi by chained equations may well produce imputed values that are not even possible in the real world--yet they are still useful for the purpose of producing less biased coefficient estimates in regression procedures. But you cannot increase your statistical power by using any kind of imputation procedure. If you could, research would become so easy and cheap: for any study, just get an N of 2, then impute thousands or even millions or billions of other values and, voila!, run analyses with 95% CIs too small to print! Yes, the output of a regression done with such a data set will tell you that it has that kind of power, but that is because its calculations of standard errors and other inferential statistics assume that all the observations are independent (or that clusters of them are, in the case of multi-level analyses), whereas the imputed observations are mashups that are completely dependent on some or all of the rest of the observations, and no correction for that has been applied.

              Added: To be clear, I am not arguing against doing imputation. There may be other good reasons for doing so. (Although I strongly suspect, as it seems daniel klein does, that these data are MCAR, which really makes imputation unnecessary.) I'm just saying you should not delude yourself into thinking that augmenting the data in this way will give you any kind of statistical power advantage--it won't.
              Last edited by Clyde Schechter; 30 Aug 2024, 10:16.



              • #8
                Originally posted by Clyde Schechter View Post
                It would be more correct to say that imputing the missing values would create the illusion of smaller 95% confidence intervals and the illusion of allowing for additional covariates.

                Whether single (mipolate) or multiple (mi) imputations are used, the process does not add any new information into the data set.
                [...]
                you cannot increase your statistical power by using any kind of imputation procedures.
                While there might be some truth to that statement, I think it is an unjustified generalization and I don't agree with the conclusion.

                First, there is a difference between single imputation methods, such as interpolation, and multiple imputation. Single imputation methods generally do not account for the uncertainty associated with the imputed values. Therefore, these methods tend to result in overly liberal tests and too narrow CIs. Multiple imputation, on the other hand, was specifically designed to get the inference right, meaning that tests won't be liberal, and CIs will be close to nominal coverage.

                Second, while imputations do not add new information in the sense of sampling additional (independent) units, they do recover (the influence of) unobserved values, provided the MAR assumptions are met. The potential gains in efficiency and power stem from the fact that the non-missing information of all units is included in the analysis model, whereas the respective information would be lost if those units [edit: the units with missing values on any variable] were deleted in a casewise (listwise) manner.
                Last edited by daniel klein; 30 Aug 2024, 10:51.



                • #9
                  Multiple imputation, on the other hand, was specifically designed to get the inference right, meaning that tests won't be liberal, and CIs will be close to nominal coverage.
                  Yes, but this is true when the multiple imputation is carried out on the full data that is used for the analysis.

                  If I understand #1 & #6 correctly, O.P. wants to build a data set of SARS-CoV-2 concentrations in water samples and wants to fill in the missing dates by some imputation scheme. Then that later will be used in conjunction with other data sets to analyze associations with other variables of interest. The imputation, multiple or otherwise, will use only the available SARS-CoV-2 concentration measurements and their dates of acquisition, and perhaps some other attributes of the water sample itself (e.g. pH). It will not take into account the data on the other variables in other data sets to be dealt with subsequently. So, I think that such analyses will not fully benefit from the way that multiple imputation calculates the inferential statistics. If she plans to do the multiple imputations in the fully-merged data to be used in the later analyses, then that is a different procedure, to which my concern would not apply.



                  • #10
                    Thanks so much, all. Clyde, I had a sneaking suspicion even as I was writing the reasons for wanting to do multiple imputation that someone would come along and say "no, that's not what that does". I appreciate you setting me straight.

                    Basically the main reason I am interested in this is to avoid having to throw away data collected from the community in the subsequent analysis (the "substantive analysis model" referred to in this publication: https://arxiv.org/pdf/2404.06967). So to be clear, I'm not expecting that this would result in truly smaller 95% CIs or better statistical power in the wastewater data alone. I just want to make use of all the community-level data I have.

                    For example, if wastewater data has values for January 1, 2, and 5, and I would like to compare those values to infections observed in the community on January 2, 4, and 7, then even though I have 6 data points all in the same week, I only have one data point where wastewater and infections in the community are observed on the same day. There are obviously multiple ways to handle this, including binning/averaging values by week or month, but part of the original motivation for doing the analysis is to make use of the granularity of data that we currently have. So I was thinking that if I could impute wastewater values for January 4 and 7, then I could make use of the daily data observed on January 4 and 7 in the community.
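
                    To make that mismatch concrete, here is a toy version with made-up dates and values (purely hypothetical numbers, just to show which days line up):
                    Code:
                    * wastewater measured on Jan 1, 2, and 5 (made-up values)
                    clear
                    input str9 sdate float ww
                    "01jan2022" 19.7
                    "02jan2022" 19.8
                    "05jan2022" 19.5
                    end
                    generate date = date(sdate, "DMY")
                    format %td date
                    drop sdate
                    tempfile ww
                    save `ww'

                    * community percent positive observed on Jan 2, 4, and 7 (made-up values)
                    clear
                    input str9 sdate float pctpos
                    "02jan2022" 4.1
                    "04jan2022" 5.3
                    "07jan2022" 6.0
                    end
                    generate date = date(sdate, "DMY")
                    format %td date
                    drop sdate
                    merge 1:1 date using `ww'
                    * only 02jan2022 has both series; without imputation the other four daily values go unused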

                    Please do set me straight again if this is an incorrect interpretation of what multiple imputation can do!



                    • #11
                      Originally posted by Clyde Schechter View Post
                      If she plans to do the multiple imputations in the fully-merged data to be used in the later analyses, then, that is a different procedure, to which my concern would not apply.
                      That's what I had in mind when characterizing your earlier post as an "unjustified generalization".


                      Originally posted by Clyde Schechter View Post
                      If I understand #1 & #6 correctly, O.P. wants to build a data set of SARS-CoV-2 concentrations in water samples and wants to fill in the missing dates by some imputation scheme. Then that later will be used in conjunction with other data sets to analyze associations with other variables of interest. The imputation, multiple or otherwise, will […] not take into account the data on the other variables in other data sets […] So, I think that such analyses will not fully benefit from the […] multiple imputation […]
                      Actually, it’s worse. When the correlations between variables across datasets are omitted from the imputation model, the respective point estimates from the substantive models will be biased towards zero. Using such a strategy will not even get the point estimates right, making inference pretty much pointless.
                      Last edited by daniel klein; 30 Aug 2024, 12:12.



                      • #12
                        Thanks Daniel!

                        In your second point, you are suggesting that I ought to include the community-based infection values in the imputation step--do I have that correctly? If so, I'm a bit confused, as the motivation behind the overall analysis is to identify/quantify the relationship between wastewater-based PCR values and community-based infection. Including community-based infection values at the step of imputation only to then turn around and do it again in the subsequent substantive analysis seems... duplicative to me?

                        Again, the usual caveat of my lack of expertise applies here. I'm (obviously) not an expert on any of these methods so I really appreciate everyone's insight here!



                        • #13
                          I should comment on interpolation.

                          The leading application of interpolation is when working with deterministic functions known in advance to be very smooth. Indeed, I can still remember using printed tables to interpolate -- to go a little beyond what was tabulated -- for, say, logarithmic and trigonometric functions. Such use became pointless by, say, the 1970s, when almost anyone who wanted to do this had access to good-quality electronic calculators, to say nothing of computers, in which such functions could be calculated as accurately as you wish.

                          There are statistical applications of interpolation, as for example interpolation within quantile functions which in principle are monotonic.

                          But interpolation of noisy time series, while perfectly computable with commands such as mipolate from SSC, seems unlikely to be a good idea.

                          It can't estimate irregular fluctuations it doesn't know about.

                          Inferential properties of any model fit are much in doubt: you don't really have as many degrees of freedom as you think, and fits are unlikely to be as good as they may appear.

                          Gaussian process regression might be a better deal, but I don't know much about it or of Stata applications.



                          • #14
                            Originally posted by Maria Sundaram View Post
                            In your second point, you are suggesting that I ought to include the community-based infection values in the imputation step--do I have that correctly?
                            Yes. Generally, the imputation model must include (at least) all variables used in the substantive models.

                            Originally posted by Maria Sundaram View Post
                            If so, I'm a bit confused, as the motivation behind the overall analysis is to identify/quantify the relationship between wastewater-based PCR values and community-based infection. Including community-based infection values at the step of imputation only to then turn around and do it again in the subsequent substantive analysis seems... duplicative to me?
                            Think of it this way: Suppose we have fully observed data on both water-based PCR and community-based infection. Suppose further that there is a correlation between water-based PCR and community-based infection. Now suppose we deleted some values in water-based PCR, either completely at random or depending on the values of community-based infection. If we were now to impute the deleted (missing) values in water-based PCR independently of the observed values of community-based infection, the resulting imputed values in water-based PCR would be uncorrelated with the observed values of community-based infection. As a consequence, our substantive analyses would underestimate the true correlation.

                            Here is a quick (imperfect) illustration

                            Code:
                            . version 18
                            
                            .
                            . // setup
                            . clear
                            
                            . set obs 100
                            Number of observations (_N) was 0, now 100.
                            
                            . set seed 42
                            
                            . generate infection = rnormal()
                            
                            . generate pcr = infection + rnormal()
                            
                            .
                            . // true association
                            . regress pcr infection
                            
                                  Source |       SS           df       MS      Number of obs   =       100
                            -------------+----------------------------------   F(1, 98)        =    130.96
                                   Model |  107.101168         1  107.101168   Prob > F        =    0.0000
                                Residual |  80.1482791        98  .817839583   R-squared       =    0.5720
                            -------------+----------------------------------   Adj R-squared   =    0.5676
                                   Total |  187.249447        99  1.89140855   Root MSE        =    .90434
                            
                            ------------------------------------------------------------------------------
                                     pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                               infection |   1.051622    .091896    11.44   0.000     .8692574    1.233987
                                   _cons |   .0526003   .0913436     0.58   0.566     -.128668    .2338686
                            ------------------------------------------------------------------------------
                            
                            .
                            . // create MAR
                            . replace pcr = . if infection < 0 & runiform() < .5
                            (23 real changes made, 23 to missing)
                            
                            .
                            . // linear model is still unbiased in this case (missings depend linearly on X only)
                            . // this does not seem to be well known
                            . // anyway, the CIs are naturally wider
                            . regress pcr infection
                            
                                  Source |       SS           df       MS      Number of obs   =        77
                            -------------+----------------------------------   F(1, 75)        =     79.23
                                   Model |  74.2717152         1  74.2717152   Prob > F        =    0.0000
                                Residual |  70.3085528        75  .937447371   R-squared       =    0.5137
                            -------------+----------------------------------   Adj R-squared   =    0.5072
                                   Total |  144.580268        76  1.90237195   Root MSE        =    .96822
                            
                            ------------------------------------------------------------------------------
                                     pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                               infection |   1.059513   .1190332     8.90   0.000     .8223871     1.29664
                                   _cons |   .0269195   .1205366     0.22   0.824    -.2132018    .2670408
                            ------------------------------------------------------------------------------
                            
                            .
                            . // impute ignoring association
                            . mi set flong
                            
                            . mi register imputed pcr
                            (23 m=0 obs now marked as incomplete)
                            
                            . mi impute regress pcr , add(5)
                            
                            (output omitted)
                            .
                            . // estimate is biased towards zero
                            . mi estimate : regress pcr infection
                            
                            Multiple-imputation estimates                   Imputations       =          5
                            Linear regression                               Number of obs     =        100
                                                                            Average RVI       =     0.7345
                                                                            Largest FMI       =     0.5136
                                                                            Complete DF       =         98
                            DF adjustment:   Small sample                   DF:     min       =      13.89
                                                                                    avg       =      14.16
                                                                                    max       =      14.42
                            Model F test:       Equal FMI                   F(   1,   14.4)   =      22.58
                            Within VCE type:          OLS                   Prob > F          =     0.0003
                            
                            ------------------------------------------------------------------------------
                                     pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                               infection |   .7750225   .1631108     4.75   0.000     .4261293    1.123916
                                   _cons |   .3002586   .1636124     1.84   0.088     -.050905    .6514221
                            ------------------------------------------------------------------------------
                            
                            .
                            . // clear imputation
                            . mi extract 0 , clear
                            
                            .
                            . // impute; this time account for association
                            . mi set flong
                            
                            . mi register imputed pcr
                            (23 m=0 obs now marked as incomplete)
                            
                            . mi impute regress pcr = infection , add(5)
                            
                            (output omitted)
                            
                            .
                            . // recover true association
                            . // CIs are a little too wide because FMI requires approx. 17 imputed datasets and we only have 5
                            . mi estimate : regress pcr infection
                            
                            Multiple-imputation estimates                   Imputations       =          5
                            Linear regression                               Number of obs     =        100
                                                                            Average RVI       =     0.1487
                                                                            Largest FMI       =     0.1666
                                                                            Complete DF       =         98
                            DF adjustment:   Small sample                   DF:     min       =      54.88
                                                                                    avg       =      58.14
                                                                                    max       =      61.40
                            Model F test:       Equal FMI                   F(   1,   61.4)   =     108.28
                            Within VCE type:          OLS                   Prob > F          =     0.0000
                            
                            ------------------------------------------------------------------------------
                                     pcr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                               infection |   1.083989   .1041729    10.41   0.000     .8757095    1.292268
                                   _cons |   .0040173   .1049229     0.04   0.970    -.2062632    .2142978
                            ------------------------------------------------------------------------------
                            Last edited by daniel klein; 30 Aug 2024, 13:28. Reason: added "imperfect" to describe the illustrating example; bivariate regression, no leads or lags of pcr in imputation model, etc.



                            • #15
                              (edited to add: Thank you so much Nick for your thoughts on interpolation--they are super helpful!)

                              Thanks Daniel (and my apologies for the late reply on this). I really appreciate this and it will definitely help me improve my MI implementation. Thank you!

                              I think I may still be implementing the MI situation incorrectly, though, because I wound up with an 'insufficient observations' error after MICE and "mi estimate: xtreg". Perhaps it's the issue that Nick was talking about, where I don't have the precision or the observations that I 'think' I do. But here's a sample of what that code looks like--if I'm going wrong somewhere obvious, could you let me know?

                              Code:
                              frame change mice_panel
                              gen sample_collect_date = date(SampleCollectDate, "MDY")
                              format sample_collect_date %td
                              tsset sample_collect_date
                              tsfill //this allows us to have other days in "between" the observed days
                              tsset, clear
                              mi set flong
                              mi xtset sample_collect_date
                              drop if sample_collect_date < td(06Jan2022)
                              missings report
                              
                              mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr pctpos_mm
                              
                              mi impute chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr pctpos_mm, add(40) rseed(7834131)
                              
                              mi estimate: xtreg pctpos_mm ln_n1n2_avg_mm_dpcr tests_perpop_mm tests_mm

                              The 'estimate' statement yields the following error:

                              insufficient observations
                              an error occurred when mi estimate executed xtreg on m=1
                              r(2001);


                              Looking at the data, it seems that there is indeed only one observation for every 7 days (i.e., the 'mi estimate: xtreg' statement doesn't seem to be making use of imputed values). But maybe this just boils down to another misconception I had about what MI is capable of.
                              Last edited by Maria Sundaram; 08 Sep 2024, 16:24.
