Dear Statalist,
I have a dataset containing information on SARS-CoV-2 gene concentration in wastewater over time. The data come from the same wastewater treatment plant, sampled approximately twice each week from the beginning of 2022 to the end of 2023. I am interested in 'filling in the gaps' for SARS-CoV-2 concentration (variables ln_pmmov_mm_dpcr and ln_n1n2_avg_mm_dpcr) on days when the wastewater was not sampled. I am hoping not to have to tell Stata up front which type of distribution these 'missing' values should come from, as there is substantial variability in the values over time. Instead, it would be reasonable to ask Stata to make a guess for the missing values based mostly on values close by in time. I am running StataNow/BE 18.5.
At first, I thought it might be nice to use multiple imputation with chained equations. I struggled with this approach: the most reasonable way to implement it seemed to be reshaping my long dataset to wide. Since I have >600 observations for one wastewater treatment plant (i.e., one "individual"), a reshape to wide would have produced a massively wide dataset with one row. This seemed infeasible (and in fact, running Stata BE, it is impossible for me to have that many variables in the dataset). However, I would love to be proven wrong about this, and it is quite possible that I am, as I am really unfamiliar with the -mi- suite of commands and procedures. I have some halfhearted code below that uses -flong- rather than a wide data structure. (I am sure it is not correct, and it only runs when there is only one predictor in the model.)
The -mipolate- command (written by Nick Cox) seemed like a nice way around this "massively wide dataset" issue. However, it seems that -mipolate- requires an xvar predictor of yvar. I have several 'candidate' predictors (e.g., pH of the wastewater sample, flow rate at time of sampling, etc.), but I understand that -mipolate- by itself is agnostic to panel structure. Would it be reasonable, then, to tell -mipolate- that the best predictor in my case is the sample collection date? In that case, I would want to use the -idw- option for -mipolate-, as long as I can somehow let -mipolate- know that the data are changing over time (and/or that time is an important factor in deciding what the interpolated value should be).
Finally, I have seen that in other circumstances people sometimes use linear mixed models with random intercepts to solve problems such as these. I would prefer to avoid this if possible, because I think it would require me to tell Stata that I believe all the missing values come from the same distribution (and I am not certain that they do). However, maybe someone more talented in statistics can say whether that is a large or a trivial problem.
An example of my dataset is provided below:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int sample_collect_date double(ph_mm flow_rate_mm) float(ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr)
22651 7.67 33.65 19.72869 14.237268
22653 7.55 33.33 19.78993 14.2145
22654 7.69 33.82 19.72 13.95687
22655 7.9 34.15 19.96952 13.528894
22656 7.68 34.56 19.29398 13.24352
22657 7.61 34.4 18.551708 13.530492
22658 7.59 35.31 18.97065 12.838408
22660 7.56 32.83 19.286436 13.168484
22661 7.62 33.42 19.19116 13.094458
22662 7.74 34.69 19.715847 13.27106
22663 7.63 34.36 19.61545 12.855686
22664 7.62 33.84 19.49199 12.944575
22665 7.57 34.04 19.5535 14.668052
22667 7.54 34.13 19.59835 15.28725
22668 7.65 33.83 19.72 13.982515
22669 7.76 35.07 19.625214 13.568893
22671 7.83 35.55 19.58283 14.082418
22672 7.68 35.31 19.508703 13.549563
22674 7.69 34.58 19.48801 12.253338
22675 7.73 35.19 19.58358 12.79608
22676 7.73 35.58 19.719017 12.006645
22677 7.66 35.16 19.67184 12.659946
22678 7.85 35.3 19.634644 12.654342
22679 7.78 34.62 18.995495 15.99823
22681 7.7 34.26 19.49582 13.887273
22682 7.75 35.16 19.35253 12.862556
22683 7.87 35.34 19.612177 13.757033
22684 7.85 35.3 19.308737 12.948771
22685 7.72 34.98 19.44844 13.318457
22686 7.8 34.54 19.26173 12.133932
22688 7.69 33.94 19.381716 12.11571
22689 7.7 34.39 19.436636 12.346183
22690 7.82 35.16 19.55105 12.988695
22691 7.7 34.66 19.455856 11.99387
22692 7.66 34.51 19.613997 12.609937
22693 7.71 34.48 19.606083 13.275462
22695 7.72 33.82 20.38286 13.681842
22696 7.74 35.39 19.69235 12.759498
22697 7.71 34.68 20.02241 11.7966
22698 7.64 33.6 19.587824 12.673573
22699 7.79 34.46 19.885324 13.510506
22700 7.63 34.03 19.342875 13.118195
22702 7.69 33.97 19.707045 12.13221
22703 7.63 35.03 19.8844 11.11543
22704 7.79 35.62 19.67812 11.313498
22705 7.61 35.96 19.644926 11.739583
22706 7.64 35.75 19.87014 12.161912
22707 7.69 35.49 19.73376 12.771386
22709 7.55 37.03 19.654406 14.087673
22710 7.64 39.78 19.522354 13.001777
end
format %td sample_collect_date
Here is the awful code I have been using to get multiply imputed values (I'm aware this is 100% not how I should do it, but I am not sure how to work around the wide-dataset issue):
Code:
* declare the daily time series and create rows for the unsampled days
tsset sample_collect_date
tsfill
tsset, clear

* set the data -flong- and declare the time variable to -mi-
mi set flong
mi tsset sample_collect_date

cd "H:\myfilelocation"

* register and impute the two concentration variables
mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr
mi impute chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr, add(10) rseed(7834131)

* save each of the 10 imputed datasets to its own file
preserve
forval i = 1/10 {
    mi extract `i', clear
    save mi_dataset_`i', replace
    restore, preserve
}

* add a within-file observation id to each saved dataset
forval i = 1/10 {
    use mi_dataset_`i', clear
    sort sample_collect_date
    gen id = _n
    save mi_dataset_`i', replace
}
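For completeness, here is the variant I have been wondering about but have not tested: create rows for the unsampled days with -tsfill-, keep the data -flong-, impute the covariates alongside the concentrations (since pH and flow rate are also missing on unsampled days), and let the collection date, which is complete after -tsfill-, enter each imputation model as a regular predictor. Treating the date as a linear predictor is clearly a strong assumption, so please read this as a sketch rather than something I believe is correct:

Code:

* untested sketch: impute concentrations and covariates together, with the
* collection date (complete after -tsfill-) as the only regular predictor
tsset sample_collect_date
tsfill
tsset, clear
mi set flong
mi register imputed ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr ph_mm flow_rate_mm
mi register regular sample_collect_date
mi impute chained (regress) ln_pmmov_mm_dpcr ln_n1n2_avg_mm_dpcr ph_mm flow_rate_mm ///
    = sample_collect_date, add(10) rseed(7834131)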
And here is what I had considered using for -mipolate- if it sounds reasonable to use sample_collect_date as the xvar predictor of yvar:
Code:
mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(mip_n1n2_avg) idw(3)
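One detail I am unsure about: as far as I can tell, -mipolate- can only fill in observations that already exist in the dataset, so I assume the rows for the unsampled days would need to be created first, along these lines:

Code:

* assumed prerequisite: create rows for the unsampled days, then interpolate
tsset sample_collect_date
tsfill
mipolate ln_n1n2_avg_mm_dpcr sample_collect_date, gen(mip_n1n2_avg) idw(3)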
I would love any thoughts the forum may be able to share on the best way to deal with my data structure. I'm aware this is sort of a combined coding/stats issue, and I recognize that the answer may well be "do more homework on your own"--that would be reasonable, but if anyone has thoughts or opinions to share, I would find them deeply useful as a sanity check.