Limiting the range of the values of the variables in multiple imputation

Joana M. Lima

Join Date: Mar 2015

Posts: 11
#1

07 May 2017, 05:39

Dear all,

I have a dataset of 24 variables collected for 148 countries across 6 years. All except one variable have missing values, with a maximum of 25% missing values for two of the variables.

I have successfully used multiple imputation to create 10 multiply imputed datasets using the following commands in Stata 13:

mi set flong

mi xtset CountryNum Year

generate CountryYear = CountryNum*Year
mi register regular Country CountryNum Year FreedomOfPress CountryYear

mi register imputed FDIPercGDP ValueChain CorruptionPRS_ICRG InwardFlowsTradeKOF RestrictionsTradeKOF ClusterDevelopment ///
AntiMonopPolicy ExtentMarketing ForeignOwn ImpactRulesFDI TradeBarriers MeanPolCapScore ///
AuditReportStandards DivertPublicFunds EfficacyCorpBoards EthicBehavFirms FavoritGovOfficials ///
InvestProtect IrregPayBribes JudicialIndep MarketDominance ProtectMinShare TradeTariffsPerc

mi impute chained (regress) FDIPercGDP ValueChain CorruptionPRS_ICRG InwardFlowsTradeKOF RestrictionsTradeKOF ClusterDevelopment ///
AntiMonopPolicy ExtentMarketing ForeignOwn ImpactRulesFDI TradeBarriers MeanPolCapScore ///
AuditReportStandards DivertPublicFunds EfficacyCorpBoards EthicBehavFirms FavoritGovOfficials ///
InvestProtect IrregPayBribes JudicialIndep MarketDominance ProtectMinShare TradeTariffsPerc, add(10) rseed(1982) force

mi convert wide, clear

The problem is that the values I obtain in the imputed datasets are severely off range. For example, the variable JudicialIndep should only take values from 0 to 7 and yet I obtain ranges of -10 to 15.

I want to be able to restrict the upper and lower limits of the values that the variables may take. Unfortunately, the information I have found so far pertains only to univariate imputation.

For example:

http://www.stata.com/manuals13/mimii...miimputeintreg

Could someone kindly advise on how to set ranges for all the variables to be imputed simultaneously?

Thank you very much!

Joana
daniel klein

Join Date: Mar 2014

Posts: 3859
#2

07 May 2017, 09:23

Suggestion with respect to the question: use pmm instead of regress for the imputation model.

Further comments:

Maybe an ordered model would fit even better than a linear one, but I cannot comment based on the information given.

Your regular variables do not appear in the imputation model. This is likely a problem if they are used in the substantive model later.

You do not seem to account for the nested data structure in the imputation model, but probably should in some way.

Ten imputed datasets might be too few given the amount of missing values reported. Check the FMI and adjust the number if needed.

The force option looks scary and usually indicates serious problems. Why do you need it?

Best
Daniel

Last edited by daniel klein; 07 May 2017, 09:25.
Andrea Discacciati

Join Date: Feb 2016

Posts: 194
#3

07 May 2017, 09:40

In addition to daniel's suggestion.

You could impute JudicialIndep using regress on a logit transform of JudicialIndep and then transform the imputed values back to the original scale. This works provided that JudicialIndep is a continuous variable and that you can assume it follows (at least approximately) a normal distribution on the logit scale.
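A minimal sketch of that transform-impute-back-transform idea, run on simulated data rather than the poster's dataset. The variable names (score, x, logit_score, score_imp) are hypothetical, and score is assumed to be continuous and strictly inside (0, 7), because logit() returns missing at the boundaries.

```stata
clear
set seed 12345
set obs 200

* Simulated stand-ins (hypothetical): x is a complete predictor,
* score is a continuous variable strictly inside (0, 7) with ~20% missing
generate x = rnormal()
generate score = 7 * invlogit(0.5 * x + rnormal(0, 0.5))
replace score = . if runiform() < 0.2

* Rescale to (0, 1), then map onto the real line
generate logit_score = logit(score / 7)

mi set flong
mi register imputed logit_score
mi register regular x

* Impute on the logit scale with a linear model
mi impute regress logit_score = x, add(10) rseed(1982)

* Back-transform within each imputation; values stay inside (0, 7)
mi passive: generate score_imp = 7 * invlogit(logit_score)
mi xeq: summarize score_imp
```

For a variable bounded between 0 and 1, the same pattern applies without the division and multiplication by 7. Observed values lying exactly on a bound would need separate handling, which this sketch does not attempt.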
Joana M. Lima

Join Date: Mar 2015

Posts: 11
#4

08 May 2017, 07:40

Dear Daniel and Andrea,

Thank you very much for your prompt reply.

I am quite new to MI and don't have a background in statistics so please bear with me as I address your suggestions!

@ Andrea, apologies, I forgot to mention that I had checked each one of my variables for normality and then transformed them accordingly to obtain normal distributions. So, for example, JudicialIndep was squared. But even so, the results are off range.

@ Daniel, as per your suggestion, I tried pmm:

mi impute pmm n_FDIPercGDP n_ValueChain n_CorruptionPRS_ICRG n_InwardFlowsTradeKOF n_RestrictionsTradeKOF n_ClusterDevelopment ///
n_AntiMonopPolicy n_ExtentMarketing n_ForeignOwn n_ImpactRulesFDI n_TradeBarriers n_MeanPolCapScore ///
n_AuditReportStandards n_DivertPublicFunds n_EfficacyCorpBoards n_EthicBehavFirms n_FavoritGovOfficials ///
n_InvestProtect n_IrregPayBribes n_JudicialIndep n_MarketDominance n_ProtectMinShare n_TradeTariffsPerc, add(10)

The reason my regular variables don't appear in the model is that I obtain an error message when I do include them. So when I run

mi impute pmm Year n_FreedomOfPress CountryYear n_FDIPercGDP n_ValueChain n_CorruptionPRS_ICRG n_InwardFlowsTradeKOF n_RestrictionsTradeKOF n_ClusterDevelopment ///
n_AntiMonopPolicy n_ExtentMarketing n_ForeignOwn n_ImpactRulesFDI n_TradeBarriers n_MeanPolCapScore ///
n_AuditReportStandards n_DivertPublicFunds n_EfficacyCorpBoards n_EthicBehavFirms n_FavoritGovOfficials ///
n_InvestProtect n_IrregPayBribes n_JudicialIndep n_MarketDominance n_ProtectMinShare n_TradeTariffsPerc, add(10)

I obtain the following message "Year: must be registered as imputed; see " and I didn't register the complete-information variables as imputed, as per the Stata manual's instructions.

On your point on the use of "force": all of the non-regular variables have missing values. When I don't use "force", I obtain the following message

"n_FDIPercGDP: missing imputed values produced
This may occur when imputation variables are used as independent variables or when independent variables contain missing values. You can specify option force
if you wish to proceed anyway."

This is why I used it. Do you have any other thoughts on this?

You mention the nested structure of the data. I agree. However, after many despairing attempts at employing "ice", I gave up. I found the following paper
http://onlinelibrary.wiley.com/doi/1...jomf.12144/pdf

and decided on the code I shared earlier, assuming that "mi xtset CountryNum Year" would declare the panel nature of the data.

Lastly, you mention FMI. Do you mean "full information maximum likelihood"?

Thank you so very much for your help, I look forward to your thoughts on this.

Best wishes,

Joana
daniel klein

Join Date: Mar 2014

Posts: 3859
#5

08 May 2017, 09:59

Quick answers, sorry, I do not have much time these days.

The reason my regular variables don't appear in the model is that I obtain an error message when I do include them.

You are supposed to (a) use chained equations* and (b) enter the regular variables that have no missing values on the right-hand side of the equals sign. Something like

Code:

mi impute chained (pmm) imputed_vars = regular_vars ...

Hopefully this will also solve the problem of imputed missing values. Try without the force option to see how it works out.
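As a rough illustration of the pattern above, here is a sketch on simulated data. The variable names a, b, z, and year are made up for the example; a and b stand in for scores bounded between 0 and 7. Because PMM fills each gap with an observed donor value, the imputed values cannot leave the observed range. The knn(10) setting is an assumption of this sketch, not something recommended at this point in the thread.

```stata
clear
set seed 2017
set obs 300

* Simulated stand-ins (hypothetical): complete predictors z and year,
* incomplete scores a and b bounded between 0 and 7
generate year = 2010 + mod(_n, 6)
generate z = rnormal()
generate a = min(7, max(0, 3.5 + z + rnormal()))
generate b = min(7, max(0, 3.5 - 0.5*z + rnormal()))
replace a = . if runiform() < 0.25
replace b = . if runiform() < 0.25

mi set flong
mi register imputed a b
mi register regular year z

* Incomplete variables left of "=", complete variables right of it
mi impute chained (pmm, knn(10)) a b = z i.year, add(10) rseed(1982)

* Imputed values are borrowed from observed ones, so they stay in [0, 7]
mi xeq: summarize a b
```

No force option is needed here, since every variable on the right-hand side is complete.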

You mention the nested structure of the data. I agree. However, after many despairing attempts at employing "ice", I gave up. I found the following paper
http://onlinelibrary.wiley.com/doi/1...jomf.12144/pdf

and decided on the code I shared earlier, assuming that "mi xtset CountryNum Year" would declare the panel nature of the data.

I must admit that I am not fully up to date with the latest developments in this area. But running mi xtset on your data merely allows you to run panel estimation models. It does nothing for your imputation model. If I am wrong here, I would be very happy to be corrected.

Lastly, you mention FMI. Do you mean "full information maximum likelihood"?

Nope, I did not mean FIML as an alternative to MI. FMI is the fraction of missing information and it can be used to judge how many imputations you need. See the manual of mi impute and references given there.
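One way to look at the FMI after imputing is the vartable option of mi estimate, which lists the fraction of missing information for each coefficient. The sketch below uses simulated data and hypothetical variable names (x, y); the regression is only a placeholder for whatever substantive model is actually of interest.

```stata
clear
set seed 99
set obs 300

* Simulated stand-ins (hypothetical): y is complete, x has ~25% missing
generate x = rnormal()
generate y = 1 + 0.5*x + rnormal()
replace x = . if runiform() < 0.25

mi set flong
mi register imputed x
mi register regular y
mi impute pmm x = y, add(10) knn(10) rseed(1982)

* vartable adds a table with within/between variance, RVI, and FMI
mi estimate, vartable: regress y x
```

If the reported FMI is large, rerunning the imputation with a larger add() is the usual adjustment.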

Best
Daniel

* Stata will tell you when you have a monotone missing pattern and act accordingly.

Last edited by daniel klein; 08 May 2017, 10:08. Reason: deleted misleading rule of thumb
Joana M. Lima

Join Date: Mar 2015

Posts: 11
#6

09 May 2017, 04:06

Dear Daniel,

Thank you so much, the code you shared worked!

All the very best,
Joana
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

09 May 2017, 09:55

Originally posted by Andrea Discacciati

In addition to daniel's suggestion.

You could impute JudicialIndep using regress on a logit transform of JudicialIndep and then transform the imputed values back to the original scale. This works provided that JudicialIndep is a continuous variable and that you can assume it follows (at least approximately) a normal distribution on the logit scale.

I'm late to the party, but hopefully this will add something to the discussion. Joana M. Lima said that JudicialIndep "should only take values from 0 to 7..." That implies that it's an ordinal scale. In that case, that variable could be imputed using ordinal logit.

If predictive mean matching for everything worked, then I would say leave things be. However, there is a way to use different imputation models for different variables. In the example code below, I am making up what variable goes with what model, but here goes:

Code:

mi impute chained ///
    (pmm, knn(10)) FDIPercGDP ValueChain CorruptionPRS_ICRG ///
    (logit) AntiMonopPolicy ExtentMarketing ForeignOwn ImpactRulesFDI TradeBarriers MeanPolCapScore ///
    (ologit) JudicialIndep MarketDominance ProtectMinShare TradeTariffsPerc ///
    = (whatever is not missing), (other options as necessary)

Again, Judicial Independence could be imputed using ordinal logit, but it doesn't have to be. PMM matches against observed values in the data. I forget whether the default is to match with the single nearest neighbor, equivalent to the option knn(1), but I've come across some guidance saying that you should match from the nearest 10 neighbors or so (hence the knn option above).

https://www.ssc.wisc.edu/sscc/pubs/stata_mi_models.htm
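A small runnable sketch of mixing imputation methods in one chained call, using simulated data. The variables cont, ord, and z are hypothetical: cont is a continuous variable imputed by PMM, and ord is an integer score from 1 to 7 imputed by ordinal logit. Which method suits which real variable is not something this example can decide.

```stata
clear
set seed 4242
set obs 400

* Simulated stand-ins (hypothetical): z is complete,
* cont is continuous, ord is an ordinal score taking values 1..7
generate z = rnormal()
generate cont = 2 + z + rnormal()
generate ord = 1 + floor(7 * invlogit(z + rnormal()))
replace ord = 7 if ord > 7
replace cont = . if runiform() < 0.2
replace ord = . if runiform() < 0.2

mi set flong
mi register imputed cont ord
mi register regular z

* Each method applies to the variables listed right after it;
* knn() is an option of pmm, so it goes inside the parentheses
mi impute chained (pmm, knn(10)) cont (ologit) ord = z, add(10) rseed(1982)

* Compare the original data (m=0) with the first imputation (m=1)
mi xeq 0 1: tabulate ord, missing
```

Since ologit predicts only categories that occur in the observed data, the imputed ord values stay on the 1 to 7 scale.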

Any time you use a specific model to impute stuff, the data should obey the assumptions of the model, or at least not violate them too badly. With ordinal logit, the assumption in question is proportional odds. We all should probably diagnose our imputation models more, and yes, I've been guilty of not diagnosing them completely in the past.

One other issue in your model is that you xtset your data, implying that it's panel data. Your imputation model doesn't consider that the observations cluster within each country, I think. This is a problem, but there doesn't appear to be a good solution right now. Happened to me as well.

Last edited by Weiwen Ng; 09 May 2017, 09:57.

daniel klein

Join Date: Mar 2014

Posts: 3859
#8

09 May 2017, 11:01

PMM matches against observed values in the data. I forget whether the default is to match with the single nearest neighbor, equivalent to the option knn(1), but I've come across some guidance saying that you should match from the nearest 10 neighbors or so (hence the knn option above).

Very good point. In older versions of Stata the default is indeed set to 1, and it produces invalid results. The problem is that the matching process is supposed to involve a random element, which is not the case when there is only one neighbor to match. Paul Allison made this point in his (excellent) blog. I do not believe there is a one-number-fits-all solution, but you certainly should not go with the default in older versions.

Best
Daniel
Kate Ennis

Join Date: Jan 2018

Posts: 15
#9

19 Jan 2018, 07:17

Originally posted by Andrea Discacciati

In addition to daniel's suggestion.

You could impute JudicialIndep using regress on a logit transform of JudicialIndep and then transform the imputed values back to the original scale. This works provided that JudicialIndep is a continuous variable and that you can assume it follows (at least approximately) a normal distribution on the logit scale.

Hi there, just on the note of this comment: I need to transform a variable to the logit scale so that I can impute it using regress and keep it bounded between 0 and 1, but I am unsure of what code to use for the transformation. Could anyone offer any advice please?

Thanks
