  • estimation sample varies

    Good day, I get the following error message:

    Code:
    bys id: gen treatment=0 if (expectation==2 & f.gift_received==2) | (l.expectation==2 & gift_received==2) 
    (4,830 missing values generated)
    
    . bys id: replace treatment=1 if (expectation==2 & f.gift_received==1) | (l.expectation==2 & gift_received==1)
    (1,602 real changes made)
    
    . 
    . gen time=survey==2
    
    sort survey id implicate 
    
    . 
    . mi estimate: reg job_hours treatment time i.time#i.treatment, cluster(treatment) robust
    (system variable _mi_id updated due to changed number of obs.)
    (13 m=0 obs. added to m=1 because physically missing)
    (10 m=0 obs. added to m=2 because physically missing)
    (14 m=0 obs. added to m=3 because physically missing)
    (16 m=0 obs. added to m=4 because physically missing)
    (12 m=0 obs. added to m=5 because physically missing)
    
    estimation sample varies between m=1 and m=2; click here for details
    r(459);
    I don't have access to the data right now, so I can't change anything; I'm just trying to figure out what I did wrong. Since the sample varies within the multiple imputation, I obviously did something wrong. Is the error that I need to sort all variables used in the regression?

    The help page gives me this info:

    Code:
    1.  You are fitting a model on a subsample that changes from one imputation to another.  For example, you specified the if expression containing imputed variables.
    
    2.  Variables used by model-specific estimators contain values varying across imputations.  This results in different sets of observations being used for completed-data analysis.
    
    3.  Variables used in the model (specified directly or used indirectly by the estimator) contain missing values in sets of observations that vary among imputations.  Verify that your mi data are proper and, if necessary, use mi update to update them.
    #2 seems the most probable to me. If the values vary across imputations, would that mean I didn't sort properly?

    #3 states that certain imputations for the same observation have or don't have values, right? I don't think this should be an issue for me. How would I check for this, though? I have a pretty big data set, so how could I see whether this is the case?

  • #2
    push*



    • #3
      You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions. It is hard to diagnose a data problem when we can't access the data.

      I'm sorry but I don't use mi much and can't help you substantively.



      • #4
        I don't have access to the data right now, so I can't change anything, just trying to figure out what I did wrong.
        I think it is necessary to access the data to start the investigation on why the set of available observations differs across imputed datasets.

        (13 m=0 obs. added to m=1 because physically missing)
        (10 m=0 obs. added to m=2 because physically missing)
        (14 m=0 obs. added to m=3 because physically missing)
        (16 m=0 obs. added to m=4 because physically missing)
        (12 m=0 obs. added to m=5 because physically missing)
        First of all, you could check whether, for some imputations, the number of “obs. physically missing” equals the total number of observations: in that case, the imputation essentially does not exist in your dataset.
        If the number of imputations is greater than 5 (let’s say, 100) and just the first 5 datasets are missing, you could still get estimates with
        Code:
        mi estimate, imputations(6/100): reg job_hours treatment time i.time#i.treatment, cluster(treatment) robust
        Suppose, however, that you only imputed 5 datasets (the following is still useful if you imputed more but want to understand what went wrong in the first 5).
        I think you could start by identifying the set of observations that are missing in each dataset. In your case, let’s suppose you have 1,000 observations and 5 imputed datasets. If no observations were missing, you would have every combination of a variable "_mi_m" containing the integers from 1 to 5 and a variable "_mi_id" containing the integers from 1 to 1,000 (these variables are present when the data are in the flong or mlong style, where "_mi_m" is also 0 for the original data). Since “_mi_m” and “_mi_id” are system variables, it’s better to generate equivalent variables that we have control over:

        Code:
        generate imp_num=_mi_m
        generate ID=_mi_id
        Then save the file (with imp_num, ID and any other variables you want). You can generate a dataset with all the combinations in this way:

        Code:
        clear
        set obs 5
        generate imp_num = _n
        expand 1000
        bysort imp_num: generate  ID = _n
        Then, by saving this file and merging the two files, you could identify the missing combinations (those with “_merge” equal to 1 or 2, depending on which dataset you used as “master” and which one as “using”).

        At this point, you could generate a dataset with all the missing combinations:
        Code:
        keep if _merge<3
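        As a minimal sketch of the save-and-merge step, the toy example below uses 3 imputations and 4 IDs instead of 5 and 1,000, and removes two combinations on purpose so that there is something to find. The tempfile name observed is made up for the illustration; with real data you would save your own file (containing imp_num and ID) instead of building the toy one.

        Code:
        * Toy "observed" data: 3 imputations x 4 IDs, two combinations deliberately dropped
        clear
        set obs 3
        generate imp_num = _n
        expand 4
        bysort imp_num: generate ID = _n
        drop if (imp_num == 2 & ID == 3) | (imp_num == 3 & ID == 1)
        tempfile observed
        save `observed'
        
        * Full grid of all combinations, used as the master dataset
        clear
        set obs 3
        generate imp_num = _n
        expand 4
        bysort imp_num: generate ID = _n
        
        * _merge == 1 marks combinations that are in the grid but not in the observed data
        merge 1:1 imp_num ID using `observed'
        keep if _merge < 3
        list imp_num ID
        With the full grid as master, the two removed combinations (imp_num 2 with ID 3, and imp_num 3 with ID 1) are the only rows left after the final keep.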
        If you want to see whether there are some observations that have never been imputed and, in general, how many observations (and which) have how many missing values, you could do:
        Code:
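        * Run this on the dataset of missing combinations kept above (do not clear it)
        sort ID
        by ID: generate count = _n
        by ID: egen max_count = max(count)   // number of imputations in which this ID is missing
        keep if count == max_count           // keep one row per ID
        tab count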
        Then you could save the dataset and, if you want one containing only the observations that have never been imputed:

        Code:
        keep if count==5
        Originally posted by Oscar Weinzettl
        The help page gives me this info:

        Code:
        1. You are fitting a model on a subsample that changes from one imputation to another. For example, you specified the if expression containing imputed variables.
        
        2. Variables used by model-specific estimators contain values varying across imputations. This results in different sets of observations being used for completed-data analysis.
        
        3. Variables used in the model (specified directly or used indirectly by the estimator) contain missing values in sets of observations that vary among imputations. Verify that your mi data are proper and, if necessary, use mi update to update them.
        #2 seems the most probable to me. If the values vary across imputations, would that mean I didn't sort properly?

        #3 states that certain imputations for the same observation have or don't have values, right? I don't think this should be an issue for me. How would I check for this, though? I have a pretty big data set, so how could I see whether this is the case?
        I would start this investigation by learning which imputations did not generate any observations (or are not in the dataset), which observations have never been imputed, and which observations are missing in each imputation.
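        One possible way to begin, assuming the data are already mi set and using the variable names from the original post, is to let Stata report on each imputation directly. This is only a sketch and has not been run on the data in question.

        Code:
        * Summary of the mi data: style, number of imputations, registered variables
        mi describe
        
        * Report variables whose values vary across imputations
        mi varying
        
        * Run count separately on m=0, 1, ..., M; compare the counts across m=1, ..., M
        mi xeq: count if missing(job_hours, treatment, time)
        
        * As the help page suggests, verify the mi data and update them if necessary
        mi update
        If the counts differ between imputations, the model variables are missing for different sets of observations, which matches reason #3 on the help page.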
        Last edited by Federico Tedeschi; 24 Jan 2023, 07:10.
