  • Imputation for two time horizons

    Hi,

    A question arose: how can imputations with different time horizons be combined? The big five variables were collected from 2005 onwards; the others were collected from 1993 to 2019.

    Thank you in advance!

    Best,
    Vera

    Code:
    *declaring the data to be mi data in marginal long style (mlong)
    mi set mlong

    *registering variables
    mi register imputed bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m intact_family_m educ_m learn_gp educ_gp
    mi register regular gpa female migrant num_sib north east south west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m

    mi impute chained (pmm, knn(5)) educ_m learn_gp educ_gp ///
    (logit) intact_family_m = ///
    gpa female migrant num_sib north east west ///
    age_birth learn_m only_child_m oldest_m num_sib_m, ///
    add(20) rseed(1234)
    mi xeq 0 1 20: sum educ_m learn_gp educ_gp intact_family_m

    *imputation: big five
    mi impute chained (pmm, knn(5)) bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m = gpa female migrant num_sib north east west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m if syear >= 2005, add(20) rseed(1234)
    *descriptive statistics
    mi xeq 0 1 20: sum bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m

  • #2
    I read from your post that your original data are in long format, meaning that there are multiple rows for each unit/person. In this case, you should first convert the dataset to wide, then impute, and then convert it back. Basically like so:

    Code:
    *** Setup of example data ***
    webuse nlswork, clear
    xtset, clear
    keep if inrange(year, 70, 73)
    keep idcode year wks_ue union tenure
    
    *** Imputation ***
    reshape wide wks_ue union tenure, i(idcode) j(year)
    mi set flong
    mi register imputed wks_ue* union* tenure*
    mi impute chained (pmm, knn(5)) wks_ue* union* tenure*, add(3) rseed(123)
    mi reshape long wks_ue union tenure, i(idcode) j(year)
    
    mi estimate: ...
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar

    • #3
      Dear Felix Bittmann,

      thanks for your quick reply. Yes, you are absolutely right. My data are in long format, but (I think) the problem is still not solved when I reshape my data. So, let's assume I reshape my data to wide format. The missing values for each variable stay the same, e.g.:

      long format:
      pid year income
      1 2000 10
      1 2001 .
      1 2002 3
      Reshaping the data to wide format yields:
      pid income_2000 income_2001 income_2002
      1 10 . 3
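
      The toy example can be reproduced with a minimal sketch like the following. The data are made up for illustration. Note that with the stub income, reshape names the new variables income2000, income2001 and income2002 (no underscore), and the missing value simply moves into its own column:

      ```stata
      * Made-up toy data in long format (one row per person-year)
      clear
      input pid year income
      1 2000 10
      1 2001 .
      1 2002 3
      end

      * Long -> wide: one row per person, one income column per year
      reshape wide income, i(pid) j(year)
      list

      * The 2001 value is still missing after reshaping
      count if missing(income2001)
      ```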

      With this in mind: the variables educ_m learn_gp educ_gp intact_family_m are collected from 1993 to 2019, but the variables bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m are only observed from 2005 onwards. The missing values in wide format are still the same, i.e. the big five were not collected before 2005. If I imputed all variables for the same time period (1993-2019), I would also impute big five values for the years before 2005, and imputing variables that were never collected is not recommended, right? So even if I reshape the data, I need to perform "two separate imputations". How do I do that? How do I combine the two imputed data sets? Do I store them and merge them? Or how can I combine the code below if I use the wide format?

      Collected from 1993-2019:
      mi impute chained (pmm, knn(5)) educ_m learn_gp educ_gp ///
      (logit) intact_family_m = ///
      gpa female migrant num_sib north east west ///
      age_birth learn_m only_child_m oldest_m num_sib_m, ///
      add(20) rseed(1234)
      mi xeq 0 1 20: sum educ_m learn_gp educ_gp intact_family_m

      Collected from 2005-2019:
      *imputation: big five
      mi impute chained (pmm, knn(5)) bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m = gpa female migrant num_sib north east west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m if syear >= 2005, add(20) rseed(1234)
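
      One possible way to keep both steps in a single set of imputations, rather than storing and merging two imputed data sets, might be to create the imputations once with add() and then fill in the second block of variables with replace. The following is only a sketch on the nlswork example data, not on my real variables: tenure stands in for a variable collected in all years, wks_ue for one collected only from a cutoff year (72 here, chosen arbitrarily), and ln_wage and ttl_exp are assumed to be fully observed predictors. Whether imputing in long format is adequate at all is a separate question.

      ```stata
      * Example data; all variable roles below are assumptions for illustration
      webuse nlswork, clear
      xtset, clear
      keep if inrange(year, 70, 73)
      keep idcode year ln_wage ttl_exp tenure wks_ue

      mi set mlong
      mi register imputed tenure wks_ue
      mi register regular ln_wage ttl_exp

      * Step 1: variable "collected in all years" -> creates imputations 1-5
      mi impute pmm tenure = ln_wage ttl_exp, add(5) knn(5) rseed(1234)

      * Step 2: variable "collected from 72 onwards" -> filled into the SAME
      * 5 imputations via replace; rows with year < 72 stay missing
      mi impute pmm wks_ue = ln_wage ttl_exp if year >= 72, replace knn(5) rseed(1234)

      * Both variables now live in one mi data set with M = 5
      mi estimate: regress ln_wage tenure wks_ue if year >= 72
      ```

      With add() in both steps, the second command would instead append further imputations, so the two blocks would end up in different imputation numbers.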

      I hope I am assessing my problem correctly... I am open to feedback at any time. Thanks in advance!

      Best,
      Vera
      Last edited by Vera Schmidt; Yesterday, 04:20.


      • #4
        Let's assume for a second that your data are complete, with no missing values present. What is your final analysis model? How do you plan to integrate these different time points into a single model? Answering this question first will help to decide how to impute the missing values.
        Best wishes



        • #5
          My analysis will start with simple OLS estimation. Since the big five are observed only rarely, I would like to introduce sets of characteristics stepwise, whereby the number of observations decreases:

          Starting, e.g., with all characteristics that are collected from 1993 to 2019:
          mi estimate: regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp

          *For simplicity, I only separate out the big five here; the final analysis will separate more sets of characteristics.

          Finally introducing the big five:
          mi estimate: regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m

          *I used mi estimate since we are already talking about imputation...

          I appreciate your help a lot!


          • #6
            If we assume there are no missings, your command would become:
            Code:
            regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp
            However, because the data are in long format, this will yield no results, since some variables are from the period from 2005 onwards (e.g. poor_health), while others, such as educ_gp, are from the older period (or else all data points before 2005 are dropped). Furthermore, if some variables are measured multiple times for the same individuals, OLS is not adequate, and you might want to consider panel regression models (e.g. xtreg). My impression is that the main concern here is not the imputation but your substantive analysis model, which is not yet adequate. What is your research interest? Measuring the change in variables over time?
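
            To illustrate what a panel model on imputed data could look like, here is a minimal sketch that reuses the nlswork example data from #2. The choice of ln_wage as outcome, tenure as predictor, and a fixed-effects specification are assumptions for illustration only, not a recommendation for your data:

            ```stata
            * Example data, restricted as in the earlier example
            webuse nlswork, clear
            xtset, clear
            keep if inrange(year, 70, 73)
            keep idcode year ln_wage tenure

            * Impute in wide format so each person's time points inform each other
            reshape wide ln_wage tenure, i(idcode) j(year)
            mi set flong
            mi register imputed ln_wage* tenure*
            mi impute chained (pmm, knn(5)) ln_wage* tenure*, add(3) rseed(123)
            mi reshape long ln_wage tenure, i(idcode) j(year)

            * Declare the panel structure for mi data, then fit a panel model
            mi xtset idcode year
            mi estimate: xtreg ln_wage tenure, fe
            ```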
            Best wishes
