Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Populating month level data from annual level data for the same observations

    I have a dataset which has both month level data and annual level data over a period of 10 years for the same observations. The annual level data contains an income variable which shows missing (" . ") for monthly-level data. I want to populate this income data using the individual ID variable (CPSPID) for each observation for all years. So for example for year 2015, I want to divide inctot (income) variable by 12 and populate the missing values for month-level data.

    Here's a data example:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double cpsidp byte asecflag int year byte month double inctot
    20220102751001 . 2022  2     .
    20130302944202 2 2013  3     .
    20120402875001 . 2013  6     .
    20210802685602 . 2022  8     .
    20190902824903 . 2019 11     .
    20130602871002 . 2014  8     .
    20170802900702 . 2018  8     .
    20150602887001 . 2016  6     .
    20170202981102 . 2017  5     .
    20110702852901 . 2012  9     .
    20190202877706 . 2020  2     .
    20200402787901 . 2020  7     .
    20121002879501 . 2012 12     .
    20180602842101 . 2018  7     .
    20130702949102 . 2013 10     .
    20180902859801 . 2019 12     .
    20101102813502 . 2012  2     .
    20120602977902 . 2013  8     .
    20160702922302 . 2017  8     .
    20160502962601 . 2016  6     .
    20150202971501 . 2015  2     .
    20130202973601 2 2013  3     .
    20120702955902 . 2012  9     .
    20170802892601 . 2017 10     .
    20110102955903 2 2012  3     .
    20130502897204 . 2014  7     .
    20130702829502 . 2013  9     .
    20170302979603 . 2017  5     .
    20110802811503 . 2012  9     .
    20130802941102 . 2014 10     .
    20200102843902 . 2021  1     .
    20130902879701 . 2014 11     .
    20120502945301 . 2012  6     .
    20160502934602 . 2017  5     .
    20131202960501 . 2015  1     .
    20160102923401 2 2017  3     .
    20160702936902 . 2017  7     .
    20200902755101 . 2020 12     .
    20130402868201 . 2014  6     .
    20150402963207 . 2015  7     .
    20120502900401 . 2013  5     .
    20121102958502 . 2012 11     .
    20170402930901 . 2018  5     .
    20221102774202 . 2022 11     .
    20140402874903 . 2015  7     .
    20130302893002 2 2013  3     .
    20190302840601 1 2019  3 34554
    20170202937501 2 2017  3     .
    20180602851502 . 2019  6     .
    20120302933602 . 2012  6     .
    20190702784703 . 2020  9     .
    20120302899401 . 2012  5     .
    20161002920102 . 2016 11     .
    20110402836703 . 2012  7     .
    20130202966804 2 2013  3     .
    20120402973901 . 2012  7     .
    20130102963102 2 2013  3     .
    20150702893702 . 2015  7     .
    20140402856904 . 2015  7     .
    20141202918801 . 2015  2     .
    20120902967402 . 2012  9     .
    20170802872202 . 2017  8     .
    20170202915001 . 2018  4     .
    20101201698802 . 2012  2     .
    20181102879801 . 2020  2     .
    20130102886401 2 2014  3     .
    20161002929401 . 2017 12     .
    20190102853301 . 2020  4     .
    20180502831901 . 2018  5     .
    20151002908202 . 2015 12     .
    20140902915102 . 2014 12     .
    20130702834303 . 2013 10     .
    20200302783301 . 2021  5     .
    20140402874904 . 2014  7     .
    20170502934801 . 2018  7     .
    20140602951301 . 2015  6     .
    20160602881701 . 2016  9     .
    20130602934801 . 2014  9     .
    20171002835002 . 2018 12     .
    20151002958903 . 2015 10     .
    20140202875004 . 2015  5     .
    20110102906401 . 2012  1     .
    20141202997902 . 2016  2     .
    20110902869601 . 2012 12     .
    20110902940703 . 2012 12     .
    20120702929304 . 2012  9     .
    20170902890202 . 2018 12     .
    20171102845804 . 2017 11     .
    20210702736101 . 2021  7     .
    20120202938902 . 2013  4     .
    20110302909702 1 2012  3   700
    20150402921701 . 2015  6     .
    20131102837001 . 2013 12     .
    20210602716702 . 2022  9     .
    20130102975001 2 2013  3     .
    20210902686001 . 2022 11     .
    20210202789702 . 2021  5     .
    20170802848002 . 2018 10     .
    20171202892601 1 2018  3 54000
    20161002977002 . 2017 10     .
    end
    label values asecflag asecflag_lbl
    label def asecflag_lbl 1 "ASEC", modify
    label def asecflag_lbl 2 "March Basic", modify
    label values month month_lbl
    label def month_lbl 1 "January", modify
    label def month_lbl 2 "February", modify
    label def month_lbl 3 "March", modify
    label def month_lbl 4 "April", modify
    label def month_lbl 5 "May", modify
    label def month_lbl 6 "June", modify
    label def month_lbl 7 "July", modify
    label def month_lbl 8 "August", modify
    label def month_lbl 9 "September", modify
    label def month_lbl 10 "October", modify
    label def month_lbl 11 "November", modify
    label def month_lbl 12 "December", modify

    asecflag is an indicator variable where:
    . (Missing value) - Month-level data
    1 - ASEC data (Annual-level data)
    2- March Basic (March monthly-level data)

    Values for month variable refers to months is the order of the calendar i.e., 1-12 = Jan-Dec

    Is there a good solution for this task?


  • #2
    Your example data leaves the impression that for any given cpsidp, there is only a single observation. So, even where there is a non-missing value in totinc, there are no monthly observations for that person in which to populate a monthly income. So I'm going to assume that this example is not representative of the data set as a whole, but rather that foreach person there is, in fact, a series of monthly observations in the data set--they just aren't showing up in this example. If that's right, and if the monthly income value you want is 1/12th of the total income for that person in that year, you can do this:

    Code:
    by cpsidp year (inctot), sort: assert inlist(inctot, inctot[1], .)
    by cpsidp year (inctot): gen monthly_income = inctot[1]/12
    The first command verifies that for any cpsidp in any year there is only one value of inctot reported--if it is reported more than once in the year, it is the same value each time. This precondition is necessary for it to make any sense to impute a monthly income equal to 1/12th of reported inctot. The second command does the actual calculation and populates a monthly_income variable throughout the year. The key underlying the logic is that when the observations are sorted by cpsidp and year, and inctot within that, because missing values in Stata are greater than any non-missing value, the month that contains the reported value of inctot will end up being the first observation.

    Comment


    • #3
      Thank you very much for the help. This code works but it also generates about 5M missing values in a dataset of 8M so it does not leave me much to work with. Is there any way I can verify that the code does not unnecessarily create missing values (i.e., when values for CPSIDP and inctot are available each year?)

      Comment


      • #4
        Your data extract has exactly one observation per individual cpsidp. Could you provide a better extract with all observations for perhaps a couple cpsidp values, and for each, say a couple years of data?

        Comment


        • #5
          Is there any way I can verify that the code does not unnecessarily create missing values (i.e., when values for CPSIDP and inctot are available each year?
          Code:
          by cpsidp year, sort: egen has_inc_tot = max(!missing(inctot))
          assert !missing(monthly_income) if has_inc_tot
          This will check whether every combination of cpsidp and year that has some non-missing value of inctot has a non-missing monthly_income variable populated in all of that cpsidp year combination's observations. So if all is correct, the -assert- command will produce no output. If something has been missed, it will give an -assertion is false- error message.

          Comment

          Working...
          X