  • Issues with merging and matching child - parent income data (intergenerational income mobility)

    I will first explain my overall objective and then describe my current situation, so that my issue is as clear as possible. This is my first post here, so forgive me if I make any mistakes. I have read all the FAQs and as many online resources as I could find, but I feel my problem is quite specific.

    I am an undergraduate student, so my Stata skills are nowhere near expert level. For my thesis, I am exploring intergenerational income mobility. The basic regression I am running is log child income (as an adult) on log parent income. It looks like this:

    log(Y1_{i,t}) = α + β log(Y0_{i,t-1}) + ε_{i,t}

    This gives me β, which shows how strongly a parent's income is associated with their child's income once the child has reached adulthood. I will of course add further variables such as education, but that is beside the fundamental issue I'm facing.
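
    As a rough illustration of the regression described above, here is a minimal sketch on toy data. The variable names (child_inc, parent_inc, ln_child, ln_parent) and all values are hypothetical and are not taken from the PSID; it assumes each row already holds one child's income next to the matching parent's income.

    ```stata
    * Toy data only: names and values are made up for illustration.
    clear
    input float(child_inc parent_inc)
    21000 30000
    15000 18000
    40000 52000
        0 25000
    32000 41000
    27000 29000
    end

    * Logs are undefined for zero income, so those rows are left missing
    * and are dropped from the estimation sample automatically.
    generate ln_child  = ln(child_inc)  if child_inc  > 0
    generate ln_parent = ln(parent_inc) if parent_inc > 0

    * The coefficient on ln_parent is the estimate of beta.
    regress ln_child ln_parent
    display "Estimated beta: " _b[ln_parent]
    ```

    How to treat zero incomes (drop them, as here, or handle them some other way) is a modelling choice that this sketch does not settle.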

    The dataset I am using (which I'm sure some of you are familiar with) is the PSID (Panel Study of Income Dynamics), which has approximately 40 waves of data starting in 1968.

    The PSID has a particular dataset named the FIMS (Family Identification Mapping System) which allows individuals to be matched with their mother and father using a personal identification variable which each person in the dataset is assigned. I have labelled this variable “PID68”.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(PID68 FID68 MID68)
    2908    . 2002
    2909    . 2002
    2910    . 2002
    2911 2171    .
    4001 4906 4907
    4002 4908 4909
    4003 4001 4002
    4004 4001 4002
    4005 4001 4002
    4006 4001 4002
    end
    This is the FIMS data. As shown, each individual is matched with their mother (MID68) and/or father (FID68).

    I then merged the FIMS data with the child income data via the PID68 unique identifier. This of course reduced the number of observations since only a portion of individuals have parents recorded within the panel study. I used the following code:

    Code:
    merge 1:m PID68 using /Users/jakehumphreys/Desktop/age_income_68-17/C_age_income/C_age_income_cleaned.dta
    
    keep if _merge==3
    This provided the following merged dataset.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(Y_C_72 Y_C_73) float(PID68 FID68 MID68) byte _merge
        0    0 2171    . 2002 3
        0    0 2172    . 2002 3
    11140 9100 4001 4906 4907 3
        0    0 4002 4908 4909 3
     3300 5200 4003 4001 4002 3
     1932 4200 4004 4001 4002 3
      425 3000 4005 4001 4002 3
    end
    label values _merge _merge
    label def _merge 3 "matched (3)", modify
    Here Y_C_72 is the income of the individual (child) in 1972, Y_C_73 is the income of the individual (child) in 1973, and so on.

    I then merged this merged dataset with the “father” Dataset, again via the PID68 unique identifier.

    Code:
    merge 1:m PID68 using /Users/jakehumphreys/Desktop/age_income_68-17/C_Income+FIMS.dta, generate(_merge2)
    I put “father” in quotes because the data is not specific to fathers; it includes all personal identification numbers.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(Y_F_72 Y_F_73) float(PID68 FID68 MID68) long(Y_C_72 Y_C_73) byte(_merge _merge2)
        0    0 1001    .    .     .    . . 1
     4594 5179 1002    .    .     .    . . 1
     5762 8000 1003    .    .     .    . . 1
     7694 7000 1004    .    .     .    . . 1
        0    0 1030    .    .     .    . . 1
        0    0 2001    .    .     .    . . 1
        0    0 2002    .    .     .    . . 1
        0    0 2170    .    .     .    . . 1
        0    0 2171    . 2002     0    0 3 3
        0    0 2172    . 2002     0    0 3 3
        0    0 2173    .    .     .    . . 1
        0    0 3001    .    .     .    . . 1
        0    0 3002    .    .     .    . . 1
    11140 9100 4001 4906 4907 11140 9100 3 3
        0    0 4002 4908 4909     0    0 3 3
     3300 5200 4003 4001 4002  3300 5200 3 3
     1932 4200 4004 4001 4002  1932 4200 3 3
      425 3000 4005 4001 4002   425 3000 3 3
        0    0 4006 4001 4002     0    0 3 3
        0    0 4007 4001 4002     0    0 3 3
    end
    label values _merge _merge
    label values _merge2 _merge
    label def _merge 3 "matched (3)", modify
    label def _merge 1 "master only (1)", modify
    This is where the major issue lies. Because I merged on the unique identifier PID68, each row is matched only to the same person's own record, so the incomes of the father and the son (for instance) do not end up on the same row, as shown in the table above. This is probably my biggest problem. It has been causing me trouble for a while, and I'm starting to panic since it's fundamental to my research objective.

    I ideally need to be able to perform the regression described above. I initially tried to solve this by merging the “father” file via FID68, but of course this was not possible since this does not uniquely identify individuals.
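
    For what it is worth, a many-to-one merge is one way around the uniqueness problem just described: FID68 does not have to be unique in the child data as long as the ID is unique in the file being merged in. The following is only a sketch on toy data. The unprefixed names Y_72 and Y_73 and the tempfile `father' are hypothetical, and it assumes the income file has exactly one row per person.

    ```stata
    * Run as a do-file (it uses a tempfile and a local macro).
    clear all

    * Hypothetical income file: one row per person, keyed by PID68.
    input float PID68 long(Y_72 Y_73)
    4001 11140 9100
    4002     0    0
    end

    * Re-key the file by father ID and mark the income variables as the father's.
    rename PID68 FID68
    rename (Y_72 Y_73) (Y_F_72 Y_F_73)
    tempfile father
    save `father'

    * Child-level data: several children can share one FID68.
    clear
    input float(PID68 FID68 MID68) long(Y_C_72 Y_C_73)
    4003 4001 4002 3300 5200
    4004 4001 4002 1932 4200
    4005 4001 4002  425 3000
    end

    * m:1 = many children per father; each father appears once in the using file.
    merge m:1 FID68 using `father', keep(master match) nogenerate
    list PID68 FID68 Y_C_72 Y_F_72
    ```

    The same steps with MID68 and a Y_M_ prefix would attach the mother's income to each child's row.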

    I saw that the suggestion below may be a solution to the issue:

    “I recommended append (tacking new data to the bottom of your data set). Then you can ask Stata to create your variables "mother's income" and "father's income" using something like the following code:”

    Code:
    fathers_income = [income variable]if PID68 == FID
    I am unsure if the above would work?

    I also saw the possibility of creating a loop variable - similar to above?

    The primary issue (as I'm sure is clear by now) is that I need to match the average income of each father to that of his son at the individual level, so that I can estimate the panel-wide association between father and son incomes described in my regression and objective. Mothers would also be included; the father-son pairing is just for ease of illustration.

    I also wanted to check that the command

    Code:
    egen Y_mean = rowmean(Y_C_68… )
    would be the correct way to average individual income of the child - I thought this would save me having to have lots of separate income variables.
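
    To make the averaging step concrete, here is a small sketch of how egen's rowmean() behaves on toy data. Only two income years are shown, the values are made up, and the names Y_C_mean and ln_Y_C are hypothetical.

    ```stata
    * Toy data: two years of child income, with one missing value.
    clear
    input long(Y_C_72 Y_C_73)
    11140 9100
     3300 5200
      425    .
    end

    * rowmean() averages across the listed variables within each row,
    * ignoring missing values (the third row's mean uses 1972 only).
    egen Y_C_mean = rowmean(Y_C_72 Y_C_73)

    * Take the log of the average only where it is positive.
    generate ln_Y_C = ln(Y_C_mean) if Y_C_mean > 0
    list
    ```

    Note that recorded zeros count towards the average while missing values do not, so it may be worth checking whether a zero in the data means zero income or no record for that year.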

    And of course, thank you very much for taking the time out of your day to help me. I sincerely appreciate it.
    Last edited by Jake Humphreys; 26 Mar 2020, 16:42.

  • #2
    If you are running version 16, the easiest way to get the data organization you seek is to use frames. Start with the FIMS data in a frame. Then use the PID to link to a different frame containing the income data and -frget- the child's income variables. Then drop that link and create a new link using the FID variable, and -frget- the father's income variables. Finally, drop that link and make a new link using the MID variable, and -frget- the mother's income variables. Then you're done. -help frame create-, -help frlink-, -help frget-.
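
    A minimal sketch of those steps on toy data might look like the following. It assumes Stata 16 or later. The frame name income, the link variable lnk, the unprefixed income names Y_72 and Y_73, and the C_/F_/M_ prefixes are all hypothetical choices for illustration.

    ```stata
    clear all

    * Hypothetical income frame: one row per person, keyed by PID68.
    frame create income
    frame change income
    input float PID68 long(Y_72 Y_73)
    4001 11140 9100
    4002     0    0
    4003  3300 5200
    4004  1932 4200
    end

    * FIMS-style data in the default frame.
    frame change default
    input float(PID68 FID68 MID68)
    4003 4001 4002
    4004 4001 4002
    end

    * Child's own income: link on PID68 in both frames.
    frlink m:1 PID68, frame(income) generate(lnk)
    frget Y_72 Y_73, from(lnk) prefix(C_)
    drop lnk

    * Father's income: match FID68 here to PID68 in the income frame.
    frlink m:1 FID68, frame(income PID68) generate(lnk)
    frget Y_72 Y_73, from(lnk) prefix(F_)
    drop lnk

    * Mother's income: match MID68 here to PID68 in the income frame.
    frlink m:1 MID68, frame(income PID68) generate(lnk)
    frget Y_72 Y_73, from(lnk) prefix(M_)
    drop lnk

    list PID68 C_Y_72 F_Y_72 M_Y_72
    ```

    In this toy example each child's row ends up with their own, their father's and their mother's income side by side; children whose parent is not in the income frame simply get missing values.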
