Dear all,
I am trying to predict individuals’ income at age 38 using 18 years' worth of data on people born in 1978, 1979, and 1980. The data cover many variables, such as family income, location, and family members’ education (there are at least 50 variables like these). The goal is to see which stage of childhood best predicts labor market income at age 38. As one can infer from the problem, my outcome variable (income at age 38) does not vary over time, but it does vary across individuals and birth cohorts. The covariates may or may not vary over time, across birth cohorts, and across individuals. I believe I cannot use panel models because the outcome variable is time-invariant. How should I think about modeling this? It will be a prediction exercise. Should I run a regression? Are there other ML models I could use?
A small subset of the dataset looks like this (there are many more variables than the example shows). The dependent variable is labinc38, the labor income of an individual at age 38.
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id int year float(birthyear labinc38) byte sex float(nc_mother birthweight) byte mar_mo_birth float(ordertofather nc_father) byte hhomeowner_ float nrooms_
7182 1978 1978 0 2 99 999 9 99 99 5 4
7182 1979 1978 0 2 99 999 9 99 99 5 4
7182 1980 1978 0 2 99 999 9 99 99 5 9
7182 1981 1978 0 2 99 999 9 99 99 5 6
7182 1982 1978 0 2 99 999 9 99 99 5 8
7182 1983 1978 0 2 99 999 9 99 99 5 8
7182 1984 1978 0 2 99 999 9 99 99 5 8
7182 1985 1978 0 2 99 999 9 99 99 5 8
7182 1986 1978 0 2 99 999 9 99 99 5 8
7182 1987 1978 0 2 99 999 9 99 99 5 8
7182 1988 1978 0 2 99 999 9 99 99 5 8
7182 1989 1978 0 2 99 999 9 99 99 5 8
7182 1990 1978 0 2 99 999 9 99 99 5 8
7182 1991 1978 0 2 99 999 9 99 99 5 8
7182 1992 1978 0 2 99 999 9 99 99 5 7
7182 1993 1978 0 2 99 999 9 99 99 5 7
7182 1994 1978 0 2 99 999 9 99 99 5 8
7182 1995 1978 0 2 99 999 9 99 99 5 4
7182 1996 1978 0 2 99 999 9 99 99 5 4
7182 1997 1978 0 2 99 999 9 99 99 5 8
7182 1999 1978 0 2 99 999 9 99 99 5 2
18209 1978 1978 100000 1 99 999 9 99 99 5 4
18209 1979 1978 100000 1 99 999 9 99 99 5 4
18209 1980 1978 100000 1 99 999 9 99 99 1 6
18209 1981 1978 100000 1 99 999 9 99 99 1 5
18209 1982 1978 100000 1 99 999 9 99 99 1 6
18209 1983 1978 100000 1 99 999 9 99 99 1 7
18209 1984 1978 100000 1 99 999 9 99 99 1 6
18209 1985 1978 100000 1 99 999 9 99 99 1 6
18209 1986 1978 100000 1 99 999 9 99 99 1 6
18209 1987 1978 100000 1 99 999 9 99 99 1 6
18209 1988 1978 100000 1 99 999 9 99 99 8 6
18209 1989 1978 100000 1 99 999 9 99 99 8 4
18209 1990 1978 100000 1 99 999 9 99 99 8 2
18209 1991 1978 100000 1 99 999 9 99 99 5 6
18209 1992 1978 100000 1 99 999 9 99 99 5 6
18209 1993 1978 100000 1 99 999 9 99 99 5 5
18209 1994 1978 100000 1 99 999 9 99 99 5 5
18209 1995 1978 100000 1 99 999 9 99 99 8 5
18209 1996 1978 100000 1 99 999 9 99 99 8 99
18209 1997 1978 100000 1 99 999 9 99 99 5 4
18209 1999 1978 100000 1 99 999 9 99 99 5 5
41030 1978 1978 57800 2 2 998 1 1 2 1 4
41030 1979 1978 57800 2 2 998 1 1 2 1 4
41030 1980 1978 57800 2 2 998 1 1 2 1 4
41030 1981 1978 57800 2 2 998 1 1 2 1 4
41030 1982 1978 57800 2 2 998 1 1 2 1 4
41030 1983 1978 57800 2 2 998 1 1 2 1 4
41030 1984 1978 57800 2 2 998 1 1 2 1 4
41030 1985 1978 57800 2 2 998 1 1 2 5 2
41030 1986 1978 57800 2 2 998 1 1 2 . .
41030 1987 1978 57800 2 2 998 1 1 2 . .
41030 1988 1978 57800 2 2 998 1 1 2 . .
41030 1989 1978 57800 2 2 998 1 1 2 . .
41030 1990 1978 57800 2 2 998 1 1 2 . .
41030 1991 1978 57800 2 2 998 1 1 2 . .
41030 1992 1978 57800 2 2 998 1 1 2 . .
41030 1993 1978 57800 2 2 998 1 1 2 5 5
41030 1994 1978 57800 2 2 998 1 1 2 5 5
41030 1995 1978 57800 2 2 998 1 1 2 5 5
41030 1996 1978 57800 2 2 998 1 1 2 5 5
41030 1997 1978 57800 2 2 998 1 1 2 5 4
41030 1999 1978 57800 2 2 998 1 1 2 5 3
41183 1978 0 66500 2 2 998 2 1 2 1 7
41183 1979 0 66500 2 2 998 2 1 2 . .
41183 1980 0 66500 2 2 998 2 1 2 . .
41183 1981 0 66500 2 2 998 2 1 2 . .
41183 1982 0 66500 2 2 998 2 1 2 . .
41183 1983 0 66500 2 2 998 2 1 2 . .
41183 1984 0 66500 2 2 998 2 1 2 . .
41183 1985 0 66500 2 2 998 2 1 2 . .
41183 1986 0 66500 2 2 998 2 1 2 . .
41183 1987 0 66500 2 2 998 2 1 2 . .
41183 1988 0 66500 2 2 998 2 1 2 . .
41183 1989 0 66500 2 2 998 2 1 2 . .
41183 1990 0 66500 2 2 998 2 1 2 . .
41183 1991 0 66500 2 2 998 2 1 2 . .
41183 1992 0 66500 2 2 998 2 1 2 . .
41183 1993 0 66500 2 2 998 2 1 2 1 5
41183 1994 0 66500 2 2 998 2 1 2 5 5
41183 1995 0 66500 2 2 998 2 1 2 1 5
41183 1996 0 66500 2 2 998 2 1 2 1 5
41183 1997 0 66500 2 2 998 2 1 2 1 5
41183 1999 0 66500 2 2 998 2 1 2 1 7
41197 1978 1978 116000 2 99 999 9 99 99 1 7
41197 1979 1978 116000 2 99 999 9 99 99 . .
41197 1980 1978 116000 2 99 999 9 99 99 . .
41197 1981 1978 116000 2 99 999 9 99 99 . .
41197 1982 1978 116000 2 99 999 9 99 99 . .
41197 1983 1978 116000 2 99 999 9 99 99 . .
41197 1984 1978 116000 2 99 999 9 99 99 . .
41197 1985 1978 116000 2 99 999 9 99 99 . .
41197 1986 1978 116000 2 99 999 9 99 99 . .
41197 1987 1978 116000 2 99 999 9 99 99 . .
41197 1988 1978 116000 2 99 999 9 99 99 . .
41197 1989 1978 116000 2 99 999 9 99 99 . .
41197 1990 1978 116000 2 99 999 9 99 99 . .
41197 1991 1978 116000 2 99 999 9 99 99 . .
41197 1992 1978 116000 2 99 999 9 99 99 . .
41197 1993 1978 116000 2 99 999 9 99 99 1 5
end
label values sex ER32000L
label def ER32000L 1 "Male", modify
label def ER32000L 2 "Female", modify
label values nc_mother ER32012L
label def ER32012L 2 "Total number of children", modify
label def ER32012L 99 "No information was gathered in 1983-1984 or 1985-2019 about identity of mother (ER32009=0 or ER32010=999); values for ER32009-ER32010 obtained only from 1983 and 1984 coding (ER32009=>0 and ER32011=0)", modify
label values mar_mo_birth ER32015L
label def ER32015L 1 "Married", modify
label def ER32015L 2 "Never married", modify
label def ER32015L 9 "No information was gathered in 1983-1984 or 1985-2019 about identity of mother (ER32009=0 or ER32010=999); values for ER32009-ER32010 obtained only from 1983 and 1984 coding (ER32009=>0 and ER32011=0)", modify
label values ordertofather ER32020L
label def ER32020L 1 "All births to father of this individual are mentioned, and none of the birthdates contain missing data. Twins or other multiple births were randomly assigned consecutive codes.", modify
label def ER32020L 99 "No information was gathered in 1985-2019 about identity of father (ER32016=0)", modify
label values nc_father ER32019L
label def ER32019L 2 "Total number children", modify
label def ER32019L 99 "No information was gathered in 1985-2019 about identity of father (ER32016=0)", modify
label values hhomeowner_ ER13040L
label def ER13040L 1 "Owns or is buying home, either fully or jointly; mobile home owners who rent lots are included here", modify
label def ER13040L 5 "Pays rent", modify
label def ER13040L 8 "Neither owns nor rents", modify
label values nrooms_ ER13037L
label def ER13037L 2 "Actual number", modify
label def ER13037L 3 "Actual number", modify
label def ER13037L 4 "Actual number", modify
label def ER13037L 5 "Actual number", modify
label def ER13037L 6 "Actual number", modify
label def ER13037L 7 "Actual number", modify
label def ER13037L 8 "Actual number", modify
label def ER13037L 9 "Actual number", modify
label def ER13037L 99 "NA; refused", modify
[/CODE]
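To make the structure concrete, here is a rough sketch of the kind of setup I have in mind: collapse the yearly records into childhood-stage averages so that there is one row per person, then fit a cross-sectional model. This is only an illustration under my own assumptions. The stage cutoffs (ages 0-5, 6-11, 12-17), the helper variables age, stage, and owner, and the treatment of birthyear == 0 as missing are not from the data documentation; recoding nrooms_ == 99 to missing follows its "NA; refused" label.

```stata
* Sketch only: stage cutoffs and helper variable names are assumptions
* Recode the "NA; refused" sentinel to missing before averaging
mvdecode nrooms_, mv(99)
* Assumption: birthyear == 0 means the birth year is unknown
replace birthyear = . if birthyear == 0

* Age of the child in each survey year
gen age = year - birthyear

* Hypothetical childhood stages: 1 = ages 0-5, 2 = ages 6-11, 3 = ages 12-17
gen byte stage = 1 if inrange(age, 0, 5)
replace stage = 2 if inrange(age, 6, 11)
replace stage = 3 if inrange(age, 12, 17)
drop if missing(stage)

* Indicator for home ownership (code 1 in hhomeowner_)
gen byte owner = hhomeowner_ == 1 if !missing(hhomeowner_)

* One row per person and stage, then one row per person
collapse (mean) nrooms_ owner (first) labinc38 sex birthyear, by(id stage)
reshape wide nrooms_ owner, i(id) j(stage)

* Cross-sectional regression of the time-invariant outcome on stage-specific averages
regress labinc38 nrooms_1 nrooms_2 nrooms_3 owner1 owner2 owner3 i.sex i.birthyear, vce(robust)
```

With the full set of 50 or more variables, the regress line would presumably be replaced by whatever prediction method is recommended; the reshaping step is the part I am least sure is the right way to frame the problem.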
Thank you for your time and consideration.
Best,
Aman
