Assigning state-level values to individual observations by year

Josue Ortega

Join Date: Nov 2020

Posts: 14
#1

Assigning state-level values to individual observations by year

19 Nov 2020, 10:57

Hi, I am working with a large Stata dataset that contains individual-level information (file1.dta), for example
Name State Salary Year Height

John Nebraska 35000 2016 178

Mary Texas 42000 2014 157

I would like to merge this dataset with a state-level file (file2.xls) that contains data on GDP, population, and so on for each state per year. The second file looks like this:
State GDP 2014 GDP 2016 Population 2014 Population 2016

Nebraska 20 25 100 150

Texas 30 35 200 300

The final file should look as follows:
Name State Salary Year Height GDP Population

John Nebraska 35000 2016 178 25 150

Mary Texas 42000 2014 157 30 200

I wonder what is the easiest way to achieve this. Any suggestion would be most welcomed. The individual data file has 42 million observation, so a manual approach is unfeasible.

Thanks.
Tags: lookup, merge, search
Ali Atia

Join Date: May 2020

Posts: 737
#2

19 Nov 2020, 11:15

You first need to reshape file2.xls to long format. One way to do this is to import into Stata using -import excel-. Assuming the variable names are in this format: gdp2014, population2014, etc., after you import them, you can reshape the data using this code:

Code:

reshape long gdp population,i(state) j(year)

You can merge the two datasets once you have file2.dta in long format. In file1:

Code:

merge 1:1 state year using "file2.dta"
1 like
Comment
Josue Ortega

Join Date: Nov 2020

Posts: 14
#3

19 Nov 2020, 11:25

Many thanks for your kind reply Ali. I am a bit lost on how to do the first step. I tried and I got the following error:

Code:

variable state not found

I am not sure what should I be writing instead of i(state). Apologies for the confusion. Thanks.
Comment
Ali Atia

Join Date: May 2020

Posts: 737
#4

19 Nov 2020, 11:27

When you imported file2.xls, you may have kept the variable name as State, instead of state. In that case, you can use

Code:

rename State state

to rename it.
1 like
Comment
Josue Ortega

Join Date: Nov 2020

Posts: 14
#5

20 Nov 2020, 06:07

Many thanks for your help so far. I still get the error "Variable State not found", even though both files have a State variable which seems ok. I attach three screenshots:

Attached Files
Comment
Josue Ortega

Join Date: Nov 2020

Posts: 14
#6

20 Nov 2020, 06:11

See how the variable State is defined in both data sets.
Comment
Ali Atia

Join Date: May 2020

Posts: 737
#7

20 Nov 2020, 07:17

My only thought is that in the last screenshot you seem to be trying to merge using "broadband.dta," but when you sum the State variable you're using "broadbandlong.dta". Perhaps that has something to do with it?

Last edited by Ali Atia; 20 Nov 2020, 07:20.
1 like
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#8

20 Nov 2020, 07:41

I cannot read your screenshots (please read the FAQ) but I notice that in your command you type "state" in lower case, while in #6 you type "State" - Stata will treat these as different
Comment
Josue Ortega

Join Date: Nov 2020

Posts: 14
#9

20 Nov 2020, 10:22

Thanks Ali, again most helpful. I have a final question that is somewhat independent of this. I managed to save the file with individual-level variables and state=level variables. I have millions of observations for each US state for around 30 years. Some variables are the same for all people in the state for a year, such as temperature.

I want to run now a regression where the unit of observation is the temperature at each year and at each state, is explained by variables that are variant across state but not time (say dimension of state) and some variables that are varying across state and years (say population density). How would you run such regression? Ideally it would be something like (here is a very naive approach):

reg temperature dimensionstate population density

Thanks for your helpful comments.
Comment
Josue Ortega

Join Date: Nov 2020

Posts: 14
#10

20 Nov 2020, 10:22

Thanks Ali, again most helpful. I have a final question that is somewhat independent of this. I managed to save the file with individual-level variables and state=level variables. I have millions of observations for each US state for around 30 years. Some variables are the same for all people in the state for a year, such as temperature.

I want to run now a regression where the unit of observation is the temperature at each year and at each state, is explained by variables that are variant across state but not time (say dimension of state) and some variables that are varying across state and years (say population density). How would you run such regression? Ideally, it would be something like (here is a very naive approach):

reg temperature dimensionstate population density

Thanks for your helpful comments.
Comment

Name	State	Salary	Year	Height
John	Nebraska	35000	2016	178
Mary	Texas	42000	2014	157

State	GDP 2014	GDP 2016	Population 2014	Population 2016
Nebraska	20	25	100	150
Texas	30	35	200	300

Name	State	Salary	Year	Height	GDP	Population
John	Nebraska	35000	2016	178	25	150
Mary	Texas	42000	2014	157	30	200

Announcement

Assigning state-level values to individual observations by year

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment