Regression when there are many observations with the same values of variables of interest

Vasisualiy Lokhankin

Join Date: Jan 2017

Posts: 5
#1

12 Jan 2017, 01:39

Suppose I have two variables (say, y and x) that take the same values whenever a third variable (say, z) takes the same value. For example, my observations are individuals in a country, x and y are characteristics of the city an individual lives in, and z identifies the city.

I want to regress y on x (so I care about cities, not individuals), but if I simply ask Stata to do it, the standard errors will be calculated incorrectly (there are many individuals in each city, and the error terms will be correlated). Is there a simple, standard way to treat all observations with the same z as one when running the regression? So to speak, to make Stata use the equivalence classes as the sample?

Thanks.
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#2

12 Jan 2017, 01:55

Not sure that I follow you, but you might be looking for something like

Code:

egen byte first = tag(z)
regress y c.x if first
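
To make this concrete, here is a minimal sketch on made-up data (the seed, the 10 cities, and the 5 individuals per city are assumptions for illustration, not taken from the question). It runs the regression on all rows and then on the tagged rows only. Because each city's row is simply repeated, the coefficient should be the same in both runs, while the standard error from the all-rows regression is too small.

```stata
* Made-up data: 10 cities (z), 5 individuals per city; x and y are city-level
clear
set seed 123456
set obs 10
generate z = _n
generate x = rnormal()
generate y = 2 + x + rnormal(0, 0.5)
expand 5

* Naive regression: every city is counted 5 times
regress y x
scalar b_all  = _b[x]
scalar se_all = _se[x]

* One tagged row per city
egen byte first = tag(z)
regress y x if first
scalar b_city  = _b[x]
scalar se_city = _se[x]

display "coefficient:    all rows = " b_all  "   one per city = " b_city
display "standard error: all rows = " se_all "   one per city = " se_city
```

With exact 5-fold duplication of 10 cities, the ratio of the two standard errors works out to sqrt(8/48), roughly 0.41, because only the residual degrees of freedom change (48 instead of 8).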
Vasisualiy Lokhankin

Join Date: Jan 2017

Posts: 5
#3

12 Jan 2017, 02:05

Originally posted by Joseph Coveney

Not sure that I follow you, but you might be looking for something like

Code:

egen byte first = tag(z)
regress y c.x if first

Yes, that seems to be it. Thank you!

Maarten Buis

Join Date: Mar 2014
Posts: 3467

12 Jan 2017, 02:11

Code:

//=============================== create some example data
clear
set seed 123456

set obs 10 // 10 cities
gen z = _n
gen x = rnormal()
gen y = 2 + 1*x + rnormal(0,.5)

expand 5 // 5 observations in each city
bys z : gen id = _n

// missing data can always happen so we
// want a solution that is robust to that
replace x = . in 2
replace y = . in 3


//============================== solution
gen byte miss = missing(y,x)
list in 1/10, sepby(z)

bys z (miss) : gen byte mark = (_n == 1)
list in 1/10, sepby(z)

reg y x if mark == 1

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


daniel klein

Join Date: Mar 2014

Posts: 3885
#5

12 Jan 2017, 02:53

You can view this as individuals nested in cities. I would set this up as a panel dataset, then use a between estimator.

Code:

xtset z id
xtreg y x, be
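
For intuition, the between estimator is OLS on the city means. The following is a rough sketch on made-up data (the seed and the sample sizes are assumptions for illustration); it fits the between estimator on the individual-level data and then reruns the regression on one row of means per city. With a balanced panel like this one, the two coefficients should agree.

```stata
* Made-up data: 10 cities (z), 5 individuals (id) per city
clear
set seed 123456
set obs 10
generate z = _n
generate x = rnormal()
generate y = 2 + x + rnormal(0, 0.5)
expand 5
bysort z: generate id = _n

* Between estimator on the individual-level data
xtset z id
xtreg y x, be
scalar b_be  = _b[x]
scalar se_be = _se[x]

* The same regression, computed by hand on the city means
collapse (mean) y x, by(z)
regress y x

display "coefficient:    between = " b_be  "   city means = " _b[x]
display "standard error: between = " se_be "   city means = " _se[x]
```

Note that collapse replaces the data in memory, so wrap it in preserve and restore if the individual-level data are still needed afterwards.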

Best
Daniel
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#6

12 Jan 2017, 02:59

Just curious: is it better to omit all missing data from the marked-out selection or is it better to have missing data represented in the marked-out selection in roughly the same proportion as in the entire dataset? Does the mechanism of missingness make a difference here?
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#7

12 Jan 2017, 03:07

Originally posted by daniel klein

You can view this as individuals nested in cities. I would set this up as a panel dataset, then use a between estimator.

Code:

xtset z id
xtreg y x, be

Best
Daniel

The original post was a bit ambiguous, but I read it as meaning that y and x, as characteristics of the city, don't vary within a city.
Edited-to-add: never mind; it does the same thing.
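
If there is any doubt about whether y and x really are constant within a city, it can be checked before choosing between the approaches. A minimal sketch on made-up data follows (the variable names varies_x and varies_y are hypothetical, as are the 3 cities and 4 individuals per city):

```stata
* Made-up data: 3 cities, 4 individuals each; x and y fixed within city
clear
set obs 3
generate z = _n
generate x = 10 * z
generate y = 1 + x
expand 4

* Within each city, sort on the variable and compare first and last value;
* they differ only if the variable takes more than one value in that city
foreach v of varlist x y {
    bysort z (`v'): generate byte varies_`v' = (`v'[1] != `v'[_N])
}

* Number of rows that belong to a city with varying x or y
count if varies_x | varies_y
```

A count of zero means both variables are constant within every city. Missing values sort last, so a city where only some rows are missing will be flagged as varying.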

Last edited by Joseph Coveney; 12 Jan 2017, 03:10.