
  • Regression when there are many observations with the same values of variables of interest

    Suppose I have two variables, y and x, that take the same values whenever a third variable, z, takes the same value. For example, I have observations on individuals in a country, x and y are characteristics of the city an individual lives in, and z identifies the city.

    I want to regress y on x (so I care about cities, not individuals). If I simply run the regression on the individual-level data, the standard errors will be wrong: there are many individuals in each city, so the error terms will be correlated within cities. Is there a simple, standard way to treat all observations with the same z as a single observation when running the regression? In other words, can Stata use the equivalence classes as the sample?

    Thanks.

  • #2
    Not sure that I follow you, but you might be looking for something like
    Code:
    egen byte first = tag(z)
    regress y c.x if first



    • #3
      Originally posted by Joseph Coveney
      Not sure that I follow you, but you might be looking for something like
      Code:
      egen byte first = tag(z)
      regress y c.x if first
      Yes, that seems to be it. Thank you!



      • #4
        Code:
        //=============================== create some example data
        clear
        set seed 123456
        
        set obs 10 // 10 cities
        gen z = _n
        gen x = rnormal()
        gen y = 2 + 1*x + rnormal(0,.5)
        
        expand 5 // 5 observations in each city
        bys z : gen id = _n
        
        // missing data can always happen so we
        // want a solution that is robust to that
        replace x = . in 2
        replace y = . in 3
        
        
        //============================== solution
        gen byte miss = missing(y,x)
        list in 1/10, sepby(z)
        
        bys z (miss) : gen byte mark = (_n == 1)
        list in 1/10, sepby(z)
        
        reg y x if mark == 1
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          You can view this as individuals nested in cities. I would set this up as a panel dataset, then use a between estimator.

          Code:
          xtset z id
          xtreg y x , be
          Best
          Daniel



          • #6
            Just curious: is it better to omit all missing data from the marked-out selection or is it better to have missing data represented in the marked-out selection in roughly the same proportion as in the entire dataset? Does the mechanism of missingness make a difference here?



            • #7
              Originally posted by daniel klein
              You can view this as individuals nested in cities. I would set this up as a panel dataset, then use a between estimator.

              Code:
              xtset z id
              xtreg y x , be
              Best
              Daniel
              The original post was a bit ambiguous, but I read it as meaning that y and x, as characteristics of the city, don't vary within a city.
              Edited-to-add: never mind; it does the same thing.
              Last edited by Joseph Coveney; 12 Jan 2017, 03:10.
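
              A minimal sketch of why the two approaches agree, assuming (as in the example data from #4, without the missing values) that x and y are constant within each city. The between estimator regresses city means of y on city means of x, and when nothing varies within a city those means are simply the city's own values, so it should reproduce the one-observation-per-city regression. The scalar names b_ols, se_ols, b_be, and se_be are made up for this illustration.

              ```stata
              // Example data as in #4: 10 cities, 5 individuals per city,
              // x and y constant within city (no missing values here)
              clear
              set seed 123456
              set obs 10
              gen z = _n
              gen x = rnormal()
              gen y = 2 + 1*x + rnormal(0, .5)
              expand 5
              bysort z : gen id = _n

              // Approach from #2: keep one tagged observation per city
              egen byte first = tag(z)
              regress y x if first
              scalar b_ols  = _b[x]
              scalar se_ols = _se[x]

              // Approach from #5: between estimator on the city-level panel
              xtset z id
              xtreg y x, be
              scalar b_be  = _b[x]
              scalar se_be = _se[x]

              // Both differences should be zero up to rounding
              display b_ols - b_be
              display se_ols - se_be
              ```

              If x or y did vary within a city, the two commands would no longer be equivalent: the tagged regression would use one arbitrary individual per city, while the between estimator would use the city averages.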

