  • Predicted values after reg by group

    In each of 8,000 schools I'd like to regress y on x1 and x2 and then generate the predicted value of y. The slopes and intercept may be different in each school. This is a little tricky, because as far as I can tell the predict command isn't byable. So I can't do this:
    by school: reg y x1 x2
    by school: predict y_predicted
    This works on small datasets:
    reg y i.school_id##(c.x1 c.x2)
    predict y_predicted
    but gets very slow if the number of schools is large. Any other ideas? I'm wondering if one of the fixed-effects commands would do the trick. Many thanks!
    Paul


  • #2
    Paul:
    what springs to my mind is:
    Code:
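    use "https://www.stata-press.com/data/r17/nlswork.dta", clear
    generate double predict = .
    * loop over the idcode values that actually occur (the ids have gaps)
    levelsof idcode, local(ids)
    foreach i of local ids {
        * skip ids whose regression cannot be estimated (e.g. too few observations)
        capture quietly regress ln_wage age if idcode == `i'
        if _rc continue
        predict double fitted, xb
        replace predict = fitted if idcode == `i'
        drop fitted
    }

    list idcode ln_wage predict if idcode <= 2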
    
           +-------------------------------+
           | idcode    ln_wage     predict |
           |-------------------------------|
        1. |      1   1.451214   1.4478402 |
        2. |      1    1.02862   1.5189514 |
        3. |      1   1.589977   1.5900626 |
        4. |      1   1.780273   1.6611738 |
        5. |      1   1.777012   1.8033962 |
           |-------------------------------|
        6. |      1   1.778681   1.9456186 |
        7. |      1   2.493976   2.0167298 |
        8. |      1   2.551715   2.1589522 |
        9. |      1   2.420261   2.3722858 |
       10. |      1   2.614172   2.5145082 |
           |-------------------------------|
       11. |      1   2.536374   2.6567307 |
       12. |      1   2.462927   2.7989531 |
       13. |      2   1.360348   1.4594426 |
       14. |      2   1.206198   1.4868761 |
       15. |      2   1.549883   1.5143095 |
           |-------------------------------|
       16. |      2   1.832581   1.5691764 |
       17. |      2   1.726721   1.6240433 |
       18. |      2    1.68991   1.6514767 |
       19. |      2   1.726964   1.7063437 |
       20. |      2   1.808289   1.7612106 |
           |-------------------------------|
       21. |      2   1.863417    1.788644 |
       22. |      2   1.789367   1.8435109 |
       23. |      2    1.84653   1.8983777 |
       24. |      2   1.856449   1.9532446 |
           +-------------------------------+
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thanks Carlo Lazzaro -- that'll do the trick!

      Bonus question: each time the if clause is used, Stata has to scan the whole dataset to find the relevant subset of the data, which increases runtime. Is there a way to reduce this?
      Last edited by paulvonhippel; 15 May 2023, 08:09.



      • #4
        Paul:
        not that I know, unfortunately.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          "if" qualifiers are slow - the above code has 2 - you can cut that in half by using, e.g., -keep- to keep the relevant portion of your data and then appending all at the end

          You may also be able to use the user-written -runby- command, which can be found on SSC, but I don't know whether it will be faster. Maybe Clyde Schechter (one of the authors) has a comment.
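
          A minimal sketch of how the -runby- route might look, assuming -runby- has been installed with "ssc install runby" and using the nlswork example data from #2. The program name fit_one_group and the variable y_hat are made up for illustration, and whether this is faster than the loop is untested here.

          Code:
          use "https://www.stata-press.com/data/r17/nlswork.dta", clear

          capture program drop fit_one_group
          program define fit_one_group
              * runby calls this once per idcode, with only that group's data in memory
              capture regress ln_wage age
              if _rc == 0 {
                  predict double y_hat, xb
              }
              else {
                  * regression not estimable for this group: leave the prediction missing
                  generate double y_hat = .
              }
          end

          runby fit_one_group, by(idcode)

          Because each call sees only one group's observations, no "if" qualifier is needed inside the program.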



          • #6
            The statsby command may also work. Perhaps generate the coefficients first, then merge them back into the main data and compute the predicted values in one go:

            Code:
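            use "https://www.stata-press.com/data/r17/nlswork.dta", clear
            * keep only persons with at least 3 observations
            bysort idcode: gen case = _N
            drop if case < 3

            * estimate one regression per person and save the coefficients
            preserve
            statsby _b, clear by(idcode): regress ln_wage age
            save tempcoef, replace
            restore

            * attach each person's coefficients and compute the fitted values
            merge m:1 idcode using tempcoef

            gen predict2 = _b_cons + _b_age * age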

