  • Create a loop for regression and extract the beta coef for each user

    Hello, Statalist community! I am trying to estimate the network effect for every unique user.

    Here is a sample from my dataset, which has over 32 million observations and 13,160,547 unique users.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str17 steamid long appid float(number_of_games friends)
    "76561197960265729"     60 1 1
    "76561197960265730"    340 3 1
    "76561197960265730" 220200 3 1
    "76561197960265730" 235620 3 1
    "76561197960265733"   3320 5 3
    "76561197960265733"     70 5 3
    "76561197960265733"   1500 5 3
    "76561197960265733" 214420 5 3
    "76561197960265733"     30 5 3
    "76561197960265738"     50 3 1
    end
    So a steamid appearing several times is not a duplicates problem; each row is a different game that this ID owns.

    What I am trying to do here is to calculate the network effect for every unique user by extracting the beta coefficient from the regression
    Code:
    reg number_of_games friends
    which should be run for every user.

    What I have tried is:
    Code:
    foreach i of var steamid {
        reg number_of_games friends
        gen networkeffect = _b[friends]
    }
    But I only get the same value for every user.

    I think there is something wrong with
    Code:
    foreach i of var steamid
    but I don't know what I can do.

    I really appreciate any help!
    Thanks a lot in advance!
    Ji

  • #2
    foreach in Stata doesn't work as you guess. It does not look inside a variable and then automagically loop over its distinct values.
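
    A minimal sketch of the difference, using the auto dataset shipped with Stata rather than the data in this thread (the local macro name levels is only an illustrative choice). foreach ... of varlist walks over variable names, so a single variable gives a single pass; levelsof is one way to collect the distinct values of a variable so that they can be looped over.

    ```stata
    * Sketch on the auto dataset that ships with Stata
    sysuse auto, clear

    * Loops over variable NAMES: one pass, with v holding the name "foreign"
    foreach v of varlist foreign {
        display "`v'"
    }

    * levelsof stores the distinct VALUES of foreign in the local macro levels
    levelsof foreign, local(levels)

    * Now loop over those values, one pass per distinct value
    foreach l of local levels {
        count if foreign == `l'
    }
    ```

    With millions of distinct identifiers, a loop like this would be very slow, which is a practical reason (on top of the statistical one) to look for another approach.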

    A prior question is whether you should want what you're trying to do. I guess you don't really want 13 million or so regressions, each with 2 to 3 data points on average (noting that regressions with 1 data point will fail). You might want one regression on people, each counted once.

    Code:
    egen tag = tag(steamid) 
    regress number_of_games friends if tag
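
    As a sketch of what the tag does, here it is applied to the data example from the first post. egen's tag() function marks exactly one observation per distinct steamid with 1 and all others with 0, so the regression uses each person once. The four-user sample is far too small for the estimates to mean anything; it only shows the mechanics.

    ```stata
    clear
    input str17 steamid long appid float(number_of_games friends)
    "76561197960265729"     60 1 1
    "76561197960265730"    340 3 1
    "76561197960265730" 220200 3 1
    "76561197960265730" 235620 3 1
    "76561197960265733"   3320 5 3
    "76561197960265733"     70 5 3
    "76561197960265733"   1500 5 3
    "76561197960265733" 214420 5 3
    "76561197960265733"     30 5 3
    "76561197960265738"     50 3 1
    end

    * tag is 1 on one row per steamid and 0 on every other row
    egen tag = tag(steamid)

    * Number of distinct users in the sample
    count if tag

    * One row per person, then one regression across people
    list steamid number_of_games friends if tag, noobs
    regress number_of_games friends if tag
    ```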

    • #3
      Thank you Nick. But how do I run this regression for each unique user so that I can extract the beta_coef? Which command can I use?

      • #4
        I think you're missing my point, which I did understate, trusting that it would make sense.

        Look again at your own data example. For each identifier shown, the number of friends and the number of games are identical across observations. So, there is a scatter plot for each person, which is a single data point.

        Regression for such cases is -- to use an over-used word -- utterly meaningless. An infinity of straight lines goes through each single data point, but that is of no use or interest.
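
        One way to check this on the data, sketched here on the data example from the first post: sort within each steamid and assert that the first and last values agree. assert stops with an error if any user has more than one distinct value. Whether the same holds in the full dataset is something only a run on that dataset can confirm.

        ```stata
        clear
        input str17 steamid long appid float(number_of_games friends)
        "76561197960265729"     60 1 1
        "76561197960265730"    340 3 1
        "76561197960265730" 220200 3 1
        "76561197960265730" 235620 3 1
        "76561197960265733"   3320 5 3
        "76561197960265733"     70 5 3
        "76561197960265733"   1500 5 3
        "76561197960265733" 214420 5 3
        "76561197960265733"     30 5 3
        "76561197960265738"     50 3 1
        end

        * Sorted by friends within steamid, so equal first and last values
        * mean friends is constant for that user
        bysort steamid (friends): assert friends[1] == friends[_N]

        * Same check for the outcome variable
        bysort steamid (number_of_games): assert number_of_games[1] == number_of_games[_N]
        ```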



        • #5
          Maybe this will help illustrate. See the regression results for the second ID in the data example:

          Code:
          . reg number_of_games friends if steamid == "76561197960265730"
          note: friends omitted because of collinearity.
          
                Source |       SS           df       MS      Number of obs   =         3
          -------------+----------------------------------   F(0, 2)         =      0.00
                 Model |           0         0           .   Prob > F        =         .
              Residual |           0         2           0   R-squared       =         .
          -------------+----------------------------------   Adj R-squared   =         .
                 Total |           0         2           0   Root MSE        =         0
          
          ------------------------------------------------------------------------------
          number_of_~s | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
               friends |          0  (omitted)
                 _cons |          3          .        .       .            .           .
          ------------------------------------------------------------------------------
          This is basically useless because:
          1. There is no variability in the dependent variable, so the total sum of squares is 0. This means that knowing the number of games from just one app ID row within each person is already enough.
          2. There is no variability in the independent variable, and that is a big "no" for linear regression because the slope cannot be estimated and the SE of the regression coefficient is undefined.
          That's why there is no point in running the regression 13 million times: you'd just get a column of 0 if there are enough data or a column of "." if there are not enough data (for this case, you'll need 3 per Steam ID).

          But, just for the sake of it, let me answer the technical part of the question. To run a regression by ID, I usually use statsby:

          Code:
          sysuse nlsw88, clear
          statsby eff = _b[age], by(race) clear: regress wage age
          list
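
          For completeness, a sketch of the same pattern with the variable names from the question, run on the data example from the first post. The result name networkeffect is taken from the question. Because neither variable varies within a steamid, the column this produces should carry no information, in line with the points above; the sketch only shows the syntax.

          ```stata
          clear
          input str17 steamid long appid float(number_of_games friends)
          "76561197960265729"     60 1 1
          "76561197960265730"    340 3 1
          "76561197960265730" 220200 3 1
          "76561197960265730" 235620 3 1
          "76561197960265733"   3320 5 3
          "76561197960265733"     70 5 3
          "76561197960265733"   1500 5 3
          "76561197960265733" 214420 5 3
          "76561197960265733"     30 5 3
          "76561197960265738"     50 3 1
          end

          * One regression per steamid; the clear option replaces the data in
          * memory with one row of collected results per group
          statsby networkeffect = _b[friends], by(steamid) clear: regress number_of_games friends

          list, noobs
          ```

          Note that the clear option discards the original data in memory, so save it first if it is needed afterwards.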

          • #6
            I understand it now. Thank you very much!
