  • Saving p-values from linear regression

    Dear all,

    I have asked this question before but wasn't able to succeed.
    Here is a brief version of what I have:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4 ID float(Time MetricA MetricB)
    "NY01"   0 10 110
    "NY01"  .5 11 120
    "NY01"   1 14 130
    "NY01" 1.5 15 135
    "NY02"   0 13 140
    "NY02"  .5 14 145
    "NY02"   1 14 150
    "NY02" 1.5 16 160
    "NY02"   2 17 175
    end

    Where ID is the patient ID (n ~ 500); Time is the follow-up time (in years) for each patient; and Metrics A to Z are 26 parameters I collected from these patients over time.
    I would like to regress (OLS) each parameter over time and obtain the (i) coefficients, (ii) SE, and (iii) P-values for each patient and each parameter, which I'd like to save as new columns in my database (i.e.: 26 parameters x 3 statistics = 78 new columns).

    I would really appreciate any suggestions here.

    Thanks again!!
    J

  • #2
    Jim:
    I'm surely missing out on something about your post, but I fail to get what you're after.
    You're seemingly dealing with a panel dataset, with patients assessed at different points in time (so you have different waves of data for each patient).
    Hence, you have non-independent observations for each patient. If you want to go OLS (which is, in general, worse than -xtreg- when you have panel data), you should -cluster()- your standard errors on panelid (patients, in your case).
    If you focus on one variable at a time for each patient, you should get something similar to what is reported below:
    Code:
    . bysort ID: regress MetricA Time, vce(cluster ID)
    
    ---------------------------------------------------------------------------------------------------------------------------------
    -> ID = NY01
    
    Linear regression                               Number of obs     =          4
                                                    F(0, 0)           =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.9529
                                                    Root MSE          =     .63246
    
                                         (Std. Err. adjusted for 1 clusters in ID)
    ------------------------------------------------------------------------------
                 |               Robust
         MetricA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            Time |        3.6          .        .       .            .           .
           _cons |        9.8          .        .       .            .           .
    ------------------------------------------------------------------------------
    
    ---------------------------------------------------------------------------------------------------------------------------------
    -> ID = NY02
    
    Linear regression                               Number of obs     =          5
                                                    F(0, 0)           =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.9259
                                                    Root MSE          =      .5164
    
                                         (Std. Err. adjusted for 1 clusters in ID)
    ------------------------------------------------------------------------------
                 |               Robust
         MetricA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            Time |          2          .        .       .            .           .
           _cons |       12.8          .        .       .            .           .
    ------------------------------------------------------------------------------
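    A note on the output above: within each -bysort- group there is only one cluster (a single ID), which is why the clustered standard errors, t statistics and p-values are all reported as missing. The pooled approach described earlier in this post might look roughly like the sketch below. It assumes the example data from #1 are in memory; pid is a hypothetical numeric copy of ID, introduced only because -xtset- needs a numeric panel variable. With just two patients the clustered standard errors are not trustworthy, so read this as an illustration of the syntax rather than an analysis.

    ```stata
    * pid is a hypothetical name: a numeric version of the string ID
    encode ID, generate(pid)

    * Pooled OLS over all patients, standard errors clustered on patient
    regress MetricA Time, vce(cluster pid)

    * Panel alternative: random-effects model with patient as the panel unit
    xtset pid
    xtreg MetricA Time, re
    ```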
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Carlo's code will perform all those calculations, but it will not save the results in your data set. I think the simplest way to do that is:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str4 ID float(Time MetricA MetricB)
      "NY01"   0 10 110
      "NY01"  .5 11 120
      "NY01"   1 14 130
      "NY01" 1.5 15 135
      "NY03"   0 13   .
      "NY03"  .5 14   .
      "NY03"   1 14   .
      "NY03" 1.5 16 160
      "NY02"   0 13 140
      "NY02"  .5 14 145
      "NY02"   1 14 150
      "NY02" 1.5 16 160
      "NY02"   2 17 175
      end
      
      capture program drop metric_regressions
      program define metric_regressions
          foreach v of varlist Metric* {
              capture regress `v' Time
              if inlist(c(rc), 2000, 2001) {
                  noisily display as error "No or insufficient observations for `v', ID = `=ID[1]'"
                  foreach x in b se p {
                      gen `x'_`v' = .
                  }
              }
              else if c(rc) != 0 { // UNANTICIPATED PROBLEM
                  error c(rc)
              }
              else {
                  tempname M
                  matrix `M' = r(table)
                  gen b_`v' = `M'[1,1]
                  gen se_`v' = `M'[2,1]
                  gen p_`v' = `M'[4,1]
              }
          }
          exit
      end
      
      runby metric_regressions, by(ID)
      Notes:
      1. You will need to install the -runby- command, written by Robert Picard and me: -ssc install runby-.
      2. In mass regressions like this, it is often the case that some patients will have too few observations to carry out the required regressions. The -capture- in front of the -regress- command and the subsequent -if-else if-else- constructs deal with this problem. The way I have set it up, having insufficient observations for a regression is not treated as an error: the affected ID will still be included in the results, but with missing values for the coefficient, SE and p-value of whatever variable(s) have insufficient data. If, however, some other error condition is encountered during the regression, that will count as an error and will be reflected in the -runby- output accordingly, and that ID will not appear in the results data set.
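
      Once -runby- has finished, the new variables are constant within each patient, so one observation per ID is enough to read them off. A minimal sketch, assuming the code above has just been run on the example data: first and sig_MetricA are hypothetical variable names, and the 0.05 threshold is an illustrative choice with no adjustment for multiple comparisons.

      ```stata
      * Tag one observation per patient so each ID is listed only once
      egen byte first = tag(ID)
      list ID b_MetricA se_MetricA p_MetricA if first, noobs

      * Direction of change where the slope is nominally significant:
      * 1 = increasing, -1 = decreasing, 0 = not significant, missing = no estimate
      generate byte sig_MetricA = sign(b_MetricA) * (p_MetricA < 0.05) if !missing(p_MetricA)
      ```

      The same two lines can be repeated (or looped with -foreach-) for the other Metric variables.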

      Let me also point out that Stata is not a spreadsheet, and its datasets do not have rows and columns. They have observations and variables. If you think about Stata as if it were a spreadsheet, you will eventually be led by your spreadsheet habits to do things that are dysfunctional in Stata. One way of keeping them separate in your mind is to avoid using spreadsheet terminology when referring to Stata.



      • #4
        Jim:
        As usual, Clyde gave excellent and detailed advice.
        I agree with Clyde that my previous code would not save the results of -regress- that you were interested in; however, I was warning you about the possible results of what you had in mind: I do not think that focusing on one variable at a time for each patient (provided that I'm not mistaken in understanding your research goal) makes statistical sense.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          "P-values for each patient and each parameter, which I'd like to save as new columns in my database"
          As Carlo wisely remarked, the point of doing this is difficult to understand.

          Perhaps you could tell us what you intend to do with 78 new columns from simple linear regression models. This is arcane to me, as is getting different p-values for each observation, something I confess I've never heard of and which, I fear, would sound preposterous even to the most radical "frequentist" researcher.

          Finally, if by "for each patient" you mean the fitted values (Yhat) and their CIs rather than the p-values, you may type -help predict- and go from there.
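
          For that fitted-values reading of the question, a minimal sketch with -predict- might look like this. It assumes the example data from #1 are in memory and fits a single pooled regression; yhat, se_yhat, ci_lo and ci_hi are hypothetical variable names.

          ```stata
          regress MetricA Time

          * Fitted value and its standard error for every observation
          predict double yhat, xb
          predict double se_yhat, stdp

          * 95% confidence limits for the fitted mean, using the residual degrees of freedom
          generate double ci_lo = yhat - invttail(e(df_r), 0.025) * se_yhat
          generate double ci_hi = yhat + invttail(e(df_r), 0.025) * se_yhat
          ```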

          I wouldn't go further than that before knowing its "cui bono".
          Last edited by Marcos Almeida; 29 Dec 2017, 06:47.
          Best regards,

          Marcos



          • #6
            Dear Clyde,

            this worked perfectly. I really appreciate your help (again)!
            I also appreciate your educating me about the inappropriate use of spreadsheet terminology in Stata.

            Dear Carlo:

            Sorry I didn't explain in detail.
            The 78 new variables refer to coefficients, SEs, and p-values for each of the 26 parameters (DVs) I'm regressing over Time (IV) [3 x 26 = 78].
            This is a medical experiment in which a series of blood test parameters (n=26) were measured in patients who were followed over time (in years), and I want to know, for each patient, which parameters changed significantly over that period and in which direction (positive or negative).
            Contrary to what you stated, I'm not getting different p-values for each observation.
            I'm getting p-values (and coefficients) for each blood test parameter for each patient.
            The issue of type-1 error is a relevant one but I already have a plan to deal with it.

            Thanks everyone again for your prompt and helpful responses.

            Best,



            • #7
              Considering this is a Stata forum, the assumption is that we all are fully aware of it, but I gather it helps to make a brief comment on two important terms in Stata's lingo: variables are represented by "columns" and observations as "rows".


              Best regards,

              Marcos
