Saving p-values from linear regression

Jim Sznajdermann

Join Date: Aug 2017

Posts: 22
#1


28 Dec 2017, 10:18

Dear all,

I have asked this question before but wasn't able to succeed.
Here is a brief version of what I have:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

Where ID is the patient ID (n ~ 500); Time is the follow-up time (in years) for each patient; and Metrics A to Z are 26 parameters I collected from these patients over time.
I would like to regress (OLS) each parameter over time and obtain the (i) coefficients, (ii) SE, and (iii) P-values for each patient and each parameter, which I'd like to save as new columns in my database (i.e.: 26 parameters x 3 statistics = 78 new columns).

I would really appreciate any suggestions here.

Thanks again!!
J

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

28 Dec 2017, 10:40

Jim:
I'm surely missing out on something about your post, but I fail to get what you're after.
You're seemingly dealing with a panel dataset, with patients assessed at different points in time (so you have different waves of data for each patient).
Hence, you have non-independent observations for each patient. If you want to go OLS (which is, in general, worse than -xtreg- when you have panel data), you should -cluster()- your standard errors on panelid (patients, in your case).
If you focus on one variable at a time for each patient, you should get something similar to what is reported below:

Code:

. bysort ID: regress MetricA Time, vce(cluster ID)

---------------------------------------------------------------------------------------------------------------------------------
-> ID = NY01

Linear regression                               Number of obs     =          4
                                                F(0, 0)           =          .
                                                Prob > F          =          .
                                                R-squared         =     0.9529
                                                Root MSE          =     .63246

                                     (Std. Err. adjusted for 1 clusters in ID)
------------------------------------------------------------------------------
             |               Robust
     MetricA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        Time |        3.6          .        .       .            .           .
       _cons |        9.8          .        .       .            .           .
------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------------------------------
-> ID = NY02

Linear regression                               Number of obs     =          5
                                                F(0, 0)           =          .
                                                Prob > F          =          .
                                                R-squared         =     0.9259
                                                Root MSE          =      .5164

                                     (Std. Err. adjusted for 1 clusters in ID)
------------------------------------------------------------------------------
             |               Robust
     MetricA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        Time |          2          .        .       .            .           .
       _cons |       12.8          .        .       .            .           .
------------------------------------------------------------------------------
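
For comparison, the panel approach mentioned above (-xtreg- with standard errors clustered on patient) might look roughly like the sketch below. The variable name pid and the choice of the fixed-effects estimator are assumptions made for illustration only. With just two patients the clustered standard errors are not meaningful, so the example shows syntax rather than a usable result.

```stata
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

* -xtset- needs a numeric panel identifier, so encode the string ID
* (pid is a hypothetical variable name)
encode ID, generate(pid)
xtset pid

* One common within-patient slope for MetricA, SEs clustered on patient
* (fe is an assumption; the post does not say which -xtreg- estimator)
xtreg MetricA Time, fe vce(cluster pid)
```

Unlike the -bysort- output, this gives a single slope for all patients rather than one slope per patient.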

Kind regards,
Carlo
(Stata 19.0)


Clyde Schechter

Join Date: Apr 2014

Posts: 30112
#3

28 Dec 2017, 11:21

Carlo's code will perform all those calculations, but it will not save the results in your data set. I think the simplest way to do that is:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY03" 0 13 .
"NY03" .5 14 .
"NY03" 1 14 .
"NY03" 1.5 16 160
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

capture program drop metric_regressions
program define metric_regressions
    foreach v of varlist Metric* {
        capture regress `v' Time
        if inlist(c(rc), 2000, 2001) {
            noisily display as error "No or insufficient observations for `v', ID = `=ID[1]'"
            foreach x in b se p {
                gen `x'_`v' = .
            }
        }
        else if c(rc) != 0 { // UNANTICIPATED PROBLEM
            error c(rc)
        }
        else {
            tempname M
            matrix `M' = r(table)
            gen b_`v' = `M'[1,1]
            gen se_`v' = `M'[2,1]
            gen p_`v' = `M'[4,1]
        }
    }
    exit
end

runby metric_regressions, by(ID)

Notes:
1. You will need to install the -runby- command, written by Robert Picard and me: -ssc install runby-.
2. In mass regressions like this, it is often the case that some patients will have too few observations to carry out the required regressions. The -capture- in front of the -regress- command and the subsequent -if-else if-else- constructs deal with this problem. The way I have set it up, insufficient observations to do a regression is not treated as an error, and the ID affected will still be included in the regression output, but with missing values for the coefficient, SE and p-value for whatever variable(s) have insufficient data. If, however, some other error condition is encountered during the regression, that will count as an error, and will be reflected in the -runby- output accordingly, and that ID will not appear in the results data set.
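
To see where the subscripts [1,1], [2,1] and [4,1] come from, here is a small sketch for a single patient. After -regress-, the matrix r(table) has one column per coefficient (Time first, then _cons) and rows b, se, t, pvalue and so on. The matrix name M is used here only for illustration; the program above uses a temporary name instead.

```stata
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

regress MetricA Time if ID == "NY02"
matrix M = r(table)
matrix list M

* Rows of r(table): 1 = b, 2 = se, 3 = t, 4 = pvalue; column 1 is Time here
display "b  = " M[1,1]
display "se = " M[2,1]
display "p  = " M[4,1]

* Cross-check: the same p-value computed by hand from b, se and residual df
display "p (by hand) = " 2*ttail(e(df_r), abs(_b[Time]/_se[Time]))
```

The last two displayed values should agree, which confirms that row 4 is the two-sided p-value for the Time coefficient.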

Let me also point out that Stata is not a spreadsheet, and its datasets do not have rows and columns. They have observations and variables. If you think about Stata as if it were a spreadsheet, you will eventually be led by your spreadsheet habits to do things that are dysfunctional in Stata. One way of keeping them separate in your mind is to avoid using spreadsheet terminology when referring to Stata.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

29 Dec 2017, 05:18

Jim:
Clyde gave, as usual, excellent and detailed advice.
I agree with Clyde that my previous code would not save the results of -regress- that you were interested in; however, I was warning you about the possible results of what you had in mind: I do not think that focusing on one variable at a time for each patient (provided that I'm not mistaken in understanding your research goal) makes statistical sense.

Kind regards,
Carlo
(Stata 19.0)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

29 Dec 2017, 06:44

P-values for each patient and each parameter, which I'd like to save as new columns in my database

As Carlo wisely remarked, the point of doing this is difficult to understand.

Perhaps you could explain what you intend to do with 78 new columns from simple linear regression models. This is arcane to me, as is getting different p-values for each observation, something I confess I've never heard of and which, I fear, would sound preposterous even to the most radical "frequentist" researcher.

To end, if "for each patient" you mean the Yhat as well as the CIs (and not the p-values), you may type - help predict - and go for that.
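
A minimal sketch of that -predict- route, for one patient and one metric, could look like this. The new variable names (yhat, se_fit, ci_lo, ci_hi) are hypothetical, and the interval shown is a 95% confidence interval for the fitted mean, which is an assumption about what "the CIs" refers to.

```stata
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

* Fit for one patient, then fitted values and their standard errors
regress MetricA Time if ID == "NY02"
predict double yhat if e(sample), xb
predict double se_fit if e(sample), stdp

* 95% CI for the fitted mean, using the residual degrees of freedom
generate double ci_lo = yhat - invttail(e(df_r), 0.025) * se_fit
generate double ci_hi = yhat + invttail(e(df_r), 0.025) * se_fit

list ID Time MetricA yhat ci_lo ci_hi if e(sample), noobs
```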

I wouldn't go further than that before knowing its "cui bono".

Last edited by Marcos Almeida; 29 Dec 2017, 06:47.

Best regards,

Marcos
Jim Sznajdermann

Join Date: Aug 2017

Posts: 22
#6

29 Dec 2017, 08:04

Dear Clyde,

this worked perfectly. I really appreciate your help (again)!
Also, I appreciate your educating me on the inappropriate use of spreadsheet terminology when working in Stata.

Dear Carlo:

Sorry I didn't explain in detail.
The 78 new variables refer to coefficients, SEs, and p-values for each of the 26 parameters (DVs) I'm regressing over Time (IV) [3 x 26 = 78].
This is a medical experiment in which a series of blood test parameters (n=26) were measured in patients who were followed over time (in years), and I want to know, for each patient, which parameters changed significantly over that period and in which direction (positive or negative).
Contrary to what you stated, I'm not getting different p-values for each observation.
I'm getting p-values (and coefficients) for each blood test parameter for each patient.
The issue of type-1 error is a relevant one but I already have a plan to deal with it.
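
To make the goal concrete, here is a rough sketch for one metric of how the saved slope and p-value could be turned into a direction flag. It uses a plain loop over patients rather than -runby- so that it runs without extra installs. The variable dir_MetricA and the 0.05 cutoff are placeholders for illustration and do not reflect any particular multiplicity correction.

```stata
clear
input str4 ID float(Time MetricA MetricB)
"NY01" 0 10 110
"NY01" .5 11 120
"NY01" 1 14 130
"NY01" 1.5 15 135
"NY02" 0 13 140
"NY02" .5 14 145
"NY02" 1 14 150
"NY02" 1.5 16 160
"NY02" 2 17 175
end

* Per-patient slope and p-value for MetricA
generate double b_MetricA = .
generate double p_MetricA = .
levelsof ID, local(ids)
foreach id of local ids {
    quietly regress MetricA Time if ID == "`id'"
    matrix M = r(table)
    quietly replace b_MetricA = M[1,1] if ID == "`id'"
    quietly replace p_MetricA = M[4,1] if ID == "`id'"
}

* Direction flag: -1 falling, 0 no significant change, 1 rising
* (the 0.05 threshold is a placeholder assumption)
local alpha 0.05
generate byte dir_MetricA = cond(p_MetricA < `alpha', sign(b_MetricA), 0) if !missing(p_MetricA)
tabulate ID dir_MetricA
```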

Thanks everyone again for your prompt and helpful responses.

Best,
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

29 Dec 2017, 08:47

Considering this is a Stata forum, the assumption is that we all are fully aware of it, but I gather it helps to make a brief comment on two important terms in Stata's lingo: variables are represented by "columns" and observations as "rows".

Best regards,

Marcos