Programming Question - how to find the max R^2 over many regressions?

Gregory Grant

Join Date: May 2016

Posts: 20
#1

Programming Question - how to find the max R^2 over many regressions?

12 May 2016, 14:35

I have gene expression data from 40 individuals (observations) and 2,000 variables (we measured the gene expression genome-wide, so for 90,000 genes in total, but I'm only allowed 2,000 variables in this stingy version of Stata - and it looks like no version can handle 90K vars anyway...). I also have "age" as a variable.

What I want to do is to compute the regression of age with the expression of each of the 2,000 genes and to find the gene with the highest R^2. Or better yet to rank them by their R^2.

Obviously I need a loop over all variables. I've tried many things but I just can't get the syntax right for this.

Any help would be greatly appreciated.

Also, I'd like to say that high throughput biology is the norm now, so that very few of our data sets have less than 32,767 variables. Therefore Stata has basically taken itself out of the game with this limitation.

Thank you, Greg
Tags: None
Aspen Chen

Join Date: Apr 2014

Posts: 114
#2

12 May 2016, 15:13

It'd help if you could provide a toy example. I am not familiar with how gene expressions are measured and might not fully comprehend the problem, Still, below are two ways which I thought could be a start.

The first uses a wide format data, runs a loop, and stores the R-squared values in a matrix. The second reshapes the data into long format and uses statsby to run the regressions and to extract the R-squared values. With many more variables, I think statsby would have a noticeable edge over the loop in terms of speed.

Stata can take billions of cases with sufficient memory. So if you can get the data in long format, I think you should be able to fit all of your variables.

Code:

clear *** Wide format method input id age g1 g2 g3 1 40 1 0 1 2 21 1 0 0 3 72 0 1 1 4 56 0 0 0 5 48 1 1 0 end mat R=J(3,1,.) forval i=1/3 { reg age g`i' mat R[`i',1]=`e(r2)' } mat list R *** Long format method reshape long g, i(id) j(genenum) // wide to long statsby r=e(r2), by(genenum) clear:reg age g list

Last edited by Aspen Chen; 12 May 2016, 15:17.
Comment

daniel klein

Join Date: Mar 2014
Posts: 3886

13 May 2016, 04:43

If this is about time efficiency, I would guess the reshaping part will take quite a while and thus might not be compensated by statsby. Also regress is fast, but I would guess correlate is faster. Since there are only two variables there is no need for regression.

Since calculation is so trivial in this case, I would be tempted to shift the thing (especially the looping) into Mata completely. Consider:

Code:

*! version 1.0.0 13may2016 daniel klein
program sorted_r2 , rclass
    version 12.1
    syntax varlist(numeric min = 2)
    gettoken yvar xvars : varlist
    
    tempname R2
    mata : sorted_r2("`yvar'", "`xvars'", "`R2'")
    
    return  matrix R_2 `R2'
end

version 12.1

mata :

void sorted_r2(string scalar yvar,
            string scalar xvars,
            string scalar matname)
{
    real colvector y, idx
    real matrix X, R2
    
    st_view(y = ., ., yvar)
    st_view(X = ., ., xvars)
    R2 = J(cols(X), 1, .)
    
    for (i = 1; i <= rows(R2); ++i) {
        R2[i] = corr(variance((y, X[., i])))[2, 1]^2
    }
    idx = sort((R2, (1::rows(R2))), -1)[., 2]
    
    st_matrix(matname, R2[idx])
    st_matrixcolstripe(matname, ("", "R_2"))
    st_matrixrowstripe(matname, ///
        (J(rows(R2), 1, ""), tokens(xvars)[idx]'))
}

end

Then, with the toy data above

Code:

sorted_r2 age g*
return list
matlist r(R_2)

Note that this assumes no missing values.

Best
Daniel

Last edited by daniel klein; 13 May 2016, 04:59.

Comment

Gregory Grant

Join Date: May 2016

Posts: 20
#4

15 May 2016, 07:52

Thank you both for your replies! I realize a few things by reading them. First, Stata is difficult and I'm very far from understanding your answers. I'm not sure how you guys learned to program like that because none of the online tutorials and such that I find could get me there. I was hoping something natural was possible. Like the following pseudocode:

Code:
max = 0
foreach x in vars
regress age x
if e(r2) > max
max = e(r2)
display max

Instead there seems to be an incredible amount of cryptic bookkeeping and unintuitive syntax and data structures involved in Stata. I guess they want to cultivate a very elite set of guru-types where mortals have no access. I am sure I could write a script in Perl from scratch to do this faster than I could figure out the stata syntax...

I'm pretty discouraged in this moment but hopefully it won't send me packing... :-(

Thanks again for your time!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#5

15 May 2016, 10:27

Although I have never personally met Daniel Klein or Aspen Chen, I'm pretty sure that they are as mortal as I am, and I can do things like this as well. Yes there is a learning curve for Stata. For me, at least, Stata is a lot more intuitive and was much easier to learn, than SAS (which I quit using over 20 years ago.) And no business enterprise would benefit by structuring its product so that it could only be used by an elite set of guru-types.

In fact, in most situations like yours, the solution to this problem would be quite similar to the pseudo-code you show in #4. The only reason it doesn't work here is because you have too many variables to load into memory at once. Aspen Chen's long-format method deals with that problem and it doesn't contain anything arcane. Even when you are not working with enormous numbers of variables, the -reshape- command is a basic for data-management in Stata, and though it takes a little getting used to, it's quite simple to use once you have the hang of it. -statsby- then basically incorporates a variation on your pseudocode, where the different variables are "stacked vertically," without your having to write it out.

It sounds like you would like a resource that can get you beyond the very basics that you can glean from the on-line videos into more advanced Stata use. You might try Christopher F. Baum, An Introduction to Stata Programming, 2nd ed. available from Stata Press. If you can program in Perl, I am quite confident you have what it takes to learn to program in Stata.
1 like
Comment
Gregory Grant

Join Date: May 2016

Posts: 20
#6

15 May 2016, 12:33

Thank you for your encouragement and the pointer, I just ordered that book - although the $13 shipping charge almost stopped me. It sounds like it's not a ground-up treatment, so I hope I have enough background to read it. I'll let you know. It will really help me at work to get a handle on this. Right now we use Prism and R and I really do not like R. Thanks again!
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#7

15 May 2016, 13:11

Gregory - your point regarding the number of variables (columns) now common in biology is very interesting. It made me think - why does Stata have different limits for the amount of observations (rows) and variables (columns). A more reasonable limit in my view is the size of the data-table as a product of columns and rows.. what do you think?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

15 May 2016, 17:17

Instead there seems to be an incredible amount of cryptic bookkeeping and unintuitive syntax and data structures involved in Stata. I guess they want to cultivate a very elite set of guru-types where mortals have no access.

Odd, that's pretty much how I feel about the reading I've tried doing to understand contemporary genetics.

When I began using Stata in a serious way, I started by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through Stata's Help menu. The objective in doing this was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax.

I can't say I've ever bothered searching for online tutorials; the documentation and Statalist do fine for me. I began by reading Statalist to test my knowledge, and exercise my ability to find solutions to problems I had not previously encountered.

You should take the time to read the fine manuals, that's the best advice I can offer. Stata's documentation is exceptional.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

15 May 2016, 17:30

To follow up on the question asked at Post #7, each variable in a dataset has with it an attendant amount of overhead in memory and in processing time to keep track of its characteristics, to locate it from among all the other variables (especially when abbreviations and/or wild cards are allowed), and so on. Limiting the number of variables allows the code to use a fixed upper limit on the data structures involved in processing the dataset. The same issues do not arise for observations.

Last edited by William Lisowski; 15 May 2016, 18:00.
1 like
Comment
Gregory Grant

Join Date: May 2016

Posts: 20
#10

16 May 2016, 05:37

Originally posted by William Lisowski View Post

Odd, that's pretty much how I feel about the reading I've tried doing to understand contemporary genetics.

That's true, I get asked a lot for a good book but the field is changing so fast that any book is out-of-date almost immediately. Because of advances in technology, the field is has become completely dependent on computers and statistics, but traditionally biology was what you majored in if you love science but you hate math and computers. Unfortunately the cows have come home to roost and now biologists cannot function without computers and statistics. Instead of learning these things most biologists just collaborate with programmers and statisticians. And that's created quite a cultural divide in the field. And it's easy to find programmers but it's almost impossible to find statisticians who can do this stuff because the need is so extreme and such a small percent of people are good at math. If you have a handle on the statistical analysis of high-throughput biological data you can basically write your own ticket these days... and mastering something like Stata is a huge leg up, most people use R and/or something like Prism.
Comment
Gregory Grant

Join Date: May 2016

Posts: 20
#11

16 May 2016, 05:42

Originally posted by William Lisowski View Post

To follow up on the question asked at Post #7, each variable in a dataset has with it an attendant amount of overhead in memory and in processing time to keep track of its characteristics, to locate it from among all the other variables (especially when abbreviations and/or wild cards are allowed), and so on. Limiting the number of variables allows the code to use a fixed upper limit on the data structures involved in processing the dataset. The same issues do not arise for observations.

Thank you for this explanation. Unfortunately to keep up with modern biology Stata is going to have to increase the number of variables beyond 32,767. Sometimes we tile the entire genome (which has 3.5 billion bases) with 50 or 100 base windows, so we have like 35 to 70 million variables. Many of those might be all zeros so we can discard them. But you still end up with some millions of variables that you'd like to treat in parallel. With a limit of 32,767 it's impossible to do what we really need with this data - while R, the free community project, has no such native limitations. If Stata wants to compete with R they're going to have to do something about this.
1 like
Comment

Announcement