I have gene expression data from 40 individuals (observations) and 2,000 variables. We measured gene expression genome-wide, so for 90,000 genes in total, but I'm only allowed 2,000 variables in this stingy version of Stata, and it looks like no version can handle 90K variables anyway. I also have "age" as a variable.
What I want to do is regress age on the expression of each of the 2,000 genes, one gene at a time, and find the gene with the highest R^2. Better yet, I'd like to rank all the genes by their R^2.
Obviously I need a loop over all variables. I've tried many things but I just can't get the syntax right for this.
Any help would be greatly appreciated.
Also, I'd like to say that high-throughput biology is the norm now, so very few of our data sets have fewer than 32,767 variables. Stata has therefore basically taken itself out of the game with this limitation.
Thank you, Greg
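A minimal sketch of the loop described above, offered under assumptions rather than as a tested solution. It assumes the expression variables share a hypothetical prefix `gene` (so that `gene*` matches all of them) and that the outcome variable is named `age`. It regresses age on each gene in turn, posts each R^2 to a temporary results file with `postfile`, and then sorts that file in descending order of R^2.

```stata
* Assumption: gene variables are named gene1, gene2, ... (hypothetical prefix)
* Assumption: the outcome variable is named age
* Note: the final "use ..., clear" replaces the data in memory, so save first.

tempname results
tempfile r2file

* One row per gene: the variable name and the R^2 of its regression
postfile `results' str32 gene double r2 using `r2file'

foreach v of varlist gene* {
    quietly regress age `v'
    post `results' ("`v'") (e(r2))
}

postclose `results'

* Load the results and rank genes from highest to lowest R^2
use `r2file', clear
gsort -r2
list gene r2 in 1/20
```

Because each regression has a single predictor, its R^2 equals the squared Pearson correlation between age and that gene, so this ranking is the same as ranking genes by absolute correlation with age.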