How to find best similar variables using overlaid graph or histograms or else?

Priver JM

Join Date: Feb 2019

Posts: 30
#1

31 Mar 2019, 13:48

I have 51 variables, each containing 6,120 values: resid1, resid2, ..., resid51.
These are residuals predicted after regressions.
What I want to do is find, among the other 50 residuals, the one whose trend best matches resid1.
Even if I overlay only 10 lines using the following code, it is hard to find the one whose trend is most similar to resid1, because the lines lie too close together.

Code:

line resid1 resid2 resid3 resid4 resid5 resid6 resid7 resid8 resid9 resid10 fips, legend(label(1 "1") label(2 "2") label(3 "3") label(4 "4") label(5 "5") label(6 "6") label(7 "7") label(8 "8") label(9 "9") label(10 "10"))

So if I draw all 51 lines, it will be even harder to tell them apart and find the most similar one.
Is there another way to find this trend?

Thanks.
Nick Cox

Join Date: Mar 2014

Posts: 35793
#2

01 Apr 2019, 04:09

A first stab at this is to look at the correlations between resid1 and the others. Crudely but usefully, you can put those in a variable, assuming you have at least 51 observations.

Code:

gen correlation = .
gen which = _n

quietly forval j = 2/51 {
    correlate resid1 resid`j'
    replace correlation = r(rho) in `j'
}

gsort -correlation
list which correlation in 1/7

Now check out the top candidates by plotting them.
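
One way to inspect a single candidate is to plot it against resid1 together with a line of equality. This is only a sketch: the candidate number 7 is hypothetical and should be replaced by a value of which taken from the list above.

```stata
* k is a hypothetical candidate number; replace it with one from the list
local k 7

* Scatter of the candidate against resid1, plus the line y = x for reference.
* Points close to the line indicate agreement, not just linear association.
twoway scatter resid`k' resid1, msymbol(oh) ///
    || function y = x, range(resid1)        ///
    ||, ytitle("resid`k'") xtitle("resid1") legend(off)
```

Unlike an overlaid line plot, this graph does not depend on the current sort order of the data.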

A twist on this is to use concordance correlation, which does measure agreement, not linearity.
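
As a rough sketch of that twist, the loop can be adapted to compute Lin's concordance correlation by hand, 2 cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). The variable names ccc and which_ccc are made up for illustration, and the sketch uses the n - 1 variances and covariance that correlate reports, so it may differ very slightly from definitions based on divisor n. A community-contributed command such as concord from SSC is likely a more complete option.

```stata
* Assumed names: ccc and which_ccc are new variables introduced for this sketch
gen ccc = .
gen which_ccc = _n

quietly forval j = 2/51 {
    * covariance and variances on the observations used jointly
    correlate resid1 resid`j', covariance
    local cov = r(cov_12)
    local v1  = r(Var_1)
    local v2  = r(Var_2)

    * means on the same joint sample
    summarize resid1 if !missing(resid1, resid`j'), meanonly
    local m1 = r(mean)
    summarize resid`j' if !missing(resid1, resid`j'), meanonly
    local m2 = r(mean)

    * concordance correlation for resid1 versus resid`j', stored in row j
    replace ccc = 2 * `cov' / (`v1' + `v2' + (`m1' - `m2')^2) in `j'
}

gsort -ccc
list which_ccc ccc in 1/7
```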
Priver JM

Join Date: Feb 2019

Posts: 30
#3

01 Apr 2019, 08:49

Thank you for your suggestion.
However, why are the results different every time I run this code?
I should have added this information at first: my current assumption is that the predicted values in resid1 to resid51 are changing.
Here's the code that predicts resid1 and the rest of the residuals. So basically the residuals are predicted after regressions.

Code:

global X = "per_w per_h inc flfp"

reg lui $X if fips == 1 & treat == 1 & post == 0
predict resid1, r

forvalue i = 2/51 {
    reg lui $X if fips == `i' & treat == 0 & post == 0
    predict resid`i', r
}

gen correlation = .
gen which = _n

quietly forval j = 2/51 {
    correlate resid1 resid`j'
    replace correlation = r(rho) in `j'
}

gsort -correlation
list which correlation in 1/7
Nick Cox

Join Date: Mar 2014

Posts: 35793
#4

01 Apr 2019, 09:02

I have no idea what fips is or are, but your model is here fitted in turn to disjoint groups, yet your residuals are calculated out of sample too. Is that what you want?

I can't see any reason for results changing each time. They will possibly be in different observations, but collectively they should be the same.
Priver JM

Join Date: Feb 2019

Posts: 30
#5

02 Apr 2019, 05:52

Here, fips is a number from 1 to 51 assigned to each state.

With the code above, I have 51 residual variables, one per state as identified by fips: resid1, resid2, ..., resid51.

These residuals are predicted after running a regression, and each residual variable has 6,120 values. The problem is that these residuals change each time. I think that's why the correlation results above change each time. Is there a way to fix it?
Nick Cox

Join Date: Mar 2014

Posts: 35793
#6

02 Apr 2019, 06:03

Thanks for the further detail, but nothing you’ve said leads me to change my previous answer. The regression results should be the same for the same data. It’s likely that the gsort will shuffle them, as said.

Last edited by Nick Cox; 02 Apr 2019, 06:41.
Nick Cox

Join Date: Mar 2014

Posts: 35793
#7

02 Apr 2019, 07:20

Otherwise put, if you repeat the code in #3 what reason could there possibly be for different results? If you were using different datasets, or otherwise changing the data, you would be telling us. The correlations are invariant to the sort order of the data.
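
One way to check this is to store the correlations in a matrix indexed by j rather than in observations, so that any reshuffling by gsort cannot affect where each result sits. This is a minimal sketch, assuming resid1 to resid51 already exist; the matrix name C is arbitrary.

```stata
* Row j of C holds the correlation between resid1 and resid`j'
matrix C = J(51, 1, .)

quietly forval j = 2/51 {
    correlate resid1 resid`j'
    matrix C[`j', 1] = r(rho)
}

* Rerunning this block on unchanged data should list identical values
matrix list C
```

If the listed values are the same from run to run, the apparent changes come from the sort order of the observations and not from the correlations themselves.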