  • Missing values in the correlation matrix

    Hi,

    if both variables contain numeric values, and their correlations with the other variables in the matrix show up normally, why do I get missing values for some correlations in the matrix? One example is Organization Type and Assets Under Management: the correlation between the two is missing, although Organization Sub-Type (which is nested within Organization Type) does not return missing values when correlated with Assets Under Management.

    An explanation would be really useful and much appreciated, as my searches of Stata help and the forums haven't turned anything up.
    Many thanks,
    Sue

  • #2
    Sue:
    your chances of getting helpful replies depend upon letting us see what you typed and what Stata gave you back (as per FAQ).
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Hi Carlo,
      okay, I re-did the command for just one of the problematic variables. So this is what I typed:

      mkcorr Insurerdummy AUMd*, log(Try30)

      and this is what I got:

      Insurerdummy AUMdummy1 AUMdummy2 AUMdummy3 AUMdummy4
      Insurerdummy 1.00
      AUMdummy1 . 1.00
      AUMdummy2 . -0.32 1.00
      AUMdummy3 . -0.34 -0.33 1.00
      AUMdummy4 . -0.33 -0.32 -0.35 1.00

      • #4
        Sue:
        I fail to see any problem in the results you report in your last post (where are the missing values?).
        However, two asides:
        - the results you posted would be more readable if you put them between the code delimiters (# icon), which are among the advanced editor (A icon) options;
        - you seemingly used -mkcorr- for your analysis. As it is not a Stata built-in command, the FAQ asks you to report the source you downloaded it from (http://fmwww.bc.edu/RePEc/bocode/m? SSC?). This request is not made out of curiosity: different versions of the same user-written programme are sometimes "floating around in cyberspace", so knowing which one was used can help those trying to reply to your query.
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          There is nothing missing from this table. Correlation matrices are symmetrical: the above diagonal correlations are identical to their mirror images below the diagonal. It is pretty conventional to display only one half or the other.

          • #6
            Copying the results from Sue's post at #3 above into a code block (and fiddling the spacing afterwards) shows that missing values appear as the leftmost column of the table: all the correlations between Insurerdummy and the four AUMdummy variables are missing. If nothing else, this example wonderfully demonstrates why the FAQ requests results be presented in code blocks!

            My guess is that every nonmissing value of Insurerdummy corresponds to missing values for the AUMdummy set, and vice versa. As a newbie, I'm not sure what the best way of exploring that possibility is, but -list in 1/10- would seem to be a place to start.

            Code:
                          Insurerdummy     AUMdummy1     AUMdummy2     AUMdummy3     AUMdummy4
            Insurerdummy          1.00
            AUMdummy1                .          1.00
            AUMdummy2                .         -0.32          1.00
            AUMdummy3                .         -0.34         -0.33          1.00
            AUMdummy4                .         -0.33         -0.32         -0.35          1.00
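
            One way to explore that guess is sketched below, assuming the variable names from #3 and showing only AUMdummy1 for brevity. The commands are standard Stata, but the choice of checks is illustrative rather than anything prescribed in this thread.

            ```stata
            * Cross-tabulate the two indicators, keeping missing values as their own category
            tabulate Insurerdummy AUMdummy1, missing

            * Count the observations in which both variables are nonmissing
            count if !missing(Insurerdummy, AUMdummy1)

            * Eyeball the first ten observations
            list Insurerdummy AUMdummy1 in 1/10
            ```

            If the count is zero, the two variables never overlap and no correlation can be computed from them.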

            • #7
              Missing correlations are inevitable if one of the variables takes on only a single value. In that circumstance, the corresponding variance is zero. Stata is behaving reasonably if that is so.
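
              A small reproducible illustration, using the auto dataset shipped with Stata rather than Sue's data. The variable highprice and the 6000 cutoff are invented for the example.

              ```stata
              sysuse auto, clear

              * highprice is defined only for foreign cars, so it is missing for domestic ones
              generate byte highprice = price > 6000 if foreign == 1

              * In the observations used here, foreign is always 1, so its variance is zero
              correlate foreign highprice
              ```

              The correlation between foreign and highprice should be reported as missing (.), even though foreign varies across the dataset as a whole.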

              • #8
                Sue:
                now I see the issue; thanks to William for pointing this out.
                I would recommend some basic double-checks (such as -help describe-; -help table-; -help summarize-) on the seemingly guilty variables.
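
                As a rough sketch of such checks, assuming the variable names from #3 (the particular commands chosen are illustrative):

                ```stata
                * Storage types and labels of the variables involved
                describe Insurerdummy AUMdummy*

                * Observation counts, means, standard deviations, minima and maxima
                summarize Insurerdummy AUMdummy*

                * Two-way table of frequencies for one of the problematic pairs
                table Insurerdummy AUMdummy1
                ```

                A variable whose minimum equals its maximum in -summarize-, or a two-way table with only one populated row or column, would point to a constant.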
                Kind regards,
                Carlo
                (Stata 19.0)

                • #9
                  Hello all and thank you for your replies so far. I'm new to Stata and statistics in general so sorry if I'm slow to understand things.

                  William's suggestion that every nonmissing value of Insurerdummy corresponds to missing values for the AUMdummy set, and vice versa doesn't seem to be the case here. I have checked and I do have AUM data for Insurers. So there are 1's in the Insurerdummy variable that correspond to 1's in the AUM quartile variables.

                  Nick:
                  ''Missing correlations are inevitable if one of the variables takes on only a single value. In that circumstance, the corresponding variance is zero. Stata is behaving reasonably if that is so.''
                  Do you mean that the variable only takes on a single value because it is a dummy variable? But all of the other variables in this dataset are dummies and I still get a correlation value between them so why not these?

                  Carlo:
                  How can I check where I downloaded my version of mkcorr from? I just looked for it via Stata and clicked on one of the first links that came up in the results for Stata 12.

                  Thanks again.
                  Sue

                  • #10
                    No; I didn't mean that. An indicator variable (you say dummy) being 1 or 0 is not in itself a barrier to calculating correlations, as you say. What bites is whenever it is either always 1 or always 0 for the observations for which the correlations are calculated, which you can check.
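
                    A sketch of that check, assuming the variable names from #3 and that the correlation is computed on observations where both variables are nonmissing (an assumption about what -mkcorr- does here, not something confirmed in this thread):

                    ```stata
                    * Restrict to the observations that can enter the Insurerdummy-AUMdummy1 correlation
                    summarize Insurerdummy if !missing(AUMdummy1)

                    * A standard deviation of zero (equivalently, min equal to max) means the indicator is constant there
                    display "SD = " r(sd) "   min = " r(min) "   max = " r(max)
                    ```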

                    One (and only one) source for mkcorr appears given a search, which is SSC.
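
                    For anyone wanting to see which copy of -mkcorr- is installed, the following standard commands may help. What they print depends on how the package was installed, so no output is shown.

                    ```stata
                    * Show where the installed mkcorr.ado lives and any version line it carries
                    which mkcorr

                    * Show the package record, including where it was installed from, if it was installed via ssc or net
                    ado describe mkcorr

                    * Show the package description currently on SSC (needs an internet connection)
                    ssc describe mkcorr
                    ```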

                    • #11
                      What seems most plausible to me is that you *do* have different values on Insurerdummy, but your AUMd series (amount of insurance?) is only being calculated for those cases with insurance. Thus, all cases in the matrix have a value of 1 on Insurerdummy, as Nick indicated. I hope this rephrasing makes it a little clearer.

                      • #12
                        Hello Nick and Ben, and sorry for the long break in replying. I think I understand now what you meant, and I think that is indeed the case: the only observations in the AUM column are for Insurers, that is, 1s in the Insurer indicator variable. I have the same problem elsewhere in the dataset when calculating correlations between two indicator variables: I get a missing value when, for example, all the 1s in the first indicator correspond to only 0s or only 1s in the second. So am I understanding the problem correctly now?

                        I'm trying to understand statistics, but I'm not a quant-minded person. It seems to me that if it's all 0s vs. all 1s in two indicator variables, the correlation should just be 0...? But it's not calculated as that, so there must be an error in my thinking. I've never seen missing values in a correlation matrix in a journal before, but the data I'm working with is what it is and I can't do much about it. Do any of you have any advice on how to deal with a situation like this?

                        • #13
                          Working backwards:

                          You probably have not seen missing values reported for correlations because authors realised, on their own account or otherwise, that there is no point to reporting them. In examples like yours, the situation is that a row or column should just be omitted from the correlation matrix.

                          Imagine that y = 0 and x = 1 with no other values. Then a scatter plot consists of a single point, repeated. No straight line can be fitted unambiguously to that display. It is true that an infinity of straight lines could be fitted but the underlying relationship is, to put it politely, ambiguous or, to use a more mathematical term, indeterminate.

                          Stata, like any other program, naturally does not settle the point by drawing a plot and thinking what it implies. Computationally, the issue is settled when Stata in effect tries to calculate cov(x, y) / [ sd(x) sd(y) ] and it is sufficient for either sd in the denominator to be zero (which happens whenever either variable is constant) for Stata to throw up its hands and report missing as the only thing it can say; dividing by zero is fatal to the calculation.
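
                          Written out, with sample quantities over the n observations used (the notation is added here for illustration and is not from the original post):

                          ```latex
                          r_{xy} = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x)\,\operatorname{sd}(y)}
                                 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                                        {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
                          ```

                          If x is constant, every x_i - \bar{x} is zero, so both the numerator and the denominator are zero and the ratio 0/0 is undefined.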

                          If you are thinking that the correlation for (y = 0, x = 1) should be 0, you are thinking perhaps that there is no relationship between the variables, so the correlation should be 0. Close, but the statistical argument is closer to a statement that there is no relationship between the variables in the stronger sense that we can't even say what the relationship is. This differs from say (y = 0, x = 0 sometimes, 1 other times) in which there is no relationship (meaning, no linear relationship) but we can still reasonably summarise the data by a horizontal straight line. The correlation is still indeterminate, but the situation differs.

                          The bottom line is that there is really no problem here once you have realised what is going on. In essence, a constant is not a variable and can't be treated as such.
                          Last edited by Nick Cox; 27 Jan 2015, 11:59.

                          • #14
                            Sue:
                            admittedly, there's only one version of -mkcorr- available for downloading. However, this case aside, the same user-written programme may well have more than one release. That's why knowing which version the poster refers to is important for giving useful advice.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            • #15
                              Okay, I understand now how the relationship can't actually be determined. Thank you for your help and I'm sorry - my 'Stata problem' turned out to be a 'statistically uninformed person trying to use Stata' problem!
