  • Multivariate regression

    Hi everyone.
    I have a dataset that contains more than 1600 individuals, and most of the variables take the values yes, no, or nk. All the nk values were removed. Once I ran a multivariate regression, only 539 observations were included. Could someone please explain why, and what it means?

    Thank you!

  • #2
    When you say that the nk values were "removed," do you mean that they were replaced by missing values? If that is what you meant, then the answer is clear. In Stata, or any other statistical package, regressions are carried out excluding any observation that has a missing value for any variable mentioned in the regression command. So it seems that only 539 observations in the data had non-missing values on all of the regression variables.
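
    One way to see this before running any model is to count missing values across the regression variables. This is only a sketch: the variable names are hypothetical, and it assumes the variables are numeric with nk already stored as missing.

    ```stata
    * Hypothetical variable names; assumes numeric variables with nk stored as missing
    misstable summarize death agegrp sex coinfection region

    * Number of missing values per observation across the model variables
    egen nmiss = rowmiss(death agegrp sex coinfection region)
    tabulate nmiss

    * Only observations with nmiss == 0 can enter a regression on these variables
    count if nmiss == 0
    ```

    The final count should match the N that the regression reports.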


    • #3
      Thank you for your response. No, I used the "drop if missing" command.


      • #4
        Well, in that case, you deleted those observations from the data set altogether, so they were no longer there for use in the regression. The result is the same either way.
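
        If the records themselves are worth keeping for other analyses, recoding nk to a missing value is less destructive than dropping the observations. A minimal sketch, assuming the variables are numeric, that nk is stored as the numeric code 9, and that the variable names are hypothetical:

        ```stata
        * Assumption: pneumonia and convulsion are numeric and nk is coded 9
        mvdecode pneumonia convulsion, mv(9 = .a)

        * The observations stay in memory; .a is shown as its own row here
        tabulate pneumonia, missing
        ```

        Estimation commands skip observations with .a just as they skip any other missing value, so the regression sample is the same as after dropping, but the data set itself stays intact.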


        • #5
          Since the observations are few, would you recommend keeping the NK along with yes and no? I have death, complications, pneumonia, apnoeic attacks, convulsion, conjunctivitis and hospitalization as dependent variables, and age groups, sex, coinfection status and regions as independent variables. To me it does not make sense to include the NK, even though in some variables there are more than 300 of them.
          I want to test whether there is any association between the dependent variables (clinical severity, outcome) for the disease in two groups. Group A contains individuals with the disease; group B contains individuals with the disease and other infections.


          • #6
            I think it is fair to say that keeping the nk responses and treating them as a third response category in those variables is not appropriate. These unknown responses are, in fact, some unknown mixture of people for whom the "true" response is yes and those for whom it is no. If you include the nk's in the analysis, your results for the yes and no categories apply only to those who provided an interpretable response, and therefore the results are biased unless the nk responses occur completely at random. (That is, the occurrence of an nk response has no relationship to any of the variables you plan to use in your analysis. While there are many circumstances in the world of data collection that lead to missing responses completely at random, in questionnaire data this is extremely unusual, to say the least. It seems nearly impossible, in fact.)

            In your situation, about 2/3 of your observations had at least one of these nk responses (only 539 of more than 1600 remain), which is a rather large number. So I worry about the quality of the data to begin with. Why were there so many nk responses? Were the questions asked in an unclear manner? Was the content of the questions simply beyond the scope of knowledge/understanding of your population? Were there large numbers of errors in recording the data? I think the real issue here is what you can do to improve the quality of the data set. Now, it may be that there is nothing you can do about this and you have to live with it. But in that case, you have to acknowledge that any analysis of faulty data is unlikely to produce dependable results.

            It may be possible to salvage the situation if, for at least some of the items where nk responses are frequent, there are other items in the data that are likely to be fairly strongly associated with the missing items. In that case, it may be reasonable to attempt to impute a yes or no response to the observations with an nk response using the responses to the associated other items. The validity of this approach depends on the unverifiable assumption that, conditional on the responses to the other items, the occurrence of the nk response is itself independent of the "true" yes/no response for that observation. (This is known as missing at random, or MAR.) If this assumption seems reasonable, then multiple imputation may be used. But this is a rather complicated procedure to use, and given that you are asking questions about a much more elementary approach, you may find that this is over your head and can only be accomplished with substantial participation by a more experienced analyst. It is not the kind of thing one should be doing when you are just learning to use Stata and have only an introductory statistics course (if that) behind you.
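
            For orientation only, the general shape of a multiple-imputation workflow in Stata is sketched here. Everything specific is an assumption: the variable names are hypothetical, pneumonia and convulsion are taken to be 0/1 variables with nk stored as missing, the covariates are taken to be complete, and 20 imputations and the seed are arbitrary choices. A real analysis needs a carefully built imputation model and diagnostics.

            ```stata
            * Hypothetical names; pneumonia and convulsion are 0/1 with nk stored as missing
            mi set wide
            mi register imputed pneumonia convulsion

            * Chained equations: each incomplete variable gets a logit imputation model
            * that uses the other incomplete variable plus the complete covariates
            mi impute chained (logit) pneumonia convulsion ///
                = i.agegrp i.sex i.coinfection i.region, add(20) rseed(12345)

            * Fit the analysis model in each imputed data set and pool the results
            mi estimate, or: logit pneumonia i.coinfection i.agegrp i.sex i.region
            ```

            The results are only as trustworthy as the MAR assumption and the imputation model behind them.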

            Another approach that might be possible is instead of dropping the observations where there are nk responses, if the variables involved are not critical to achieving your research goals, you might omit some or all of those variables instead, keeping the 1600 observations but using only variables for which there are no, or only a few, missing responses.
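
            To judge how much a given variable costs in sample size, one can fit the model with and without it and compare the N. A rough sketch with hypothetical variable names, assuming nk is stored as missing:

            ```stata
            * Hypothetical names; region stands in for a variable with many nk values
            quietly logistic death i.agegrp i.sex i.coinfection i.region
            display "N with region:    " e(N)

            quietly logistic death i.agegrp i.sex i.coinfection
            display "N without region: " e(N)
            ```

            Whether the gain in observations justifies leaving the variable out is a substantive decision, not a statistical one.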

            I'm sorry to be delivering such a pessimistic response. But missing data is a vexatious problem for which there are no truly good solutions; one tries to find the least bad solution for the situation at hand.


            • #7
              What does nk mean? Not Known?

              Why are the data missing? Did people just refuse to answer? Were they not asked the question for some reason? Was the question not applicable to them, e.g. questions about their children are not asked if they have no children?

              Like Clyde says, that much missing data makes you wonder about data quality. On the other hand, there could be perfectly good reasons for having that much missing, and 539 may be a perfectly reasonable N.

              In short, you need to know why the data are missing before you make a decision on what to do about it.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://academicweb.nd.edu/~rwilliam/


              • #8
                Thank you both for your responses! I noticed that if I include one particular variable along with the rest of the dependent variables in a multivariate model, the total number of observations decreases. This variable has both NK and missing values; I'm guessing that is why it affects the total number of observations?


                • #9
                  Yes. Any observation that contains a missing value on any of the regression variables is excluded from the estimation sample. So including a variable with missing values leads to a decrease in sample size.
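
                  To check which observations were actually used, and which variables are responsible for the exclusions, e(sample) can be inspected after estimation. A minimal sketch with hypothetical variable names:

                  ```stata
                  * Hypothetical names; any estimation command sets e(sample)
                  logistic death i.pneumonia i.agegrp i.sex i.coinfection i.region

                  * Flag the estimation sample: 1 = used in the model, 0 = excluded
                  generate byte insample = e(sample)
                  tabulate insample

                  * Among excluded observations, see which variables have missing values
                  misstable summarize death pneumonia agegrp sex coinfection region if !insample
                  ```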
