
  • Logistic regression models with duplicated data

    Hi,

    I am relatively new to Stata and have a question about the data I am using in a logistic regression model. After merging several datasets together, some observations are duplicated, e.g. for HHID "102100110201" and other households. I think this may affect the results of the logistic regression model. For instance, one household has two children, and when I merged in the total_quan data, the household's value now appears twice. Is there any way to fix this problem? Thank you for any advice!
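
    For context, the merge that produced this looked roughly like the sketch below (the file names here are placeholders, not my actual files):

    Code:
    * one row per child
    use child_records, clear
    * total_quan is recorded at the household level, so an m:1 merge
    * copies its value onto every child's row in that household
    merge m:1 HHID using household_totals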


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str14 HHID double(total_hhmembers total_quan) byte child_age
    "1013000201"    6    .  .
    "1013000204"    3  1.5 31
    "1013000206"    1    .  .
    "1013000210"    1    .  .
    "1013000213"    1    .  .
    "101300021302"  4    .  .
    "1021000102"    5    .  .
    "1021000108"    4    .  .
    "1021000109"    5    .  .
    "1021000110"    6    .  .
    "1021000111"    8    .  .
    "1021000113"    4   19 32
    "1021000201"    5    .  .
    "1021000202"    1    .  .
    "1021000203"    9    .  .
    "102100020304"  4    .  .
    "1021000207"    3    .  .
    "1021000209"    1    .  .
    "1021000210"    1    .  .
    "1021000212"    4    .  .
    "1021000213"    8    .  .
    "1021000303"    1    .  .
    "1021000310"    1    .  .
    "1021000312"    3    .  .
    "1021000313"    1    .  .
    "1021000401"    1    .  .
    "1021000402"    3    .  .
    "1021000405"    4    .  .
    "102100040502"  5    .  .
    "102100040504"  3    .  .
    "1021000406"    5    .  .
    "102100040606"  1    .  .
    "1021000408"    7    .  .
    "1021000409"    2    .  .
    "1021000501"    1    .  .
    "1021000503"    1    .  .
    "1021000504"    4    .  .
    "1021000506"    5    .  .
    "1021000604"    8    .  .
    "1021000607"    8    .  .
    "1021000608"    4    .  .
    "1021000610"    7    .  .
    "1021000612"    6    .  .
    "1021000701"    6    .  .
    "1021000702"    3    .  .
    "1021000703"    1    .  .
    "102100070302"  2    .  .
    "1021000705"    2    .  .
    "1021000709"    8    .  .
    "1021000710"   13    .  .
    "1021000711"    5    .  .
    "1021000802"    1    .  .
    "1021000803"    1    .  .
    "1021000805"    1    .  .
    "102100080503"  1    .  .
    "1021000807"    2    .  .
    "1021000808"    2    .  .
    "102100080803"  5   52 42
    "1021000809"    2    .  .
    "1021000810"    2    .  .
    "1021000811"    5    .  .
    "1021000904"    6    .  .
    "1021000906"    2    .  .
    "1021000909"    3    .  .
    "1021001003"    6    .  .
    "1021001004"    4    .  .
    "1021001005"    5    .  .
    "1021001007"    6    .  .
    "1021001008"    4    .  .
    "1021001009"    2    .  .
    "1021001011"    1    .  .
    "1021001102"    1    .  .
    "102100110201"  7    7  7
    "102100110201"  7    7 47
    "1021001105"    1    .  .
    "1021001107"   10    .  .
    "1021001109"   12    .  .
    "102100110901"  0    .  .
    "102100110903"  3    .  .
    "102100110904"  3    .  .
    "1021001110"    1    .  .
    "1021001205"   11    .  .
    "1021001206"    3    .  .
    "1021001208"    7    .  .
    "1021001210"    4    .  .
    "1021001211"    2    .  .
    "1021001301"    4    .  .
    "1021001302"    5    .  .
    "1021001304"    8 52.5 50
    "1021001304"    8 52.5 27
    "1021001305"    3    .  .
    "1021001306"    3    .  .
    "1021001307"    4    .  .
    "1021001308"    2    .  .
    "1021001309"    1    .  .
    "1021001311"    4    .  .
    "1021001402"    2    .  .
    "102100140202"  2    .  .
    "102100140204"  1    .  .
    "1021001403"    3    .  .
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 3627 observations
    Use the count() option to list more


  • #2
    -duplicates drop- will eliminate any entirely duplicate observations from your data set.
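
    As a minimal sketch of that approach (note that -duplicates drop- with no varlist removes only observations that are identical on every variable, so near-duplicates that differ in, say, child_age will survive):

    Code:
    * tally complete duplicates, then drop the surplus copies
    duplicates report
    duplicates drop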

    However, rather than "fixing" the problem this way, I would urge you to first review the data management up to this point to understand why the data contain duplicate observations. It may be that they should be there, but that there should also be some differences between them that somehow got erased. It is also sometimes a symptom of other things gone wrong in data management when a data set that is planned to be unique at a given (household in your case, I take it) level has duplicates. So just dropping the duplicates may well be sweeping the problem under the rug. Instead, you should delve into the problem, understand its origins, and fix whatever led to the problem in the first place.
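
    One way to start that investigation, as a sketch (dup_hhid is just an illustrative variable name):

    Code:
    * how many observations share each HHID?
    duplicates report HHID

    * flag every observation whose HHID occurs more than once,
    * then inspect those households side by side
    duplicates tag HHID, generate(dup_hhid)
    sort HHID
    list HHID total_hhmembers total_quan child_age if dup_hhid > 0, sepby(HHID)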
