
  • Logistic regression models with duplicated data

    Hi,

    I am relatively new to Stata and have a question about the data I am using in a logistic regression model. After merging several datasets together, some observations are duplicated, e.g. for HHID "102100110201" and other households. I think this may affect the results of the logistic regression model. For instance, one household has two children, and when I merged in the total_quan data, the household's value now appears twice. Is there any way to fix this problem? Thank you for any advice!
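
    For context, the merge that produced this looked roughly like the sketch below (the file names here are placeholders, not my actual files):

    Code:
    * one row per child
    use child_records, clear
    * total_quan is recorded at the household level, so an m:1 merge
    * copies its value onto every child's row in that household
    merge m:1 HHID using household_totals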


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str14 HHID double(total_hhmembers total_quan) byte child_age
    "1013000201"    6    .  .
    "1013000204"    3  1.5 31
    "1013000206"    1    .  .
    "1013000210"    1    .  .
    "1013000213"    1    .  .
    "101300021302"  4    .  .
    "1021000102"    5    .  .
    "1021000108"    4    .  .
    "1021000109"    5    .  .
    "1021000110"    6    .  .
    "1021000111"    8    .  .
    "1021000113"    4   19 32
    "1021000201"    5    .  .
    "1021000202"    1    .  .
    "1021000203"    9    .  .
    "102100020304"  4    .  .
    "1021000207"    3    .  .
    "1021000209"    1    .  .
    "1021000210"    1    .  .
    "1021000212"    4    .  .
    "1021000213"    8    .  .
    "1021000303"    1    .  .
    "1021000310"    1    .  .
    "1021000312"    3    .  .
    "1021000313"    1    .  .
    "1021000401"    1    .  .
    "1021000402"    3    .  .
    "1021000405"    4    .  .
    "102100040502"  5    .  .
    "102100040504"  3    .  .
    "1021000406"    5    .  .
    "102100040606"  1    .  .
    "1021000408"    7    .  .
    "1021000409"    2    .  .
    "1021000501"    1    .  .
    "1021000503"    1    .  .
    "1021000504"    4    .  .
    "1021000506"    5    .  .
    "1021000604"    8    .  .
    "1021000607"    8    .  .
    "1021000608"    4    .  .
    "1021000610"    7    .  .
    "1021000612"    6    .  .
    "1021000701"    6    .  .
    "1021000702"    3    .  .
    "1021000703"    1    .  .
    "102100070302"  2    .  .
    "1021000705"    2    .  .
    "1021000709"    8    .  .
    "1021000710"   13    .  .
    "1021000711"    5    .  .
    "1021000802"    1    .  .
    "1021000803"    1    .  .
    "1021000805"    1    .  .
    "102100080503"  1    .  .
    "1021000807"    2    .  .
    "1021000808"    2    .  .
    "102100080803"  5   52 42
    "1021000809"    2    .  .
    "1021000810"    2    .  .
    "1021000811"    5    .  .
    "1021000904"    6    .  .
    "1021000906"    2    .  .
    "1021000909"    3    .  .
    "1021001003"    6    .  .
    "1021001004"    4    .  .
    "1021001005"    5    .  .
    "1021001007"    6    .  .
    "1021001008"    4    .  .
    "1021001009"    2    .  .
    "1021001011"    1    .  .
    "1021001102"    1    .  .
    "102100110201"  7    7  7
    "102100110201"  7    7 47
    "1021001105"    1    .  .
    "1021001107"   10    .  .
    "1021001109"   12    .  .
    "102100110901"  0    .  .
    "102100110903"  3    .  .
    "102100110904"  3    .  .
    "1021001110"    1    .  .
    "1021001205"   11    .  .
    "1021001206"    3    .  .
    "1021001208"    7    .  .
    "1021001210"    4    .  .
    "1021001211"    2    .  .
    "1021001301"    4    .  .
    "1021001302"    5    .  .
    "1021001304"    8 52.5 50
    "1021001304"    8 52.5 27
    "1021001305"    3    .  .
    "1021001306"    3    .  .
    "1021001307"    4    .  .
    "1021001308"    2    .  .
    "1021001309"    1    .  .
    "1021001311"    4    .  .
    "1021001402"    2    .  .
    "102100140202"  2    .  .
    "102100140204"  1    .  .
    "1021001403"    3    .  .
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 3627 observations
    Use the count() option to list more


  • #2
    -duplicates drop- will eliminate any entirely duplicate observations from your data set.
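
    As a minimal sketch of that approach (note that -duplicates drop- with no varlist removes only observations that are identical on every variable, so near-duplicates that differ in, say, child_age will survive):

    Code:
    * tally complete duplicates, then drop the surplus copies
    duplicates report
    duplicates drop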

    However, rather than "fixing" the problem this way, I would urge you to first review the data management up to this point to understand why the data contain duplicate observations. It may be that they should be there, but that there should also be some differences between them that somehow got erased. It is also sometimes a symptom of other things gone wrong in data management when a data set that is planned to be unique at a given (household in your case, I take it) level has duplicates. So just dropping the duplicates may well be sweeping the problem under the rug. Instead, you should delve into the problem, understand its origins, and fix whatever led to the problem in the first place.
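
    One way to start that investigation, as a sketch (dup_hhid is just an illustrative variable name):

    Code:
    * how many observations share each HHID?
    duplicates report HHID

    * flag every observation whose HHID occurs more than once,
    * then inspect those households side by side
    duplicates tag HHID, generate(dup_hhid)
    sort HHID
    list HHID total_hhmembers total_quan child_age if dup_hhid > 0, sepby(HHID)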
