
  • Problem with -merge- command: lots of duplicate values, even after checking

    Hi everyone,

    I recently posted a question about how to compute a variable in minutes, in which Clyde Schechter gave me excellent advice. He also helped improve my understanding of the -merge- command, and of the mortal sin that is -merge m:m-. Thank you so much, Professor Schechter.
    I corrected my code as he recommended:

    Originally posted by Clyde Schechter View Post
    It is easy to identify the problem here. You committed what is, in Stata, the equivalent of a mortal sin. You used -merge m:m-, which produces data salad in almost every situation. Even the official StataCorp documentation for the -merge- command says you should pretty much never use this.

    The harder question is discovering why you were led to do that and what you should do instead. The variables common to both master and using are ID_HOGAR, ID_IND, ID_VIAJE, individ_ID and viaje_ID. The variables individ_ID and viaje_ID are redundant variables that just contain the "tails" of ID_IND and ID_VIAJE, respectively, so we don't have to think about them. In the using dataset, the combination of ID_HOGAR, ID_IND, ID_VIAJE, and ID_ETAPA uniquely identifies observations. But if you look at just ID_HOGAR, ID_IND, ID_VIAJE, they do not uniquely identify observations.

    So, in order to do this -merge- one needs to have ID_HOGAR, ID_IND, and ID_VIAJE uniquely identifying observations in the master data set. Alas, they do not. There are numerous instances of duplicates, and some triplicate observations having the same values for all three of these variables. Now, interestingly, it turns out that these observations, at least in the example data, are all exact duplicates on every variable. That is, they are complete, exact copies of each other. The surplus observations contain no additional information and do not belong in the data set. So the underlying problem is with the master data set.

    Now, it is easy enough to get rid of the surplus observations with -duplicates drop-. But I don't recommend doing that, at least not yet. The presence of these in the data probably indicates an error in the data management that led up to the creation of that data set. And where there is one error found, others may still lurk undetected. So I recommend you do a complete review of the work that led to the creation of the master data set to find out where and how those duplicate observations crept in, and then fix the code to remove that and any other errors you find along the way. (If you are truly unable to find or fix the source of the duplicate observations, then use -duplicates drop- before you attempt the -merge-. But it is usually possible to find the source, and that is the safer approach.) Once you have done that, the correct way to combine these data sets will be:

    Code:
    use master_data, clear // AFTER YOU HAVE GOTTEN RID OF THE DUPLICATES
    merge 1:m ID_HOGAR ID_IND ID_VIAJE using using_data
    And remember, if you are tempted to use -merge m:m- it almost always means one of the following:
    1. The data are wrong.
    2. The data are OK but you don't understand them properly.
    3. You need to use -joinby-, not -merge-.
    But I still encounter problems, and I do not know how to proceed. My master dataset does not contain any duplicate observations:

    Code:
    cd ${maindir}/04_pt_modes_cleaned/data/clean_edm
    use clean_edm_trips_costs_pt_taxis_privatefuelcosts_together.dta, clear
    
    . duplicates drop
    
    Duplicates in terms of all variables
    
    (0 observations are duplicates)
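    (A note on what this check covers: -duplicates drop- with no varlist only looks at observations that are identical on every variable. For the merge, what matters is whether the key variables alone uniquely identify observations. A quick sketch of that narrower check, using the key names from the -merge- call in this thread:)

    Code:
    ```stata
    * Does the merge key uniquely identify observations in the master dataset?
    isid ID_HOGAR ID_IND ID_VIAJE                 // errors out if the key is not unique
    duplicates report ID_HOGAR ID_IND ID_VIAJE    // tabulates surplus copies, if any
    ```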
    However, when I merge my master and using datasets, I get many duplicated values.

    Here's the code I used:
    Code:
    *2. Merge with master data clean_edm_trips_costs_pt_taxis_privatefuelcosts_togetherd.dta
    
    cd ${maindir}/04_pt_modes_cleaned/data/clean_edm
    use clean_edm_trips_costs_pt_taxis_privatefuelcosts_togetherd.dta, clear
    
    // merge the EESTACIONA variable. EESTACIONA tells us where people have parked.
    merge 1:m ID_HOGAR ID_IND ID_VIAJE using ${data}/Microdata/etapas_edm_2018_parking_data.dta, keepusing(EESTACIONA workplace_parking free_parking ser_tariffs public_tariffs)
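    (After running the merge, it usually helps to look at where the rows come from before concluding that anything was duplicated. A sketch, assuming the merge above has just been run:)

    Code:
    ```stata
    * In a 1:m merge, extra rows come from the using data (one per stage),
    * plus any unmatched observations; tabulate _merge to see the breakdown
    tab _merge
    duplicates report ID_HOGAR ID_IND ID_VIAJE if _merge == 3
    ```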
    Finally, here is my master dataset:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 individ_ID long ID_HOGAR str12 viaje_ID float date byte N_ETAPAS_POR_VIAJE
    "1734079_3" 1734079 "1734079_3_1" 21223 3
    "1734079_2" 1734079 "1734079_2_2" 21223 1
    "1734079_4" 1734079 "1734079_4_2" 21223 1
    "1734079_1" 1734079 "1734079_1_2" 21223 1
    "1734079_3" 1734079 "1734079_3_2" 21223 3
    "1734079_2" 1734079 "1734079_2_1" 21223 1
    "1734079_4" 1734079 "1734079_4_1" 21223 1
    "1734079_1" 1734079 "1734079_1_1" 21223 1
    "2463005_1" 2463005 "2463005_1_1" 21227 1
    "1953550_1" 1953550 "1953550_1_8" 21227 1
    "245520_3"   245520 "245520_3_5"  21227 1
    "522412_1"   522412 "522412_1_1"  21227 1
    "1493515_2" 1493515 "1493515_2_2" 21227 5
    "1777111_4" 1777111 "1777111_4_3" 21227 1
    "1107787_1" 1107787 "1107787_1_2" 21227 1
    "2010611_3" 2010611 "2010611_3_2" 21227 1
    "1093214_2" 1093214 "1093214_2_2" 21227 1
    "1093214_2" 1093214 "1093214_2_4" 21227 1
    "506289_2"   506289 "506289_2_1"  21227 1
    "1687815_3" 1687815 "1687815_3_1" 21227 1
    "527805_3"   527805 "527805_3_2"  21227 3
    "1957996_1" 1957996 "1957996_1_5" 21227 1
    "1953550_1" 1953550 "1953550_1_1" 21227 1
    "1953550_2" 1953550 "1953550_2_1" 21227 1
    "1575546_1" 1575546 "1575546_1_3" 21227 1
    "548892_1"   548892 "548892_1_3"  21227 1
    "1777111_4" 1777111 "1777111_4_2" 21227 1
    "2428229_1" 2428229 "2428229_1_7" 21227 1
    "2009099_1" 2009099 "2009099_1_3" 21227 1
    "1668877_2" 1668877 "1668877_2_2" 21227 2
    "1318545_2" 1318545 "1318545_2_4" 21227 1
    "1874598_1" 1874598 "1874598_1_1" 21227 1
    "1874598_2" 1874598 "1874598_2_1" 21227 1
    "1953550_1" 1953550 "1953550_1_5" 21227 1
    "1576271_3" 1576271 "1576271_3_3" 21227 1
    "556947_3"   556947 "556947_3_4"  21227 1
    "2009099_2" 2009099 "2009099_2_1" 21227 1
    "527805_1"   527805 "527805_1_1"  21227 1
    "2024382_2" 2024382 "2024382_2_2" 21227 1
    "2461844_3" 2461844 "2461844_3_2" 21227 1
    "546676_4"   546676 "546676_4_1"  21227 1
    "653680_4"   653680 "653680_4_2"  21227 1
    "1732362_2" 1732362 "1732362_2_3" 21227 2
    "2009350_2" 2009350 "2009350_2_1" 21227 1
    "2023802_1" 2023802 "2023802_1_1" 21227 2
    "491803_1"   491803 "491803_1_1"  21227 1
    "546676_2"   546676 "546676_2_2"  21227 1
    "2010146_2" 2010146 "2010146_2_5" 21227 1
    "2010146_2" 2010146 "2010146_2_1" 21227 1
    "556947_3"   556947 "556947_3_1"  21227 1
    "270583_2"   270583 "270583_2_1"  21227 1
    "653680_4"   653680 "653680_4_1"  21227 1
    "2010146_5" 2010146 "2010146_5_2" 21227 1
    "2173351_3" 2173351 "2173351_3_2" 21227 1
    "2023802_2" 2023802 "2023802_2_1" 21227 1
    "1957996_2" 1957996 "1957996_2_3" 21227 1
    "1318545_2" 1318545 "1318545_2_1" 21227 1
    "1397823_1" 1397823 "1397823_1_1" 21227 1
    "2010146_4" 2010146 "2010146_4_2" 21227 1
    "1093214_2" 1093214 "1093214_2_1" 21227 1
    "30408_1"     30408 "30408_1_1"   21227 4
    "2461418_2" 2461418 "2461418_2_3" 21227 1
    "245520_3"   245520 "245520_3_1"  21227 1
    "246831_2"   246831 "246831_2_1"  21227 4
    "245520_1"   245520 "245520_1_3"  21227 1
    "1557100_1" 1557100 "1557100_1_3" 21227 1
    "2009099_2" 2009099 "2009099_2_4" 21227 1
    "2023802_3" 2023802 "2023802_3_2" 21227 1
    "1107787_2" 1107787 "1107787_2_2" 21227 2
    "1576271_3" 1576271 "1576271_3_1" 21227 1
    "1387349_4" 1387349 "1387349_4_2" 21227 3
    "1953550_2" 1953550 "1953550_2_3" 21227 1
    "2462103_1" 2462103 "2462103_1_3" 21227 2
    "2009099_3" 2009099 "2009099_3_1" 21227 2
    "246831_3"   246831 "246831_3_1"  21227 1
    "721775_1"   721775 "721775_1_2"  21227 1
    "1588143_2" 1588143 "1588143_2_1" 21227 1
    "1614160_2" 1614160 "1614160_2_1" 21227 1
    "1668877_3" 1668877 "1668877_3_5" 21227 1
    "547785_1"   547785 "547785_1_1"  21227 1
    "2010146_4" 2010146 "2010146_4_1" 21227 1
    "2428229_1" 2428229 "2428229_1_1" 21227 1
    "1687563_2" 1687563 "1687563_2_1" 21227 1
    "522412_1"   522412 "522412_1_2"  21227 1
    "68123_1"     68123 "68123_1_2"   21227 1
    "1162081_2" 1162081 "1162081_2_1" 21227 1
    "1777111_1" 1777111 "1777111_1_1" 21227 1
    "175062_2"   175062 "175062_2_1"  21227 1
    "1162081_3" 1162081 "1162081_3_2" 21227 2
    "246831_3"   246831 "246831_3_4"  21227 3
    "2009099_1" 2009099 "2009099_1_2" 21227 1
    "246831_1"   246831 "246831_1_7"  21227 1
    "1794675_2" 1794675 "1794675_2_2" 21227 1
    "1499476_1" 1499476 "1499476_1_2" 21227 1
    "721775_3"   721775 "721775_3_2"  21227 1
    "1176130_3" 1176130 "1176130_3_1" 21227 1
    "721775_1"   721775 "721775_1_1"  21227 1
    "1397769_2" 1397769 "1397769_2_2" 21227 2
    "1668877_3" 1668877 "1668877_3_1" 21227 1
    "1765717_2" 1765717 "1765717_2_3" 21227 1
    end
    format %td date
    And here is my "using" dataset:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 individ_ID str14 etapa_ID long ID_HOGAR byte ID_ETAPA
    "1000154_1" "1000154_1_1_1" 1000154 1
    "1000154_1" "1000154_1_2_1" 1000154 1
    "1000154_2" "1000154_2_1_1" 1000154 1
    "1000154_2" "1000154_2_1_2" 1000154 2
    "1000154_2" "1000154_2_2_1" 1000154 1
    "1000154_2" "1000154_2_2_2" 1000154 2
    "1000289_1" "1000289_1_1_1" 1000289 1
    "1000289_1" "1000289_1_1_2" 1000289 2
    "1000289_1" "1000289_1_1_3" 1000289 3
    "1000289_1" "1000289_1_2_1" 1000289 1
    "1000289_1" "1000289_1_2_2" 1000289 2
    "1000289_1" "1000289_1_2_3" 1000289 3
    "1000289_3" "1000289_3_1_1" 1000289 1
    "1000289_3" "1000289_3_2_1" 1000289 1
    "1000289_3" "1000289_3_3_1" 1000289 1
    "1000289_3" "1000289_3_4_1" 1000289 1
    "1000370_1" "1000370_1_1_1" 1000370 1
    "1000370_1" "1000370_1_2_1" 1000370 1
    "1000370_1" "1000370_1_3_1" 1000370 1
    "1000370_1" "1000370_1_4_1" 1000370 1
    "1000370_2" "1000370_2_1_1" 1000370 1
    "1000370_2" "1000370_2_2_1" 1000370 1
    "1000457_1" "1000457_1_1_1" 1000457 1
    "1000457_1" "1000457_1_2_1" 1000457 1
    "1000457_2" "1000457_2_1_1" 1000457 1
    "1000457_2" "1000457_2_2_1" 1000457 1
    "1000457_4" "1000457_4_1_1" 1000457 1
    "1000457_4" "1000457_4_1_2" 1000457 2
    "1000457_4" "1000457_4_2_1" 1000457 1
    "1000457_4" "1000457_4_2_2" 1000457 2
    "1000660_1" "1000660_1_1_1" 1000660 1
    "1000660_1" "1000660_1_1_2" 1000660 2
    "1000660_1" "1000660_1_2_1" 1000660 1
    "1000660_1" "1000660_1_2_2" 1000660 2
    "1001299_1" "1001299_1_1_1" 1001299 1
    "1001299_1" "1001299_1_2_1" 1001299 1
    "1001530_1" "1001530_1_1_1" 1001530 1
    "1001530_1" "1001530_1_2_1" 1001530 1
    "1001530_1" "1001530_1_3_1" 1001530 1
    "1001530_1" "1001530_1_4_1" 1001530 1
    "1001728_1" "1001728_1_1_1" 1001728 1
    "1001728_1" "1001728_1_2_1" 1001728 1
    "1001728_2" "1001728_2_1_1" 1001728 1
    "1001728_2" "1001728_2_2_1" 1001728 1
    "1001942_1" "1001942_1_1_1" 1001942 1
    "1001942_1" "1001942_1_2_1" 1001942 1
    "1001942_2" "1001942_2_1_1" 1001942 1
    "1001942_2" "1001942_2_2_1" 1001942 1
    "1002132_1" "1002132_1_1_1" 1002132 1
    "1002132_1" "1002132_1_1_2" 1002132 2
    "1002132_1" "1002132_1_2_1" 1002132 1
    "1002132_1" "1002132_1_2_2" 1002132 2
    "1002132_2" "1002132_2_1_1" 1002132 1
    "1002132_2" "1002132_2_1_2" 1002132 2
    "1002132_2" "1002132_2_1_3" 1002132 3
    "1002132_2" "1002132_2_2_1" 1002132 1
    "1002132_2" "1002132_2_2_2" 1002132 2
    "1002132_2" "1002132_2_2_3" 1002132 3
    "1002179_1" "1002179_1_1_1" 1002179 1
    "1002179_1" "1002179_1_2_1" 1002179 1
    "1002291_1" "1002291_1_1_1" 1002291 1
    "1002291_1" "1002291_1_2_1" 1002291 1
    "1002291_1" "1002291_1_3_1" 1002291 1
    "1002291_1" "1002291_1_4_1" 1002291 1
    "1002291_2" "1002291_2_1_1" 1002291 1
    "1002291_2" "1002291_2_2_1" 1002291 1
    "1002315_1" "1002315_1_1_1" 1002315 1
    "1002315_1" "1002315_1_2_1" 1002315 1
    "1002400_1" "1002400_1_1_1" 1002400 1
    "1002400_1" "1002400_1_2_1" 1002400 1
    "1002400_1" "1002400_1_3_1" 1002400 1
    "1002400_1" "1002400_1_4_1" 1002400 1
    "1002400_1" "1002400_1_5_1" 1002400 1
    "1002400_1" "1002400_1_6_1" 1002400 1
    "1002400_2" "1002400_2_1_1" 1002400 1
    "1002400_2" "1002400_2_2_1" 1002400 1
    "1002400_2" "1002400_2_3_1" 1002400 1
    "1002400_2" "1002400_2_4_1" 1002400 1
    "1002400_3" "1002400_3_1_1" 1002400 1
    "1002400_3" "1002400_3_1_2" 1002400 2
    "1002400_3" "1002400_3_2_1" 1002400 1
    "1002400_3" "1002400_3_2_2" 1002400 2
    "1002741_1" "1002741_1_1_1" 1002741 1
    "1002741_2" "1002741_2_1_1" 1002741 1
    "1002741_3" "1002741_3_1_1" 1002741 1
    "1002741_3" "1002741_3_2_1" 1002741 1
    "1002741_4" "1002741_4_1_1" 1002741 1
    "1002741_4" "1002741_4_2_1" 1002741 1
    "1002741_5" "1002741_5_1_1" 1002741 1
    "1002741_5" "1002741_5_1_2" 1002741 2
    "1002741_5" "1002741_5_2_1" 1002741 1
    "1002741_5" "1002741_5_2_2" 1002741 2
    "1002741_6" "1002741_6_1_1" 1002741 1
    "1002809_1" "1002809_1_1_1" 1002809 1
    "1002809_1" "1002809_1_2_1" 1002809 1
    "1002809_1" "1002809_1_3_1" 1002809 1
    "1002809_2" "1002809_2_1_1" 1002809 1
    "1002809_2" "1002809_2_2_1" 1002809 1
    "1002809_2" "1002809_2_3_1" 1002809 1
    "1002956_2" "1002956_2_1_1" 1002956 1
    end
    The merged result is:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 viaje_ID byte ID_IND long ID_HOGAR float date byte N_ETAPAS_POR_VIAJE str4(VORIHORAINI VDESHORAFIN)
    "189_1_1"  1  189 21299 1 "0800" "0830"
    "189_1_2"  1  189 21299 1 "1800" "1830"
    "244_1_1"  1  244 21300 1 "0800" "0820"
    "244_1_2"  1  244 21300 1 "1800" "1820"
    "244_2_1"  2  244 21300 1 "1113" "1115"
    "244_2_2"  2  244 21300 1 "1218" "1220"
    "324_1_1"  1  324 21299 1 "1117" "1120"
    "324_1_2"  1  324 21299 1 "1312" "1315"
    "414_1_1"  1  414 21300 1 "1059" "1100"
    "414_1_2"  1  414 21300 1 "1319" "1320"
    "414_2_1"  2  414 21300 1 "1156" "1200"
    "414_2_2"  2  414 21300 1 "1256" "1300"
    "414_3_1"  3  414 21300 1 "0841" "0845"
    "414_3_2"  3  414 21300 1 "1711" "1715"
    "414_3_3"  3  414 21300 1 "1900" "1920"
    "581_1_1"  1  581 21299 1 "0700" "0705"
    "581_2_1"  2  581 21299 1 "0800" "0830"
    "581_2_2"  2  581 21299 1 "1500" "1530"
    "581_3_1"  3  581 21299 1 "0856" "0900"
    "581_3_2"  3  581 21299 1 "1611" "1615"
    "581_4_1"  4  581 21299 1 "0830" "0900"
    "581_4_2"  4  581 21299 1 "1600" "1615"
    "742_1_1"  1  742 21299 1 "0700" "0800"
    "742_1_2"  1  742 21299 1 "1600" "1645"
    "742_2_1"  2  742 21299 2 "0800" "0930"
    "742_2_2"  2  742 21299 2 "1700" "1820"
    "759_1_1"  1  759 21300 1 "0740" "0800"
    "759_1_2"  1  759 21300 1 "0800" "0815"
    "759_1_3"  1  759 21300 1 "1433" "1445"
    "759_1_4"  1  759 21300 1 "1445" "1510"
    "759_2_1"  2  759 21300 1 "0645" "0800"
    "759_2_2"  2  759 21300 1 "1800" "1930"
    "759_3_1"  3  759 21300 1 "0730" "0815"
    "759_3_2"  3  759 21300 1 "1415" "1500"
    "759_4_1"  4  759 21300 1 "0745" "0800"
    "759_4_2"  4  759 21300 1 "1445" "1500"
    "914_1_1"  1  914 21299 1 "1100" "1115"
    "914_1_2"  1  914 21299 1 "1300" "1320"
    "914_2_1"  2  914 21299 1 "0800" "0815"
    "914_2_2"  2  914 21299 1 "1800" "1820"
    "1096_1_1" 1 1096 21300 2 "0740" "0800"
    "1096_1_2" 1 1096 21300 3 "1500" "1600"
    "1096_2_1" 2 1096 21300 1 "0857" "0900"
    "1096_2_2" 2 1096 21300 1 "0907" "0915"
    "1096_2_3" 2 1096 21300 1 "1159" "1200"
    "1096_2_4" 2 1096 21300 1 "1349" "1400"
    "1096_2_5" 2 1096 21300 1 "1406" "1410"
    "1096_3_1" 3 1096 21300 1 "0851" "0900"
    "1096_3_2" 3 1096 21300 1 "1400" "1410"
    "1479_1_1" 1 1479 21300 1 "0812" "0815"
    "1479_1_2" 1 1479 21300 1 "0815" "0900"
    
    end
    format %td date
    I now have 257,138 observations, instead of my initial 222,744.

    Could anyone suggest a solution, please?

    I don't know if this is useful, but my master dataset aggregates each individual's journeys into one row per journey.
    For example, instead of detailing each stage of each journey, it records the number of stages per journey in the variable N_ETAPAS_POR_VIAJE.

    On the other hand, the using dataset details each stage of each trip for each individual. For example, an observation for stage 1 of trip 1, an observation for stage 2 of trip 1, etc. It is represented by the variable ID_ETAPA.
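    (Given that structure, an increase in observations after a 1:m merge is expected rather than a sign of duplication: each trip-level master row is repeated once per stage in the using data. A rough consistency check, sketched with the filename from the merge call above:)

    Code:
    ```stata
    * Trips with more than one stage in the using data account for the
    * growth from 222,744 to 257,138 rows; count them directly
    use ${data}/Microdata/etapas_edm_2018_parking_data.dta, clear
    bysort ID_HOGAR ID_IND ID_VIAJE: gen long n_stages = _N
    summarize n_stages
    count if n_stages > 1
    ```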

    Thank you very much for the help provided.
    Best,

    Michael
    Last edited by Michael Duarte Goncalves; 24 Oct 2023, 01:58.

  • #2
    It is hard to tell what is going on from the data provided. For example, the record
    Code:
     
     "189_1_1"  1  189 21299 1 "0800" "0830"
    appears in the merged data, but not in either the master or the using data. Maybe if you cut the example down to a handful of records that display the problem, it would be easier to diagnose. The "keepusing" option means that the merged file will have the same number of records as the using file; is that not happening?

    Perhaps I am in the minority on this issue, but almost any Stata difficulty can be displayed with 3 or fewer records, and having a very small dataset to ponder makes it much easier to diagnose a problem. I see so many long dataex displays, and their size discourages me from looking closely. I have the same attitude towards long variable lists when only one or two variables are relevant, and towards overlong variable names.

    Comment


    • #3
      Thanks for your feedback, Daniel Feenberg.

      You're right about the dataex. I apologise for that.

      My line managers suggested a way to make the two datasets match in number of observations: using -duplicates drop varlist-.
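      (For reference: when -duplicates drop- is given a varlist, Stata also requires the -force- option, because observations that agree on the listed variables may still differ on the others, and those differences are silently discarded. A sketch with the keys from this thread:)

      Code:
      ```stata
      * Dropping on a subset of variables requires -force-; information in the
      * non-key variables of the dropped observations is lost
      duplicates drop ID_HOGAR ID_IND ID_VIAJE, force
      ```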

      So, let's say the problem is solved, at least in appearance.
      Thank you again.

      Best regards,

      Michael

      Comment
