Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Date variables: Compute a variable in minutes

    Hi everyone,

    I want to create a variable in minutes, from two variables in date %tC format. In fact, for each -individ_ID- I want to calculate the difference between the arrival time of the 1st trip (end_time) and the departure time of the second trip (start_time). Then the time difference between the arrival time of the second trip if it exists (end_time) and the departure time of the 3rd trip (start_time), etc.

    The journey variable corresponds to -ID_VIAJE- and 1 means 1st journey, 2 means 2nd journey and so on. Below is an example of my dataset.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 individ_ID long ID_HOGAR byte ID_VIAJE double(start_time end_time)
    "189_1"   189 2 64800000 66600000
    "189_1"   189 1 28800000 30600000
    "244_2"   244 1 40380000 40500000
    "244_1"   244 1 28800000 30000000
    "244_1"   244 2 64800000 66000000
    "244_2"   244 2 44280000 44400000
    "324_1"   324 1 40620000 40800000
    "324_1"   324 2 47520000 47700000
    "414_1"   414 2 47940000 48000000
    "414_3"   414 1 31260000 31500000
    "414_2"   414 1 42960000 43200000
    "414_1"   414 1 39540000 39600000
    "414_3"   414 2 61860000 62100000
    "414_3"   414 3 68400000 69600000
    "414_2"   414 2 46560000 46800000
    "581_3"   581 1 32160000 32400000
    "581_2"   581 2 54000000 55800000
    "581_1"   581 1 25200000 25500000
    "581_4"   581 2 57600000 58500000
    "581_4"   581 1 30600000 32400000
    "581_3"   581 2 58260000 58500000
    "581_2"   581 1 28800000 30600000
    "742_1"   742 1 25200000 28800000
    "742_2"   742 2 61200000 66000000
    "742_1"   742 2 57600000 60300000
    "742_2"   742 1 28800000 34200000
    "759_3"   759 2 51300000 54000000
    "759_1"   759 4 53100000 54600000
    "759_2"   759 2 64800000 70200000
    "759_4"   759 2 53100000 54000000
    "759_2"   759 1 24300000 28800000
    "759_3"   759 1 27000000 29700000
    "759_1"   759 1 27600000 28800000
    "759_4"   759 1 27900000 28800000
    "759_1"   759 3 52380000 53100000
    "759_1"   759 2 28800000 29700000
    "914_2"   914 1 28800000 29700000
    "914_1"   914 2 46800000 48000000
    "914_2"   914 2 64800000 66000000
    "914_1"   914 1 39600000 40500000
    "1096_2" 1096 5 50760000 51000000
    "1096_1" 1096 1 27600000 28800000
    "1096_1" 1096 2 54000000 57600000
    "1096_2" 1096 2 32820000 33300000
    "1096_2" 1096 4 49740000 50400000
    "1096_2" 1096 3 43140000 43200000
    "1096_3" 1096 1 31860000 32400000
    "1096_3" 1096 2 50400000 51000000
    "1096_2" 1096 1 32220000 32400000
    "1479_2" 1479 1 27000000 29700000
    "1479_1" 1479 1 29520000 29700000
    "1479_1" 1479 4 61200000 61800000
    "1479_1" 1479 3 57600000 61200000
    "1479_3" 1479 2 52020000 52200000
    "1479_2" 1479 2 51300000 54000000
    "1479_1" 1479 2 29700000 32400000
    "1479_3" 1479 1 29520000 29700000
    "1508_2" 1508 1 33300000 34800000
    "1508_2" 1508 2 61200000 63000000
    "1508_1" 1508 2 64800000 66600000
    "1508_1" 1508 1 28800000 31200000
    "1710_3" 1710 1 50040000 50400000
    "1710_3" 1710 2 61200000 62460000
    "1710_1" 1710 2 34560000 35760000
    "1710_2" 1710 2 68400000 73200000
    "1710_2" 1710 1 28800000 31500000
    "1710_4" 1710 2 56340000 56700000
    "1710_3" 1710 3 72000000 72300000
    "1710_1" 1710 1 30600000 34200000
    "1710_4" 1710 1 53640000 54000000
    "1710_1" 1710 3 63000000 66900000
    "2275_3" 2275 1 29460000 30600000
    "2275_2" 2275 1 29460000 30000000
    "2322_5" 2322 1 51120000 57600000
    "2322_1" 2322 1 31200000 36000000
    "2322_3" 2322 1 51120000 57600000
    "2332_1" 2332 4 78300000 81000000
    "2332_1" 2332 5 82800000 84300000
    "2332_1" 2332 2 64800000 67500000
    "2332_1" 2332 1 25200000 29400000
    "2502_2" 2502 2 66600000 69300000
    "2502_3" 2502 2 50400000 55200000
    "2502_2" 2502 1 21600000 23400000
    "2502_3" 2502 1 23400000 27000000
    "2882_2" 2882 1 52920000 59400000
    "3216_5" 3216 2 73800000 79200000
    "3216_2" 3216 2 66600000 72000000
    "3216_3" 3216 3 70200000 72000000
    "3216_1" 3216 1 23100000 27600000
    "3216_1" 3216 3 68400000 70200000
    "3216_5" 3216 1 52200000 56100000
    "3216_1" 3216 2 50400000 54900000
    "3216_4" 3216 1 37800000 45000000
    "3216_3" 3216 1 37800000 45000000
    "3216_2" 3216 1 36000000 41400000
    "3216_4" 3216 2 62160000 69300000
    "3216_3" 3216 2 63000000 69300000
    "3230_1" 3230 4 68400000 70200000
    "3230_1" 3230 2 38700000 42300000
    "3230_1" 3230 1 36000000 37800000
    end
    format %tC start_time
    format %tC end_time
    Could someone please help me solve this problem?
    Thank you in advance.

    Best,

    Michael

  • #2
    Code:
    by individ_ID (start_time), sort: gen interlude = ///
        clockdiff(end_time[_n-1], start_time, "m") if _n > 1

    Comment


    • #3
      Oh, nice! Thank you so much for your help Clyde Schechter!

      It works beautifully well.
      Best wishes,

      Michael

      Comment


      • #4
        Hi Clyde Schechter:

        I have a question. I use the code that you presented in #2. Thank you so much for that.

        But, I obtain some negative values. Could you tell me why please. I don't understand why.


        Code:
        . sum interlude, d
        
                 time elapsed in min. between end[_n-1] and
                                 start[_n]
        --------------------------------------------------------
        > -----
              Percentiles      Smallest
         1%            0            -10
         5%            0             -6
        10%            5             -5       Obs             14
        > 7,536
        25%           45             -3       Sum of wgt.     14
        > 7,536
        
        50%          130                      Mean           203
        > .3953
                                Largest       Std. dev.      193
        > .3225
        75%          350           1075
        90%          495           1080       Variance       373
        > 73.57
        95%          575           1082       Skewness       .90
        > 40786
        99%          710           1100       Kurtosis        2.
        > 85176
        
        .
        I tried:

        Code:
        hms(h,m,s) function
        But it is not solving the problem.

        Here is an example of my dataset:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str9 individ_ID long ID_HOGAR byte ID_VIAJE float date byte(start_hour start_minute end_hour end_minute) double seconds
        "1734079_2" 1734079 2 21223 19 45 20 15 0
        "1734079_4" 1734079 2 21223 16  0 16 10 0
        "1734079_3" 1734079 2 21223 21  0 22  0 0
        "1734079_1" 1734079 1 21223  9  0  9 40 0
        "1734079_2" 1734079 1 21223  9  0  9 20 0
        "1734079_3" 1734079 1 21223 14 30 15  0 0
        "1734079_4" 1734079 1 21223  9 50  9 55 0
        "1734079_1" 1734079 2 21223 13 20 14 20 0
        "506289_2"   506289 1 21227 13 12 15  0 0
        "1957996_1" 1957996 4 21227 16 52 16 55 0
        "1953550_1" 1953550 3 21227 12  6 12 10 0
        "506289_2"   506289 2 21227 15  0 15 20 0
        "2150283_2" 2150283 1 21227 10  0 10 30 0
        "1957826_1" 1957826 1 21227 14  1 14 15 0
        "2461844_3" 2461844 2 21227 14  0 14 15 0
        "849431_1"   849431 2 21227 13  0 14  0 0
        "175062_2"   175062 6 21227 18 30 19  0 0
        "1401789_2" 1401789 7 21227 20  4 20  5 0
        "1687815_3" 1687815 2 21227 17  0 17 50 0
        "307612_2"   307612 2 21227 19 30 19 45 0
        "307612_2"   307612 1 21227  9  0  9 15 0
        "2173351_2" 2173351 1 21227 22 10 22 45 0
        "1397823_2" 1397823 1 21227  7  0  7 50 0
        "1401789_2" 1401789 6 21227 17  4 17  5 0
        "1318545_1" 1318545 1 21227 13 45 14  0 0
        "546676_2"   546676 3 21227 18  0 18 10 0
        "1777111_4" 1777111 3 21227 16 35 16 40 0
        "2010146_2" 2010146 1 21227  8 46  8 55 0
        "1087049_2" 1087049 1 21227  9  0  9 45 0
        "1957826_2" 1957826 3 21227  8 20  9  0 0
        "2023802_3" 2023802 3 21227 16 45 16 50 0
        "1493515_3" 1493515 3 21227 20 16 20 20 0
        "1957996_2" 1957996 2 21227 15 30 15 45 0
        "2024382_2" 2024382 2 21227 14 22 14 30 0
        "2428229_1" 2428229 6 21227 15 20 15 40 0
        "1162081_3" 1162081 1 21227  8  0  8 46 0
        "1668877_1" 1668877 1 21227  8  0  8 40 0
        "1668877_2" 1668877 1 21227  7  0  8 15 0
        "1668877_3" 1668877 2 21227 14 30 15  0 0
        "2010146_4" 2010146 2 21227 14  1 14 10 0
        "2174589_1" 2174589 2 21227 14  0 14 20 0
        "1473198_1" 1473198 1 21227 10 30 10 50 0
        "2023802_2" 2023802 1 21227 11 58 12  0 0
        "246831_1"   246831 7 21227 19 20 19 30 0
        "820790_1"   820790 1 21227 12 10 12 40 0
        "1401789_2" 1401789 3 21227 10 30 10 45 0
        "556947_2"   556947 5 21227 17  0 17  5 0
        "111601_2"   111601 3 21227 19  0 20 30 0
        "246831_2"   246831 3 21227 17  0 17 10 0
        "2010611_1" 2010611 3 21227 14 30 15  0 0
        "653680_3"   653680 3 21227 14  0 14 20 0
        "2428229_1" 2428229 7 21227 15 45 16  0 0
        "653680_3"   653680 2 21227 12 30 12 45 0
        "175062_3"   175062 1 21227  8 25  8 30 0
        "1493515_2" 1493515 3 21227 16  0 17  0 0
        "1687815_1" 1687815 1 21227  7  0  8 30 0
        "1794675_3" 1794675 1 21227 10  0 10 15 0
        "258555_3"   258555 2 21227 15  5 15 15 0
        "1318545_2" 1318545 2 21227 11 52 12  0 0
        "527805_3"   527805 1 21227  6  0  7 30 0
        "1087049_2" 1087049 2 21227 13  0 13 30 0
        "2010611_1" 2010611 4 21227 16 15 16 25 0
        "1420142_2" 1420142 2 21227 10  0 10  5 0
        "1515560_2" 1515560 3 21227 19  0 19  5 0
        "175062_4"   175062 2 21227 15 20 15 30 0
        "2023802_1" 2023802 2 21227 13 30 15  0 0
        "1794675_2" 1794675 3 21227 18  0 18 15 0
        "2010611_2" 2010611 2 21227 15 30 16  0 0
        "2010611_4" 2010611 1 21227  7 20  7 30 0
        "1107787_2" 1107787 1 21227  6  0  7 15 0
        "1401789_1" 1401789 1 21227  7  4  7  5 0
        "2009099_3" 2009099 1 21227  7  1  7 40 0
        "1493515_2" 1493515 4 21227 19  0 20  0 0
        "1397823_3" 1397823 1 21227  7  0  8  0 0
        "1397769_2" 1397769 1 21227  8 30  9  0 0
        "556947_2"   556947 6 21227 18  0 18  5 0
        "1576271_3" 1576271 3 21227 18 54 19  0 0
        "2173351_1" 2173351 2 21227 21  0 22 30 0
        "2313620_1" 2313620 2 21227 11 30 11 35 0
        "312660_1"   312660 2 21227 18  0 18 30 0
        "1794832_2" 1794832 5 21227 20 30 20 40 0
        "1499476_1" 1499476 3 21227 17 40 18 15 0
        "2010611_3" 2010611 3 21227 17 51 17 55 0
        "1176130_1" 1176130 3 21227 12 56 13  0 0
        "1087049_1" 1087049 1 21227  9  0  9 45 0
        "1471133_1" 1471133 2 21227  9  0 10  0 0
        "653680_4"   653680 3 21227 14  0 14 20 0
        "653680_1"   653680 6 21227 16 10 16 20 0
        "2461844_2" 2461844 1 21227  8 45  9  0 0
        "1732362_3" 1732362 2 21227 19 30 20  0 0
        "1575565_1" 1575565 1 21227 10  0 10  5 0
        "1401789_2" 1401789 2 21227  9 10  9 20 0
        "1588143_2" 1588143 2 21227 20 30 20 59 0
        "2010146_2" 2010146 6 21227 18 44 18 50 0
        "2009350_2" 2009350 1 21227  9 30  9 40 0
        "2009099_3" 2009099 2 21227 15  0 15 30 0
        "2009099_2" 2009099 3 21227 15  0 15 30 0
        "1668877_2" 1668877 2 21227 14 30 15 20 0
        "1107787_2" 1107787 4 21227 21  0 21 20 0
        "2428229_2" 2428229 2 21227 14 20 14 40 0
        end
        format %td date

        Here is my code:

        Code:
        by individ_ID (start_time), sort: gen interlude = ///
            clockdiff(end_time[_n-1], start_time, "m") if _n > 1
        -start_time- and -end_time- are in the following "format":

        Code:
        01jan1960 19:45:00
        Thank you for your help Clyde.
        Best,

        Michael
        Last edited by Michael Duarte Goncalves; 23 Oct 2023, 03:34.

        Comment


        • #5
          I think the problems lie in the data, not the code.

          For example, consider the first person's first two entries

          Code:
          "1734079_1" 1734079 1 21223  9  0  9 40 0
          "1734079_2" 1734079 1 21223  9  0  9 20 0
          This person shows two overlapping trips and supposedly trip 2 ending before the end of trip 1.

          Comment


          • #6
            Hi Nick Cox:

            Thank you for your feedback. I have checked, and the entry that you post in #5 means the following:

            Code:
             "1734079_1" 1734079 1 21223  9  0  9 40 0  
             "1734079_2" 1734079 1 21223  9  0  9 20 0
            1734079 represents the household's ID, and "_1" the individual 1 from this household.
            So "1734079_2" represents the household's ID, and the second individual from this household.

            -ID_VIAJE- represents the corresponding journey of the household-individual individ_ID.

            So in this case, these two entries represent two different people.
            Sorry for the confusion.

            Best,

            Michael
            Last edited by Michael Duarte Goncalves; 23 Oct 2023, 04:40.

            Comment


            • #7
              Hi Nick Cox:

              You are right. There is a problem with the data!
              Thanks!

              I need your help please:
              I have merged two datasets. But this created some duplicated values after the -merge- operation, and I don't know why,


              Could you help me please? This is what I did:

              Code:
              *2. Merge with master data  clean_edm_trips_costs_pt_taxis_privatefuelcosts_togetherd.dta
              
              cd ${maindir}/04_pt_modes_cleaned/data/clean_edm
              use clean_edm_trips_costs_pt_taxis_privatefuelcosts_together.dta, clear
              
              
              
              tostring ID_HOGAR, gen(str_Hogar)
              tostring ID_IND, gen(str_Ind)
              tostring ID_VIAJE, gen(str_Viaje)
              
              
              gen viaje_ID = str_Hogar + "_" + str_Ind + "_" + str_Viaje
              order viaje_ID, after(individ_ID)
              drop str_Hogar str_Ind str_Viaje
              
              isid viaje_ID
              
              
              // merge EESTACIONA variable. EESTACIONA tells us where people parked. 
              merge m:m viaje_ID using ${data}/Microdata/etapas_edm_2018_parking_data.dta,  keep(3) keepusing(EESTACIONA workplace_parking free_parking ser_tariffs public_tariffs) 
              sort viaje_ID
              Here are the two datasets. Let's call the first shown "master dataset", and the second the "using dataset".

              "master dataset" after -merge-:

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str9 individ_ID str12 viaje_ID long ID_HOGAR byte(ID_IND ID_VIAJE) float date byte N_ETAPAS_POR_VIAJE
              "1000154_1" "1000154_1_1" 1000154 1 1 21328 1
              "1000154_1" "1000154_1_2" 1000154 1 2 21328 1
              "1000154_2" "1000154_2_1" 1000154 2 1 21328 2
              "1000154_2" "1000154_2_1" 1000154 2 1 21328 2
              "1000154_2" "1000154_2_2" 1000154 2 2 21328 2
              "1000154_2" "1000154_2_2" 1000154 2 2 21328 2
              "1000289_1" "1000289_1_1" 1000289 1 1 21325 3
              "1000289_1" "1000289_1_1" 1000289 1 1 21325 3
              "1000289_1" "1000289_1_1" 1000289 1 1 21325 3
              "1000289_1" "1000289_1_2" 1000289 1 2 21325 3
              "1000289_1" "1000289_1_2" 1000289 1 2 21325 3
              "1000289_1" "1000289_1_2" 1000289 1 2 21325 3
              "1000289_3" "1000289_3_1" 1000289 3 1 21325 1
              "1000289_3" "1000289_3_2" 1000289 3 2 21325 1
              "1000289_3" "1000289_3_3" 1000289 3 3 21325 1
              "1000289_3" "1000289_3_4" 1000289 3 4 21325 1
              "1000370_1" "1000370_1_1" 1000370 1 1 21325 1
              "1000370_1" "1000370_1_2" 1000370 1 2 21325 1
              "1000370_1" "1000370_1_3" 1000370 1 3 21325 1
              "1000370_1" "1000370_1_4" 1000370 1 4 21325 1
              "1000370_2" "1000370_2_1" 1000370 2 1 21325 1
              "1000370_2" "1000370_2_2" 1000370 2 2 21325 1
              "1000457_1" "1000457_1_1" 1000457 1 1 21325 1
              "1000457_1" "1000457_1_2" 1000457 1 2 21325 1
              "1000457_2" "1000457_2_1" 1000457 2 1 21325 1
              "1000457_2" "1000457_2_2" 1000457 2 2 21325 1
              "1000457_4" "1000457_4_1" 1000457 4 1 21325 2
              "1000457_4" "1000457_4_1" 1000457 4 1 21325 2
              "1000457_4" "1000457_4_2" 1000457 4 2 21325 2
              "1000457_4" "1000457_4_2" 1000457 4 2 21325 2
              "1000660_1" "1000660_1_1" 1000660 1 1 21325 2
              "1000660_1" "1000660_1_1" 1000660 1 1 21325 2
              "1000660_1" "1000660_1_2" 1000660 1 2 21325 2
              "1000660_1" "1000660_1_2" 1000660 1 2 21325 2
              "1001299_1" "1001299_1_1" 1001299 1 1 21264 1
              "1001299_1" "1001299_1_2" 1001299 1 2 21264 1
              "1001530_1" "1001530_1_1" 1001530 1 1 21333 1
              "1001530_1" "1001530_1_2" 1001530 1 2 21333 1
              "1001530_1" "1001530_1_3" 1001530 1 3 21333 1
              "1001530_1" "1001530_1_4" 1001530 1 4 21333 1
              "1001728_1" "1001728_1_1" 1001728 1 1 21298 1
              "1001728_1" "1001728_1_2" 1001728 1 2 21298 1
              "1001728_2" "1001728_2_1" 1001728 2 1 21298 1
              "1001728_2" "1001728_2_2" 1001728 2 2 21298 1
              "1001942_1" "1001942_1_1" 1001942 1 1 21286 1
              "1001942_1" "1001942_1_2" 1001942 1 2 21286 1
              "1001942_2" "1001942_2_1" 1001942 2 1 21286 1
              "1001942_2" "1001942_2_2" 1001942 2 2 21286 1
              "1002132_1" "1002132_1_1" 1002132 1 1 21285 2
              "1002132_1" "1002132_1_1" 1002132 1 1 21285 2
              "1002132_1" "1002132_1_2" 1002132 1 2 21285 2
              "1002132_1" "1002132_1_2" 1002132 1 2 21285 2
              "1002132_2" "1002132_2_1" 1002132 2 1 21285 3
              "1002132_2" "1002132_2_1" 1002132 2 1 21285 3
              "1002132_2" "1002132_2_1" 1002132 2 1 21285 3
              "1002132_2" "1002132_2_2" 1002132 2 2 21285 3
              "1002132_2" "1002132_2_2" 1002132 2 2 21285 3
              "1002132_2" "1002132_2_2" 1002132 2 2 21285 3
              "1002179_1" "1002179_1_1" 1002179 1 1 21313 1
              "1002179_1" "1002179_1_2" 1002179 1 2 21313 1
              "1002291_1" "1002291_1_1" 1002291 1 1 21314 1
              "1002291_1" "1002291_1_2" 1002291 1 2 21314 1
              "1002291_1" "1002291_1_3" 1002291 1 3 21314 1
              "1002291_1" "1002291_1_4" 1002291 1 4 21314 1
              "1002291_2" "1002291_2_1" 1002291 2 1 21314 1
              "1002291_2" "1002291_2_2" 1002291 2 2 21314 1
              "1002315_1" "1002315_1_1" 1002315 1 1 21314 1
              "1002315_1" "1002315_1_2" 1002315 1 2 21314 1
              "1002400_1" "1002400_1_1" 1002400 1 1 21314 1
              "1002400_1" "1002400_1_2" 1002400 1 2 21314 1
              "1002400_1" "1002400_1_3" 1002400 1 3 21314 1
              "1002400_1" "1002400_1_4" 1002400 1 4 21314 1
              "1002400_1" "1002400_1_5" 1002400 1 5 21314 1
              "1002400_1" "1002400_1_6" 1002400 1 6 21314 1
              "1002400_2" "1002400_2_1" 1002400 2 1 21314 1
              "1002400_2" "1002400_2_2" 1002400 2 2 21314 1
              "1002400_2" "1002400_2_3" 1002400 2 3 21314 1
              "1002400_2" "1002400_2_4" 1002400 2 4 21314 1
              "1002400_3" "1002400_3_1" 1002400 3 1 21314 2
              "1002400_3" "1002400_3_1" 1002400 3 1 21314 2
              "1002400_3" "1002400_3_2" 1002400 3 2 21314 2
              "1002400_3" "1002400_3_2" 1002400 3 2 21314 2
              "1002741_1" "1002741_1_1" 1002741 1 1 21237 1
              "1002741_2" "1002741_2_1" 1002741 2 1 21237 1
              "1002741_3" "1002741_3_1" 1002741 3 1 21237 1
              "1002741_3" "1002741_3_2" 1002741 3 2 21237 1
              "1002741_4" "1002741_4_1" 1002741 4 1 21237 1
              "1002741_4" "1002741_4_2" 1002741 4 2 21237 1
              "1002741_5" "1002741_5_1" 1002741 5 1 21237 2
              "1002741_5" "1002741_5_1" 1002741 5 1 21237 2
              "1002741_5" "1002741_5_2" 1002741 5 2 21237 2
              "1002741_5" "1002741_5_2" 1002741 5 2 21237 2
              "1002741_6" "1002741_6_1" 1002741 6 1 21237 1
              "1002809_1" "1002809_1_1" 1002809 1 1 21235 1
              "1002809_1" "1002809_1_2" 1002809 1 2 21235 1
              "1002809_1" "1002809_1_3" 1002809 1 3 21235 1
              "1002809_2" "1002809_2_1" 1002809 2 1 21235 1
              "1002809_2" "1002809_2_2" 1002809 2 2 21235 1
              "1002809_2" "1002809_2_3" 1002809 2 3 21235 1
              "1002956_2" "1002956_2_1" 1002956 2 1 21250 2
              end
              format %td date
              The master dataset aggregates the steps by each individual. In other terms, for example, "1000154_1" takes one step to go to work. On contrary, "1000154_2" takes two steps to go to work, The number of steps is given by variable "N_ETAPAS_POR_VIAJE".

              Here is the "using dataset":


              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str9 individ_ID str12 viaje_ID long ID_HOGAR byte(ID_IND ID_VIAJE ID_ETAPA EMODO EESTACIONA)
              "1000154_1" "1000154_1_1" 1000154 1 1 1 13 1
              "1000154_1" "1000154_1_2" 1000154 1 2 1 13 1
              "1000154_2" "1000154_2_1" 1000154 2 1 1  4 .
              "1000154_2" "1000154_2_1" 1000154 2 1 2  4 .
              "1000154_2" "1000154_2_2" 1000154 2 2 1  4 .
              "1000154_2" "1000154_2_2" 1000154 2 2 2  4 .
              "1000289_1" "1000289_1_1" 1000289 1 1 2  4 .
              "1000289_1" "1000289_1_1" 1000289 1 1 3  7 .
              "1000289_1" "1000289_1_1" 1000289 1 1 1  4 .
              "1000289_1" "1000289_1_2" 1000289 1 2 1  7 .
              "1000289_1" "1000289_1_2" 1000289 1 2 3  4 .
              "1000289_1" "1000289_1_2" 1000289 1 2 2  4 .
              "1000289_3" "1000289_3_1" 1000289 3 1 1  1 .
              "1000289_3" "1000289_3_2" 1000289 3 2 1  1 .
              "1000289_3" "1000289_3_3" 1000289 3 3 1  1 .
              "1000289_3" "1000289_3_4" 1000289 3 4 1  1 .
              "1000370_1" "1000370_1_1" 1000370 1 1 1 13 1
              "1000370_1" "1000370_1_2" 1000370 1 2 1 13 1
              "1000370_1" "1000370_1_3" 1000370 1 3 1 13 1
              "1000370_1" "1000370_1_4" 1000370 1 4 1 13 1
              "1000370_2" "1000370_2_1" 1000370 2 1 1 13 1
              "1000370_2" "1000370_2_2" 1000370 2 2 1 13 1
              "1000457_1" "1000457_1_1" 1000457 1 1 1  4 .
              "1000457_1" "1000457_1_2" 1000457 1 2 1  4 .
              "1000457_2" "1000457_2_1" 1000457 2 1 1  1 .
              "1000457_2" "1000457_2_2" 1000457 2 2 1  1 .
              "1000457_4" "1000457_4_1" 1000457 4 1 1  4 .
              "1000457_4" "1000457_4_1" 1000457 4 1 2  6 .
              "1000457_4" "1000457_4_2" 1000457 4 2 1  6 .
              "1000457_4" "1000457_4_2" 1000457 4 2 2  4 .
              "1000660_1" "1000660_1_1" 1000660 1 1 2  4 .
              "1000660_1" "1000660_1_1" 1000660 1 1 1  4 .
              "1000660_1" "1000660_1_2" 1000660 1 2 2  4 .
              "1000660_1" "1000660_1_2" 1000660 1 2 1  4 .
              "1001299_1" "1001299_1_1" 1001299 1 1 1  4 .
              "1001299_1" "1001299_1_2" 1001299 1 2 1  4 .
              "1001530_1" "1001530_1_1" 1001530 1 1 1  6 .
              "1001530_1" "1001530_1_2" 1001530 1 2 1  6 .
              "1001530_1" "1001530_1_3" 1001530 1 3 1  1 .
              "1001530_1" "1001530_1_4" 1001530 1 4 1  1 .
              "1001728_1" "1001728_1_1" 1001728 1 1 1  1 .
              "1001728_1" "1001728_1_2" 1001728 1 2 1  1 .
              "1001728_2" "1001728_2_1" 1001728 2 1 1  1 .
              "1001728_2" "1001728_2_2" 1001728 2 2 1  1 .
              "1001942_1" "1001942_1_1" 1001942 1 1 1  1 .
              "1001942_1" "1001942_1_2" 1001942 1 2 1  1 .
              "1001942_2" "1001942_2_1" 1001942 2 1 1  1 .
              "1001942_2" "1001942_2_2" 1001942 2 2 1  1 .
              "1002132_1" "1002132_1_1" 1002132 1 1 1  4 .
              "1002132_1" "1002132_1_1" 1002132 1 1 2  4 .
              "1002132_1" "1002132_1_2" 1002132 1 2 2  4 .
              "1002132_1" "1002132_1_2" 1002132 1 2 1  4 .
              "1002132_2" "1002132_2_1" 1002132 2 1 3 12 .
              "1002132_2" "1002132_2_1" 1002132 2 1 1  4 .
              "1002132_2" "1002132_2_1" 1002132 2 1 2  4 .
              "1002132_2" "1002132_2_2" 1002132 2 2 3  4 .
              "1002132_2" "1002132_2_2" 1002132 2 2 2  4 .
              "1002132_2" "1002132_2_2" 1002132 2 2 1 12 .
              "1002179_1" "1002179_1_1" 1002179 1 1 1  1 .
              "1002179_1" "1002179_1_2" 1002179 1 2 1  1 .
              "1002291_1" "1002291_1_1" 1002291 1 1 1  1 .
              "1002291_1" "1002291_1_2" 1002291 1 2 1  1 .
              "1002291_1" "1002291_1_3" 1002291 1 3 1  1 .
              "1002291_1" "1002291_1_4" 1002291 1 4 1  1 .
              "1002291_2" "1002291_2_1" 1002291 2 1 1  1 .
              "1002291_2" "1002291_2_2" 1002291 2 2 1  1 .
              "1002315_1" "1002315_1_1" 1002315 1 1 1  1 .
              "1002315_1" "1002315_1_2" 1002315 1 2 1  1 .
              "1002400_1" "1002400_1_1" 1002400 1 1 1  1 .
              "1002400_1" "1002400_1_2" 1002400 1 2 1  1 .
              "1002400_1" "1002400_1_3" 1002400 1 3 1  1 .
              "1002400_1" "1002400_1_4" 1002400 1 4 1  1 .
              "1002400_1" "1002400_1_5" 1002400 1 5 1  1 .
              "1002400_1" "1002400_1_6" 1002400 1 6 1  1 .
              "1002400_2" "1002400_2_1" 1002400 2 1 1  1 .
              "1002400_2" "1002400_2_2" 1002400 2 2 1  1 .
              "1002400_2" "1002400_2_3" 1002400 2 3 1  1 .
              "1002400_2" "1002400_2_4" 1002400 2 4 1  1 .
              "1002400_3" "1002400_3_1" 1002400 3 1 2  4 .
              "1002400_3" "1002400_3_1" 1002400 3 1 1  4 .
              "1002400_3" "1002400_3_2" 1002400 3 2 1  4 .
              "1002400_3" "1002400_3_2" 1002400 3 2 2  4 .
              "1002741_1" "1002741_1_1" 1002741 1 1 1 13 1
              "1002741_2" "1002741_2_1" 1002741 2 1 1  1 .
              "1002741_3" "1002741_3_1" 1002741 3 1 1  2 .
              "1002741_3" "1002741_3_2" 1002741 3 2 1  2 .
              "1002741_4" "1002741_4_1" 1002741 4 1 1  4 .
              "1002741_4" "1002741_4_2" 1002741 4 2 1  4 .
              "1002741_5" "1002741_5_1" 1002741 5 1 1  4 .
              "1002741_5" "1002741_5_1" 1002741 5 1 2  4 .
              "1002741_5" "1002741_5_2" 1002741 5 2 2  4 .
              "1002741_5" "1002741_5_2" 1002741 5 2 1  4 .
              "1002741_6" "1002741_6_1" 1002741 6 1 1 14 .
              "1002809_1" "1002809_1_1" 1002809 1 1 1 13 2
              "1002809_1" "1002809_1_2" 1002809 1 2 1 13 5
              "1002809_1" "1002809_1_3" 1002809 1 3 1 13 5
              "1002809_2" "1002809_2_1" 1002809 2 1 1 13 6
              "1002809_2" "1002809_2_2" 1002809 2 2 1 13 2
              "1002809_2" "1002809_2_3" 1002809 2 3 1 13 5
              "1002956_2" "1002956_2_1" 1002956 2 1 2  4 .
              end
              the "using dataset" details each stage of the journey undertaken by individual i. This is reflected in the variable ID_ETAPA.

              When I merge the two, I get more data than I had before, in some cases repeated several times.

              Code:
              . // merge EESTACIONA variable. EESTACIONA tells us where people parked. 
              . merge m:m viaje_ID using ${data}/Microdata/etapas_edm_2018_parking_data.dta,  keep(3) keepusing(EESTACIONA workplace_parking free_parking ser_tariffs public_tariffs) 
              
                  Result                      Number of obs
                  -----------------------------------------
                  Not matched                             0
                  Matched                           257,138  (_merge==3)
                  -----------------------------------------

              The master dataset has originally (before merging):

              Code:
              . count
                222,744
              Could you help me please?
              There is a problem in my data matching stage.

              I apologize for the confusion.
              Best wishes,

              Michael
              Last edited by Michael Duarte Goncalves; 23 Oct 2023, 08:46.

              Comment


              • #8
                It is easy to identify the problem here. You committed what is, in Stata, the equivalent of a mortal sin. You used -merge m:m-, which produces data salad in almost every situation. Even the official StataCorp documentation for the -merge- command says you should pretty much never use this.

                The harder question is discovering why you were lead to do that and what you should do instead. The variables common to both master and using are ID_HOGAR, ID_IND, ID_VIAJE, individ_ID and viaje_ID. The variables individ_ID and viaje_ID are redundant variables that just contain the "tails" of ID_IND and ID_VIAJE, respectively, so we don't have to think about them. In the using dataset, the combination of ID_HOGAR, ID_IND, ID_VIAJE, and ID_ETAPA uniquely identify observations. But if you look at just ID_HOGAR, ID_IND, ID_VIAJE, they do not uniquely identify observations.

                So, in order to do this -merge- one needs to have ID_HOGAR, ID_IND, and ID_VIAJE uniquely identifying observations in the master data set. Alas, they do not. There are numerous instances of duplicates, and some triplicate observations having the same values for all three of these variables. Now, interestingly, it turns out that these observations, at least in the example data, are all exact duplicates on every variable. That is, they are complete, exact copies of each other. The surplus observations contain no additional information and do not belong in the data set. So the underlying problem is with the master data set.

                Now, it is easy enough to get rid of the surplus observations with -duplicates drop-. But I don't recommend doing that, at least not yet. The presence of these in the data probably indicates an error in the data management that led up to the creation of that data set. And where there is one error found, others may still lurk undetected. So I recommend you do a complete review of the work that led to the creation of the master data set to find out where and how those duplicate observations crept in, and then fix the code to remove that and any other errors you find along the way. (If you are truly unable to find or fix the source of the duplicate observations, then use -duplicates drop- before you attempt the -merge-. But it is usually possible to find the source, and that is the safer approach.) Once you have done that, the correct way to combine these data sets will be:

                Code:
                use master_data, clear // AFTER YOU HAVE GOTTEN RID OF THE DUPLICATES
                merge 1:m ID_HOGAR ID_IND ID_VIAJE using using_data
                And remember, if you are tempted to use -merge m:m- it almost always means one of the following:
                1. The data are wrong.
                2. The data are OK but you don't understand them properly.
                3. You need to use -joinby-, not -merge-

                Comment


                • #9
                  Hi Clyde Schechter:

                  Thank you for your excellent explanations in #8.

                  I'm afraid I've made a few mistakes with -merge m:m-, as I've used it quite a lot.

                  So I will open a new post about it, to see if anyone can help me...

                  Thanks again for your explanations.
                  Have a nice day,

                  Michael

                  Comment

                  Working...
                  X