Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fill the . values of duplicates based on two variables

    Dear Stata users,

    I am using Stata 17.

    Based on two variables
    Code:
    date & urbanname
    I have duplicates. Each duplicate (based on these variables) has values in other variables whereas the other duplicate has ".".

    I would like to create one row with all the values of each duplicate (to keep all of the information of each duplicate) and drop the duplicates.

    How shall I proceed?

    My dataset is of the following form

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(mainid date) strL urbanname float(pm25 date_start)
    25 714 "Hanoi"      9.666667 617
    25 708 "Hanoi"          13.8 617
    25 713 "Hanoi"     14.615385 617
    25 737 "Hanoi"      20.48387 617
    25 738 "Hanoi"          23.6 617
    25 739 "Hanoi"      23.62963 617
    25 743 "Hanoi"          25.7 617
    25 720 "Hanoi"      27.16129 617
    25 690 "Hanoi"      29.82143 617
    25 725 "Hanoi"     30.096775 617
    25 726 "Hanoi"          30.3 617
    27 726 "Xiamen"         30.6 695
    25 750 "Hanoi"     30.857143 617
    25 731 "Hanoi"     32.366665 617
    25 732 "Hanoi"     33.533333 617
    25 730 "Hanoi"     38.032257 617
    25 697 "Hanoi"          38.7 617
    25 727 "Hanoi"            39 617
    27 750 "Xiamen"     40.30435 695
    25 740 "Hanoi"      40.64516 617
    27 725 "Xiamen"     40.74194 695
    40 720 "Hyderabad"  40.85185 694
    40 725 "Hyderabad"        41 694
    25 721 "Hanoi"      41.13334 617
    25 689 "Hanoi"         41.36 617
    27 738 "Xiamen"         43.4 695
    25 733 "Hanoi"      44.55172 617
    25 694 "Hanoi"      44.56522 617
    34 737 "Guiyang"    45.06452 695
    27 737 "Xiamen"     45.29032 695
    35 750 "Dongguan"   46.43478 676
    27 742 "Xiamen"     46.45161 695
    27 720 "Xiamen"     46.51613 695
    36 737 "Changzhou"  46.54839 716
    25 693 "Hanoi"            47 617
    34 743 "Guiyang"        47.1 695
    43 737 "Hohhot"     47.55914 719
    27 732 "Xiamen"     47.66667 695
    25 742 "Hanoi"      47.83871 617
    25 679 "Hanoi"         47.92 617
    27 747 "Xiamen"     47.96429 695
    29 720 "Urumqi"     48.16129 705
    43 708 "Hohhot"     48.82796 719
    42 722 "Jarkarta"   49.06452 711
    40 726 "Hyderabad"  49.33333 694
    27 743 "Xiamen"     49.46667 695
    40 737 "Hyderabad"  49.90322 694
    27 739 "Xiamen"     50.16129 695
    25 676 "Hanoi"         50.25 617
    42 700 "Jarkarta"   50.62827 711
    25 745 "Hanoi"          50.7 617
    25 735 "Hanoi"      51.07143 617
    42 686 "Jarkarta"       51.1 711
    29 750 "Urumqi"     51.30435 705
    29 738 "Urumqi"     51.53333 705
    29 725 "Urumqi"      51.6129 705
    25 751 "Hanoi"      51.76923 617
    35 725 "Dongguan"   51.87097 676
    34 738 "Guiyang"          52 695
    25 695 "Hanoi"      52.05556 617
    25 729 "Hanoi"      52.53333 617
    35 726 "Dongguan"   52.56667 676
    43 720 "Hohhot"     52.75269 719
    43 731 "Hohhot"     52.82222 719
    43 750 "Hohhot"     52.86956 719
    39 732 "Fuzhou"     53.04762 676
    29 737 "Urumqi"     53.25806 705
    43 738 "Hohhot"     53.83333 719
    43 751 "Hohhot"     53.90322 719
    40 743 "Hyderabad"      54.1 694
    34 732 "Guiyang"    54.52381 695
    27 751 "Xiamen"     54.67742 695
    39 742 "Fuzhou"     54.80645 676
    29 714 "Urumqi"     54.93333 705
    27 727 "Xiamen"           55 695
    33 737 "Hefei"            55 683
    43 743 "Hohhot"     55.13334 719
    34 750 "Guiyang"    55.21739 695
    40 713 "Hyderabad"  55.35484 694
    40 708 "Hyderabad"        56 694
    34 742 "Guiyang"    56.06452 695
    34 751 "Guiyang"    56.06452 695
    43 707 "Hohhot"     56.15556 719
    39 737 "Fuzhou"     56.16129 676
    25 684 "Hanoi"      56.22727 617
    43 672 "Hohhot"     56.25806 719
    39 750 "Fuzhou"      56.3913 676
    27 713 "Xiamen"     56.54839 695
    40 731 "Hyderabad"  56.69231 694
    27 708 "Xiamen"     56.77419 695
    27 722 "Xiamen"           57 695
    25 678 "Hanoi"          57.5 617
    35 739 "Dongguan"   57.54839 676
    25 728 "Hanoi"      57.58621 617
    42 730 "Jarkarta"     58.125 711
    43 714 "Hohhot"         58.4 719
    27 730 "Xiamen"     58.51613 695
    43 696 "Hohhot"     58.98925 719
    39 743 "Fuzhou"     59.03333 676
    34 739 "Guiyang"    59.19355 695
    end
    format %tm date
    format %tm date_start
    Thank you for your help!
    Last edited by Benoit Decoco; 10 Jul 2022, 16:44.

  • #2
    I'm so sorry, I stopped understanding at
    has values in other variables whereas the other duplicate has ".".

    I would like to create one row with all the values of each duplicate (to keep all of the information of each duplicate) and drop the duplicates.
    Could you maybe provide an example (let's say with Hannoi or another city) of what exactly you'd like your dataset to ultimately look like?

    Comment


    • #3
      Dear Jared,

      Please forgive me, as I am not an experienced Stata user / programmer and my native language is not english, I may have difficulties explaining clearly what I wish to do.

      For example if I have the following two observations in my dataset
      Code:
      date urbanname pm25  co
      2019m1 Hannoi  3.309 .
      2019m1 Hannoi  .    4.3938
      I'd like to know the commands I should use to obtain the following observation and drop the two precedent observations
      Code:
      date urbanname pm25  co
      2019m1 Hannoi 3.309 4.3938
      Thank you !
      Last edited by Benoit Decoco; 10 Jul 2022, 17:33.

      Comment


      • #4
        That's quite alright.

        Let's stick with the real example you gave above. I don't see co in your dataset; give the example that that fully illustrates the problem, please. In fact, just to make this super easy, show me the results of
        Code:
        dataex if inlist(mainid,25,36,34,39)
        if this example adequately describes the problem in question.

        Comment


        • #5
          Originally posted by Benoit Decoco View Post
          Dear Jared,

          Please forgive me, as I am not an experienced Stata user / programmer and my native language is not english, I may have difficulties explaining clearly what I wish to do.

          For example if I have the following two observations in my dataset
          Code:
          date urbanname pm25 co
          2019m1 Hannoi 3.309 .
          2019m1 Hannoi . 4.3938
          I'd like to know the commands I should use to obtain the following observation and drop the two precedent observations
          Code:
          date urbanname pm25 co
          2019m1 Hannoi 3.309 4.3938
          Thank you !
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str6(date urbanname) float(pm25 co1)
          "2019m1" "Hannoi" 3.309 .  
          "2019m1" "Hannoi" . 4.3938
          end
          
          collapse (firstnm) pm25 co1, by(date urbanname)
          Res.:

          Code:
          . l
          
               +------------------------------------+
               |   date   urbann~e    pm25      co1 |
               |------------------------------------|
            1. | 2019m1     Hannoi   3.309   4.3938 |
               +------------------------------------+

          Comment


          • #6
            I have exactly 167 variables in the dataset, that is why I could not put every variable in the dataex. I managed to order the dataset in a way such that you can see what I am speaking about with a dataex

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long urbancode float date strL urbanname float(pm25 slope_25km)
            21275   666 "Hyderabad"            . 1.2345366
             1015   688 "Hyderabad"     165.3226         .
            22041   670 "Hyderabad"            .  .7591518
             1015   684 "Hyderabad"     90.96774         .
            21275   689 "Hyderabad"            . 1.2345366
             1015   667 "Hyderabad"    130.17241         .
            21275   694 "Hyderabad"            . 1.2345366
            22041   682 "Hyderabad"            .  .7591518
             1015   681 "Hyderabad"        169.5         .
            21275   683 "Hyderabad"            . 1.2345366
             1015   660 "Hyderabad"     94.53333         .
            21275   662 "Hyderabad"            . 1.2345366
            21275   665 "Hyderabad"            . 1.2345366
             1015   671 "Hyderabad"    114.65517         .
            22041   695 "Hyderabad"            .  .7591518
            21275   687 "Hyderabad"            . 1.2345366
            21275   690 "Hyderabad"            . 1.2345366
             1015   674 "Hyderabad"     171.4516         .
            22041   675 "Hyderabad"            .  .7591518
            21275   673 "Hyderabad"            . 1.2345366
            21275   685 "Hyderabad"            . 1.2345366
            21275   676 "Hyderabad"            . 1.2345366
            21275   692 "Hyderabad"            . 1.2345366
             1015   663 "Hyderabad"    151.10715         .
            21275   650 "Hyderabad"            . 1.2345366
            22041   672 "Hyderabad"            .  .7591518
             1015   679 "Hyderabad"     109.2258         .
             1015   691 "Hyderabad"    126.86667         .
             1015   669 "Hyderabad"    150.07408         .
            21275   677 "Hyderabad"            . 1.2345366
            21275   680 "Hyderabad"            . 1.2345366
             1015   661 "Hyderabad"    126.88461         .
            22041   686 "Hyderabad"            .  .7591518
            21275   678 "Hyderabad"            . 1.2345366
             1015   668 "Hyderabad"    141.93333         .
             1015   664 "Hyderabad"    149.24138         .
            22041   180 "Hyderabad"            .  .7591518
            22936 -1102 "Birmingham"           . 3.7266955
            23233  -408 "Valencia"             .  8.073236
            22936  -255 "Birmingham"           . 3.7266955
            22567   648 "Valencia"             .  2.546318
            22041  -339 "Hyderabad"            .  .7591518
            21275   630 "Hyderabad"            . 1.2345366
            22041  -835 "Hyderabad"            .  .7591518
            23233   171 "Valencia"             .  8.073236
            22041   -18 "Hyderabad"            .  .7591518
            21275  -413 "Hyderabad"            . 1.2345366
            22567   232 "Valencia"             .  2.546318
            22041  -652 "Hyderabad"            .  .7591518
            22041  -579 "Hyderabad"            .  .7591518
            23233  -356 "Valencia"             .  8.073236
            22936  -150 "Birmingham"           . 3.7266955
            23233  -521 "Valencia"             .  8.073236
            21275  -388 "Hyderabad"            . 1.2345366
            21275   478 "Hyderabad"            . 1.2345366
            23233   457 "Valencia"             .  8.073236
            22835  -224 "Birmingham"           . 1.8021827
            22567    85 "Valencia"             .  2.546318
            23233  -633 "Valencia"             .  8.073236
            22936   -27 "Birmingham"           . 3.7266955
            21275   524 "Hyderabad"            . 1.2345366
            22835  -117 "Birmingham"           . 1.8021827
            21275  -264 "Hyderabad"            . 1.2345366
            21275 -1000 "Hyderabad"            . 1.2345366
            22835  -686 "Birmingham"           . 1.8021827
            21275    66 "Hyderabad"            . 1.2345366
            22835  -599 "Birmingham"           . 1.8021827
            21275 -1070 "Hyderabad"            . 1.2345366
            22567  -703 "Valencia"             .  2.546318
            22936   284 "Birmingham"           . 3.7266955
            22936   179 "Birmingham"           . 3.7266955
            21275   148 "Hyderabad"            . 1.2345366
            22041  -970 "Hyderabad"            .  .7591518
            22041 -1017 "Hyderabad"            .  .7591518
            22936  -433 "Birmingham"           . 3.7266955
            22835   622 "Birmingham"           . 1.8021827
            20519   690 "Guiyang"              .  6.462536
            23233   642 "Valencia"             .  8.073236
            22041  -649 "Hyderabad"            .  .7591518
            23233    91 "Valencia"             .  8.073236
            22835    44 "Birmingham"           . 1.8021827
            22567  -906 "Valencia"             .  2.546318
            22936  -102 "Birmingham"           . 3.7266955
            22936  -424 "Birmingham"           . 3.7266955
            22041  -333 "Hyderabad"            .  .7591518
            22835  -428 "Birmingham"           . 1.8021827
            23233  -916 "Valencia"             .  8.073236
            23233  -668 "Valencia"             .  8.073236
            23233  -718 "Valencia"             .  8.073236
            22567   -91 "Valencia"             .  2.546318
            22567    51 "Valencia"             .  2.546318
            20669   656 "Shijiazhuang"         . 1.7961366
            22936  -375 "Birmingham"           . 3.7266955
            20669   670 "Shijiazhuang"         . 1.7961366
            21275  -231 "Hyderabad"            . 1.2345366
            22835    43 "Birmingham"           . 1.8021827
            23233    -9 "Valencia"             .  8.073236
            22936  -945 "Birmingham"           . 3.7266955
            21275  -829 "Hyderabad"            . 1.2345366
            22936  -595 "Birmingham"           . 3.7266955
            end
            format %tm date
            You see that some of the rows have the same date and the same urbanname but some of them have a stored value in pm2.5 and don't have a stored value in slope_25km while some others have the exact opposite.
            I'd like to have one unique row with both of pm25 and slope_25km columns filled with a value.

            Comment


            • #7
              Originally posted by Andrew Musau View Post

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str6(date urbanname) float(pm25 co1)
              "2019m1" "Hannoi" 3.309 .
              "2019m1" "Hannoi" . 4.3938
              end
              
              collapse (firstnm) pm25 co1, by(date urbanname)
              Res.:

              Code:
              . l
              
              +------------------------------------+
              | date urbann~e pm25 co1 |
              |------------------------------------|
              1. | 2019m1 Hannoi 3.309 4.3938 |
              +------------------------------------+
              Dear Andrew,

              This would perfectly work if I only had these variables, forgive me I did not properly describe my dataset.

              Comment


              • #8
                In your data example, there are no duplicates of the variables date and urbanname, so it is not clear what you want.

                Code:
                isid date urbanname

                Comment


                • #9
                  Andrew Musau the collapse firstnm syntax was what I was trying to lead up to, but as you note it wasn't in the original data example.

                  Comment


                  • #10
                    The command
                    Code:
                    isid date urbnanme
                    returns

                    Code:
                    variables date and urbanname do not uniquely identify the observations
                    r(459);
                    This is an example for one city.
                    See the duplicates 695 for instance.

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
                    "Wenzhou" 703 114.06451  8.064516         .
                    "Wenzhou" 702 102.83334       6.6         .
                    "Wenzhou" 701   90.3871  5.935484         .
                    "Wenzhou" 700 130.74193  8.387096         .
                    "Wenzhou" 699 121.10714  6.678571         .
                    "Wenzhou" 698  96.40741  7.111111         .
                    "Wenzhou" 697 130.96666       7.5         .
                    "Wenzhou" 696  83.25806  6.709677         .
                    "Wenzhou" 695         .         .     .4394
                    "Wenzhou" 695  96.42308  6.576923         .
                    "Wenzhou" 694  95.06451  6.516129         .
                    "Wenzhou" 694         .         .  .3622546
                    "Wenzhou" 693         .         .  .4417194
                    "Wenzhou" 693     116.1  7.166667         .
                    "Wenzhou" 692  147.6129  9.645162         .
                    "Wenzhou" 692         .         .   .796245
                    "Wenzhou" 691         .         .  .7701847
                    "Wenzhou" 691 112.96774   9.16129         .
                    "Wenzhou" 690         .         .  .8079634
                    "Wenzhou" 690     119.1       9.9         .
                    "Wenzhou" 689         .         . .58283335
                    "Wenzhou" 689 100.25806  7.290323         .
                    "Wenzhou" 688 125.76667 10.354838         .
                    "Wenzhou" 688         .         .  .6874976
                    "Wenzhou" 687 136.46428  8.535714         .
                    "Wenzhou" 687         .         .  .6338925
                    "Wenzhou" 686 148.54839  8.193548         .
                    "Wenzhou" 686         .         .  .6966229
                    "Wenzhou" 685 140.76666 12.066667         .
                    "Wenzhou" 685         .         .  .5206547
                    "Wenzhou" 684         .         .  .6713136
                    "Wenzhou" 684  98.96774  6.774194         .
                    "Wenzhou" 683         .         .  .3829562
                    "Wenzhou" 683      99.4  7.066667         .
                    "Wenzhou" 682         .         .  .3242331
                    "Wenzhou" 682  99.45161  8.903226         .
                    "Wenzhou" 681 125.33334 11.933333         .
                    "Wenzhou" 681         .         . .19232053
                    "Wenzhou" 680 141.45161 14.258064         .
                    "Wenzhou" 680         .         .  .7010399
                    "Wenzhou" 679         .         .  .7567334
                    "Wenzhou" 679 132.06451 10.903226         .
                    "Wenzhou" 678         .         .  .7867171
                    "Wenzhou" 678       111       9.5         .
                    "Wenzhou" 677         .         . .57805836
                    "Wenzhou" 677 102.03226  8.451612         .
                    "Wenzhou" 676         .         .  .7786242
                    "Wenzhou" 676 127.78571 13.310345         .
                    "Wenzhou" 675         .         .  .6825334
                    "Wenzhou" 675 128.48276 13.034483         .
                    "Wenzhou" 674 133.87097 11.633333         .
                    "Wenzhou" 674         .         .  .7551135
                    "Wenzhou" 673         .         .  .4213091
                    "Wenzhou" 673       136      12.9         .
                    end
                    format %tm date
                    I'd like to uniquely identify an observation with date and urbanname and fill the missing values of variable of a duplicate by the values contained in the other duplicate. Meaning, for instance with duplicate 695

                    Code:
                    input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
                    "Wenzhou" 695         .         .     .4394
                    "Wenzhou" 695  96.42308  6.576923         .
                    becomes


                    Code:
                    input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
                    "Wenzhou" 695    96.42308         6.576923      .4394

                    Comment


                    • #11
                      I already gave you the code in #5. With many variables, you can

                      Code:
                      ds urbanname date, not
                      collapse (firstnm) `r(varlist)', by(urbanname date)
                      isid urbanname date

                      Comment


                      • #12
                        Thank you it worked perfectly ! Pardon me again for the misunderstanding.

                        Comment

                        Working...
                        X