Fill the . values of duplicates based on two variables

Benoit Decoco

Join Date: Jun 2022
Posts: 17

Fill the . values of duplicates based on two variables

10 Jul 2022, 16:41

Dear Stata users,

I am using Stata 17.

Based on two variables

Code:

date & urbanname

I have duplicates. Each duplicate (based on these variables) has values in other variables whereas the other duplicate has ".".

I would like to create one row with all the values of each duplicate (to keep all of the information of each duplicate) and drop the duplicates.

How shall I proceed?

My dataset is of the following form

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(mainid date) strL urbanname float(pm25 date_start)
25 714 "Hanoi"      9.666667 617
25 708 "Hanoi"          13.8 617
25 713 "Hanoi"     14.615385 617
25 737 "Hanoi"      20.48387 617
25 738 "Hanoi"          23.6 617
25 739 "Hanoi"      23.62963 617
25 743 "Hanoi"          25.7 617
25 720 "Hanoi"      27.16129 617
25 690 "Hanoi"      29.82143 617
25 725 "Hanoi"     30.096775 617
25 726 "Hanoi"          30.3 617
27 726 "Xiamen"         30.6 695
25 750 "Hanoi"     30.857143 617
25 731 "Hanoi"     32.366665 617
25 732 "Hanoi"     33.533333 617
25 730 "Hanoi"     38.032257 617
25 697 "Hanoi"          38.7 617
25 727 "Hanoi"            39 617
27 750 "Xiamen"     40.30435 695
25 740 "Hanoi"      40.64516 617
27 725 "Xiamen"     40.74194 695
40 720 "Hyderabad"  40.85185 694
40 725 "Hyderabad"        41 694
25 721 "Hanoi"      41.13334 617
25 689 "Hanoi"         41.36 617
27 738 "Xiamen"         43.4 695
25 733 "Hanoi"      44.55172 617
25 694 "Hanoi"      44.56522 617
34 737 "Guiyang"    45.06452 695
27 737 "Xiamen"     45.29032 695
35 750 "Dongguan"   46.43478 676
27 742 "Xiamen"     46.45161 695
27 720 "Xiamen"     46.51613 695
36 737 "Changzhou"  46.54839 716
25 693 "Hanoi"            47 617
34 743 "Guiyang"        47.1 695
43 737 "Hohhot"     47.55914 719
27 732 "Xiamen"     47.66667 695
25 742 "Hanoi"      47.83871 617
25 679 "Hanoi"         47.92 617
27 747 "Xiamen"     47.96429 695
29 720 "Urumqi"     48.16129 705
43 708 "Hohhot"     48.82796 719
42 722 "Jarkarta"   49.06452 711
40 726 "Hyderabad"  49.33333 694
27 743 "Xiamen"     49.46667 695
40 737 "Hyderabad"  49.90322 694
27 739 "Xiamen"     50.16129 695
25 676 "Hanoi"         50.25 617
42 700 "Jarkarta"   50.62827 711
25 745 "Hanoi"          50.7 617
25 735 "Hanoi"      51.07143 617
42 686 "Jarkarta"       51.1 711
29 750 "Urumqi"     51.30435 705
29 738 "Urumqi"     51.53333 705
29 725 "Urumqi"      51.6129 705
25 751 "Hanoi"      51.76923 617
35 725 "Dongguan"   51.87097 676
34 738 "Guiyang"          52 695
25 695 "Hanoi"      52.05556 617
25 729 "Hanoi"      52.53333 617
35 726 "Dongguan"   52.56667 676
43 720 "Hohhot"     52.75269 719
43 731 "Hohhot"     52.82222 719
43 750 "Hohhot"     52.86956 719
39 732 "Fuzhou"     53.04762 676
29 737 "Urumqi"     53.25806 705
43 738 "Hohhot"     53.83333 719
43 751 "Hohhot"     53.90322 719
40 743 "Hyderabad"      54.1 694
34 732 "Guiyang"    54.52381 695
27 751 "Xiamen"     54.67742 695
39 742 "Fuzhou"     54.80645 676
29 714 "Urumqi"     54.93333 705
27 727 "Xiamen"           55 695
33 737 "Hefei"            55 683
43 743 "Hohhot"     55.13334 719
34 750 "Guiyang"    55.21739 695
40 713 "Hyderabad"  55.35484 694
40 708 "Hyderabad"        56 694
34 742 "Guiyang"    56.06452 695
34 751 "Guiyang"    56.06452 695
43 707 "Hohhot"     56.15556 719
39 737 "Fuzhou"     56.16129 676
25 684 "Hanoi"      56.22727 617
43 672 "Hohhot"     56.25806 719
39 750 "Fuzhou"      56.3913 676
27 713 "Xiamen"     56.54839 695
40 731 "Hyderabad"  56.69231 694
27 708 "Xiamen"     56.77419 695
27 722 "Xiamen"           57 695
25 678 "Hanoi"          57.5 617
35 739 "Dongguan"   57.54839 676
25 728 "Hanoi"      57.58621 617
42 730 "Jarkarta"     58.125 711
43 714 "Hohhot"         58.4 719
27 730 "Xiamen"     58.51613 695
43 696 "Hohhot"     58.98925 719
39 743 "Fuzhou"     59.03333 676
34 739 "Guiyang"    59.19355 695
end
format %tm date
format %tm date_start

Thank you for your help!

Last edited by Benoit Decoco; 10 Jul 2022, 16:44.

Tags: data, duplicates, rows

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

10 Jul 2022, 17:11

I'm so sorry, I stopped understanding at

has values in other variables whereas the other duplicate has ".".

I would like to create one row with all the values of each duplicate (to keep all of the information of each duplicate) and drop the duplicates.

Could you maybe provide an example (let's say with Hannoi or another city) of what exactly you'd like your dataset to ultimately look like?
Comment
Benoit Decoco

Join Date: Jun 2022

Posts: 17
#3

10 Jul 2022, 17:20

Dear Jared,

Please forgive me, as I am not an experienced Stata user / programmer and my native language is not english, I may have difficulties explaining clearly what I wish to do.

For example if I have the following two observations in my dataset

Code:

date urbanname pm25 co 2019m1 Hannoi 3.309 . 2019m1 Hannoi . 4.3938

I'd like to know the commands I should use to obtain the following observation and drop the two precedent observations

Code:

date urbanname pm25 co 2019m1 Hannoi 3.309 4.3938

Thank you !

Last edited by Benoit Decoco; 10 Jul 2022, 17:33.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#4

10 Jul 2022, 17:49

That's quite alright.

Let's stick with the real example you gave above. I don't see co in your dataset; give the example that that fully illustrates the problem, please. In fact, just to make this super easy, show me the results of

Code:

dataex if inlist(mainid,25,36,34,39)

if this example adequately describes the problem in question.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

10 Jul 2022, 18:06

Originally posted by Benoit Decoco View Post

Dear Jared,

Please forgive me, as I am not an experienced Stata user / programmer and my native language is not english, I may have difficulties explaining clearly what I wish to do.

For example if I have the following two observations in my dataset

Code:

date urbanname pm25 co
2019m1 Hannoi 3.309 .
2019m1 Hannoi . 4.3938

I'd like to know the commands I should use to obtain the following observation and drop the two precedent observations

Code:

date urbanname pm25 co
2019m1 Hannoi 3.309 4.3938

Thank you !

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6(date urbanname) float(pm25 co1)
"2019m1" "Hannoi" 3.309 .  
"2019m1" "Hannoi" . 4.3938
end

collapse (firstnm) pm25 co1, by(date urbanname)

Res.:

Code:

. l

     +------------------------------------+
     |   date   urbann~e    pm25      co1 |
     |------------------------------------|
  1. | 2019m1     Hannoi   3.309   4.3938 |
     +------------------------------------+

Comment

Benoit Decoco

Join Date: Jun 2022
Posts: 17

10 Jul 2022, 18:27

I have exactly 167 variables in the dataset, that is why I could not put every variable in the dataex. I managed to order the dataset in a way such that you can see what I am speaking about with a dataex

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long urbancode float date strL urbanname float(pm25 slope_25km)
21275   666 "Hyderabad"            . 1.2345366
 1015   688 "Hyderabad"     165.3226         .
22041   670 "Hyderabad"            .  .7591518
 1015   684 "Hyderabad"     90.96774         .
21275   689 "Hyderabad"            . 1.2345366
 1015   667 "Hyderabad"    130.17241         .
21275   694 "Hyderabad"            . 1.2345366
22041   682 "Hyderabad"            .  .7591518
 1015   681 "Hyderabad"        169.5         .
21275   683 "Hyderabad"            . 1.2345366
 1015   660 "Hyderabad"     94.53333         .
21275   662 "Hyderabad"            . 1.2345366
21275   665 "Hyderabad"            . 1.2345366
 1015   671 "Hyderabad"    114.65517         .
22041   695 "Hyderabad"            .  .7591518
21275   687 "Hyderabad"            . 1.2345366
21275   690 "Hyderabad"            . 1.2345366
 1015   674 "Hyderabad"     171.4516         .
22041   675 "Hyderabad"            .  .7591518
21275   673 "Hyderabad"            . 1.2345366
21275   685 "Hyderabad"            . 1.2345366
21275   676 "Hyderabad"            . 1.2345366
21275   692 "Hyderabad"            . 1.2345366
 1015   663 "Hyderabad"    151.10715         .
21275   650 "Hyderabad"            . 1.2345366
22041   672 "Hyderabad"            .  .7591518
 1015   679 "Hyderabad"     109.2258         .
 1015   691 "Hyderabad"    126.86667         .
 1015   669 "Hyderabad"    150.07408         .
21275   677 "Hyderabad"            . 1.2345366
21275   680 "Hyderabad"            . 1.2345366
 1015   661 "Hyderabad"    126.88461         .
22041   686 "Hyderabad"            .  .7591518
21275   678 "Hyderabad"            . 1.2345366
 1015   668 "Hyderabad"    141.93333         .
 1015   664 "Hyderabad"    149.24138         .
22041   180 "Hyderabad"            .  .7591518
22936 -1102 "Birmingham"           . 3.7266955
23233  -408 "Valencia"             .  8.073236
22936  -255 "Birmingham"           . 3.7266955
22567   648 "Valencia"             .  2.546318
22041  -339 "Hyderabad"            .  .7591518
21275   630 "Hyderabad"            . 1.2345366
22041  -835 "Hyderabad"            .  .7591518
23233   171 "Valencia"             .  8.073236
22041   -18 "Hyderabad"            .  .7591518
21275  -413 "Hyderabad"            . 1.2345366
22567   232 "Valencia"             .  2.546318
22041  -652 "Hyderabad"            .  .7591518
22041  -579 "Hyderabad"            .  .7591518
23233  -356 "Valencia"             .  8.073236
22936  -150 "Birmingham"           . 3.7266955
23233  -521 "Valencia"             .  8.073236
21275  -388 "Hyderabad"            . 1.2345366
21275   478 "Hyderabad"            . 1.2345366
23233   457 "Valencia"             .  8.073236
22835  -224 "Birmingham"           . 1.8021827
22567    85 "Valencia"             .  2.546318
23233  -633 "Valencia"             .  8.073236
22936   -27 "Birmingham"           . 3.7266955
21275   524 "Hyderabad"            . 1.2345366
22835  -117 "Birmingham"           . 1.8021827
21275  -264 "Hyderabad"            . 1.2345366
21275 -1000 "Hyderabad"            . 1.2345366
22835  -686 "Birmingham"           . 1.8021827
21275    66 "Hyderabad"            . 1.2345366
22835  -599 "Birmingham"           . 1.8021827
21275 -1070 "Hyderabad"            . 1.2345366
22567  -703 "Valencia"             .  2.546318
22936   284 "Birmingham"           . 3.7266955
22936   179 "Birmingham"           . 3.7266955
21275   148 "Hyderabad"            . 1.2345366
22041  -970 "Hyderabad"            .  .7591518
22041 -1017 "Hyderabad"            .  .7591518
22936  -433 "Birmingham"           . 3.7266955
22835   622 "Birmingham"           . 1.8021827
20519   690 "Guiyang"              .  6.462536
23233   642 "Valencia"             .  8.073236
22041  -649 "Hyderabad"            .  .7591518
23233    91 "Valencia"             .  8.073236
22835    44 "Birmingham"           . 1.8021827
22567  -906 "Valencia"             .  2.546318
22936  -102 "Birmingham"           . 3.7266955
22936  -424 "Birmingham"           . 3.7266955
22041  -333 "Hyderabad"            .  .7591518
22835  -428 "Birmingham"           . 1.8021827
23233  -916 "Valencia"             .  8.073236
23233  -668 "Valencia"             .  8.073236
23233  -718 "Valencia"             .  8.073236
22567   -91 "Valencia"             .  2.546318
22567    51 "Valencia"             .  2.546318
20669   656 "Shijiazhuang"         . 1.7961366
22936  -375 "Birmingham"           . 3.7266955
20669   670 "Shijiazhuang"         . 1.7961366
21275  -231 "Hyderabad"            . 1.2345366
22835    43 "Birmingham"           . 1.8021827
23233    -9 "Valencia"             .  8.073236
22936  -945 "Birmingham"           . 3.7266955
21275  -829 "Hyderabad"            . 1.2345366
22936  -595 "Birmingham"           . 3.7266955
end
format %tm date

You see that some of the rows have the same date and the same urbanname but some of them have a stored value in pm2.5 and don't have a stored value in slope_25km while some others have the exact opposite.
I'd like to have one unique row with both of pm25 and slope_25km columns filled with a value.

Comment

Benoit Decoco

Join Date: Jun 2022
Posts: 17

10 Jul 2022, 18:31

Originally posted by Andrew Musau View Post

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6(date urbanname) float(pm25 co1)
"2019m1" "Hannoi" 3.309 .
"2019m1" "Hannoi" . 4.3938
end

collapse (firstnm) pm25 co1, by(date urbanname)

Res.:

Code:

. l

+------------------------------------+
| date urbann~e pm25 co1 |
|------------------------------------|
1. | 2019m1 Hannoi 3.309 4.3938 |
+------------------------------------+

Dear Andrew,

This would perfectly work if I only had these variables, forgive me I did not properly describe my dataset.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10190
#8

10 Jul 2022, 19:00

In your data example, there are no duplicates of the variables date and urbanname, so it is not clear what you want.

Code:

isid date urbanname
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#9

10 Jul 2022, 19:05

Andrew Musau the collapse firstnm syntax was what I was trying to lead up to, but as you note it wasn't in the original data example.
1 like
Comment

Benoit Decoco

Join Date: Jun 2022
Posts: 17

#10

11 Jul 2022, 04:22

The command

Code:

isid date urbnanme

returns

Code:

variables date and urbanname do not uniquely identify the observations
r(459);

This is an example for one city.
See the duplicates 695 for instance.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
"Wenzhou" 703 114.06451  8.064516         .
"Wenzhou" 702 102.83334       6.6         .
"Wenzhou" 701   90.3871  5.935484         .
"Wenzhou" 700 130.74193  8.387096         .
"Wenzhou" 699 121.10714  6.678571         .
"Wenzhou" 698  96.40741  7.111111         .
"Wenzhou" 697 130.96666       7.5         .
"Wenzhou" 696  83.25806  6.709677         .
"Wenzhou" 695         .         .     .4394
"Wenzhou" 695  96.42308  6.576923         .
"Wenzhou" 694  95.06451  6.516129         .
"Wenzhou" 694         .         .  .3622546
"Wenzhou" 693         .         .  .4417194
"Wenzhou" 693     116.1  7.166667         .
"Wenzhou" 692  147.6129  9.645162         .
"Wenzhou" 692         .         .   .796245
"Wenzhou" 691         .         .  .7701847
"Wenzhou" 691 112.96774   9.16129         .
"Wenzhou" 690         .         .  .8079634
"Wenzhou" 690     119.1       9.9         .
"Wenzhou" 689         .         . .58283335
"Wenzhou" 689 100.25806  7.290323         .
"Wenzhou" 688 125.76667 10.354838         .
"Wenzhou" 688         .         .  .6874976
"Wenzhou" 687 136.46428  8.535714         .
"Wenzhou" 687         .         .  .6338925
"Wenzhou" 686 148.54839  8.193548         .
"Wenzhou" 686         .         .  .6966229
"Wenzhou" 685 140.76666 12.066667         .
"Wenzhou" 685         .         .  .5206547
"Wenzhou" 684         .         .  .6713136
"Wenzhou" 684  98.96774  6.774194         .
"Wenzhou" 683         .         .  .3829562
"Wenzhou" 683      99.4  7.066667         .
"Wenzhou" 682         .         .  .3242331
"Wenzhou" 682  99.45161  8.903226         .
"Wenzhou" 681 125.33334 11.933333         .
"Wenzhou" 681         .         . .19232053
"Wenzhou" 680 141.45161 14.258064         .
"Wenzhou" 680         .         .  .7010399
"Wenzhou" 679         .         .  .7567334
"Wenzhou" 679 132.06451 10.903226         .
"Wenzhou" 678         .         .  .7867171
"Wenzhou" 678       111       9.5         .
"Wenzhou" 677         .         . .57805836
"Wenzhou" 677 102.03226  8.451612         .
"Wenzhou" 676         .         .  .7786242
"Wenzhou" 676 127.78571 13.310345         .
"Wenzhou" 675         .         .  .6825334
"Wenzhou" 675 128.48276 13.034483         .
"Wenzhou" 674 133.87097 11.633333         .
"Wenzhou" 674         .         .  .7551135
"Wenzhou" 673         .         .  .4213091
"Wenzhou" 673       136      12.9         .
end
format %tm date

I'd like to uniquely identify an observation with date and urbanname and fill the missing values of variable of a duplicate by the values contained in the other duplicate. Meaning, for instance with duplicate 695

Code:

input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
"Wenzhou" 695         .         .     .4394
"Wenzhou" 695  96.42308  6.576923         .

becomes

Code:

input strL urbanname float(date pm25 co aod3K_mean_Aqua_10km) .
"Wenzhou" 695    96.42308         6.576923      .4394

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10190
#11

11 Jul 2022, 04:31

I already gave you the code in #5. With many variables, you can

Code:

ds urbanname date, not collapse (firstnm) `r(varlist)', by(urbanname date) isid urbanname date
Comment
Benoit Decoco

Join Date: Jun 2022

Posts: 17
#12

11 Jul 2022, 04:49

Thank you it worked perfectly ! Pardon me again for the misunderstanding.
Comment

Announcement

Fill the . values of duplicates based on two variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment