"Collapse" command produces duplicate observation

Mark Flynn

Join Date: Mar 2024
Posts: 12

"Collapse" command produces duplicate observation

31 Jul 2024, 13:08

Hi,

I am having a little trouble with the collapse command.

I am using panel data for countries between 2019-2023 and wish to convert this to cross sectional by generating the mean of all variables by country.

I used the following command:

collapse (mean) oneyrs-id2, by(country)

This worked successfully in removing the observations over time and replacing the observations with the mean over time for all of my countries bar Luxembourg in which there are two observations for this country.

country	oneyrs	twoyrs	threeyrs
aus	246201.2	278573.8	304683.6
bel	428949.6	457741	478359.6
cro	32479.667	42146.333	46478.667
den	182680	191480.6	195460.8
ger	2828315.4	2980678.2	3167158.6
ita	1713787.8	1835984.2	1887771.6
lux	45127.75	46993	46686.75
lux	49898	49124	44444
pol	460628.2	466367.6	474317
spa	1005823.6	1043116.8	1076492.6

The table is a refined version of the codebook which I'm left with.

I was wondering if anyone could help explain why Luxembourg appears twice?

I've looked over the original data and was unable to see anything unusual compared to the other observations.

Many thanks in advance

Tags: None

Mark Flynn

Join Date: Mar 2024

Posts: 12
#2

31 Jul 2024, 13:14

I am using stata 18
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#3

31 Jul 2024, 13:25

Without seeing the original data set (or a subset of it that reproduces the same behavior) I'd be astonished if anybody would be able to help you on this. Please post back with that, using the -dataex- command to show it.

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#4

31 Jul 2024, 14:05

Mark Flynn, would you please send the original dataset and the command you used to tech support [email protected] so we can take a look. Thanks.
Comment
Mark Flynn

Join Date: Mar 2024

Posts: 12
#5

31 Jul 2024, 14:09

My apologies, I'll upload/send the original data
Comment
Mark Flynn

Join Date: Mar 2024

Posts: 12
#6

31 Jul 2024, 14:22

The Original (Trimmed) Data:

country year oneyrs twoyrs threeyrs
"aus" 2019 296876 320988 311168
"aus" 2020 285606 308882 332294
"aus" 2021 224492 300687 321548
"aus" 2022 222231 235790 316256
"aus" 2023 201801 226522 242152
"bel" 2019 508589 496766 480378
"bel" 2020 501259 503310 491963
"bel" 2021 405492 500390 501854
"bel" 2022 371754 410607 507790
"bel" 2023 357654 377632 409813
"fra" 2019 2185495 2182044 2091391
"fra" 2020 2341000 2317000 2257000
"fra" 2021 2356449 2332290 2271894
"fra" 2022 1623759 1687573 2287993
"fra" 2023 1470404 1642163 1693700
"ger" 2019 3151097 3221432 3176396
"ger" 2020 3285437 3210512 3261026
"ger" 2021 2738009 3394682 3281825
"ger" 2022 2476735 2724713 3450993
"ger" 2023 2490299 2352052 2665553
"lux" 2020 51833 49209 46109
"lux" 2021 44104 52227 47130
"lux" 2022 43608 44473 50526
"lux" 2023 40966 42063 42982
"lux " 2019 49898 49124 44444
"pol" 2019 507340 465915 422735
"pol" 2020 531468 505680 482966
"pol" 2021 413245 528727 519323
"pol" 2022 436514 404869 538098
"pol" 2023 414574 426647 408463

I notice here that despite sorting the data by country then year Luxembourg is out of order, would this be the source of the issue?
Comment

Mark Flynn

Join Date: Mar 2024
Posts: 12

31 Jul 2024, 14:27

After using the following command:

Code:

collapse (mean) oneyrs-threeyrs, by(country)

my data now looks like this

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 country double(oneyrs twoyrs threeyrs)
"aus"            246201.2           278573.8           304683.6
"bel"            428949.6             457741           478359.6
"fra"           1995421.4            2032214          2120395.6
"ger"           2828315.4          2980678.2          3167158.6
"lux"            45127.75              46993           46686.75
"lux "              49898              49124              44444
"pol"            460628.2           466367.6             474317

end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#8

31 Jul 2024, 14:28

Notice that "lux" is not the same as "lux " (N.B. final blank space). That is why 2019 is out of sequence and why -collapse- is producing two separate results that look like "lux" to human eyes, but which are, in fact, "lux" and "lux " to Stata.

String variables can be treacherous that way. When I work with them, I usually, as part of routine data cleaning, strip leading, trailing, and surplus interior blanks to avoid this kind of problem.

Code:

ds, has(type string) local str_vars `r(varlist)' foreach s of local str_vars { replace `v' = trim(itrim(`v')) }

I will do this for any string variable unless the presence of additional blanks is not accidental but actually has been used to designate a separate entity (which is both unusual and a terrible data practice).
1 like
Comment
Mark Flynn

Join Date: Mar 2024

Posts: 12
#9

31 Jul 2024, 14:29

The issue I encounter is that Luxembourg appears with two observations. The command has correctly performed for all other countries bar Luxembourg
Comment
Mark Flynn

Join Date: Mar 2024

Posts: 12
#10

31 Jul 2024, 14:32

Thank you Clyde, I appreciate the help - Problem solved!
Comment

Announcement

"Collapse" command produces duplicate observation

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment