Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • "Collapse" command produces duplicate observation

    Hi,

    I am having a little trouble with the collapse command.

    I am using panel data for countries between 2019-2023 and wish to convert this to cross sectional by generating the mean of all variables by country.

    I used the following command:

    collapse (mean) oneyrs-id2, by(country)

    This worked successfully in removing the observations over time and replacing the observations with the mean over time for all of my countries bar Luxembourg in which there are two observations for this country.
    country oneyrs twoyrs threeyrs
    aus 246201.2 278573.8 304683.6
    bel 428949.6 457741 478359.6
    cro 32479.667 42146.333 46478.667
    den 182680 191480.6 195460.8
    ger 2828315.4 2980678.2 3167158.6
    ita 1713787.8 1835984.2 1887771.6
    lux 45127.75 46993 46686.75
    lux 49898 49124 44444
    pol 460628.2 466367.6 474317
    spa 1005823.6 1043116.8 1076492.6
    The table is a refined version of the codebook which I'm left with.

    I was wondering if anyone could help explain why Luxembourg appears twice?

    I've looked over the original data and was unable to see anything unusual compared to the other observations.

    Many thanks in advance


  • #2
    I am using stata 18

    Comment


    • #3
      Without seeing the original data set (or a subset of it that reproduces the same behavior) I'd be astonished if anybody would be able to help you on this. Please post back with that, using the -dataex- command to show it.

      If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      Comment


      • #4
        Mark Flynn, would you please send the original dataset and the command you used to tech support [email protected] so we can take a look. Thanks.

        Comment


        • #5
          My apologies, I'll upload/send the original data

          Comment


          • #6
            The Original (Trimmed) Data:

            country year oneyrs twoyrs threeyrs
            "aus" 2019 296876 320988 311168
            "aus" 2020 285606 308882 332294
            "aus" 2021 224492 300687 321548
            "aus" 2022 222231 235790 316256
            "aus" 2023 201801 226522 242152
            "bel" 2019 508589 496766 480378
            "bel" 2020 501259 503310 491963
            "bel" 2021 405492 500390 501854
            "bel" 2022 371754 410607 507790
            "bel" 2023 357654 377632 409813
            "fra" 2019 2185495 2182044 2091391
            "fra" 2020 2341000 2317000 2257000
            "fra" 2021 2356449 2332290 2271894
            "fra" 2022 1623759 1687573 2287993
            "fra" 2023 1470404 1642163 1693700
            "ger" 2019 3151097 3221432 3176396
            "ger" 2020 3285437 3210512 3261026
            "ger" 2021 2738009 3394682 3281825
            "ger" 2022 2476735 2724713 3450993
            "ger" 2023 2490299 2352052 2665553
            "lux" 2020 51833 49209 46109
            "lux" 2021 44104 52227 47130
            "lux" 2022 43608 44473 50526
            "lux" 2023 40966 42063 42982
            "lux " 2019 49898 49124 44444
            "pol" 2019 507340 465915 422735
            "pol" 2020 531468 505680 482966
            "pol" 2021 413245 528727 519323
            "pol" 2022 436514 404869 538098
            "pol" 2023 414574 426647 408463

            I notice here that despite sorting the data by country then year Luxembourg is out of order, would this be the source of the issue?


            Comment


            • #7
              After using the following command:

              Code:
              collapse (mean) oneyrs-threeyrs, by(country)
              my data now looks like this

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str4 country double(oneyrs twoyrs threeyrs)
              "aus"            246201.2           278573.8           304683.6
              "bel"            428949.6             457741           478359.6
              "fra"           1995421.4            2032214          2120395.6
              "ger"           2828315.4          2980678.2          3167158.6
              "lux"            45127.75              46993           46686.75
              "lux "              49898              49124              44444
              "pol"            460628.2           466367.6             474317
              
              end

              Comment


              • #8
                Notice that "lux" is not the same as "lux " (N.B. final blank space). That is why 2019 is out of sequence and why -collapse- is producing two separate results that look like "lux" to human eyes, but which are, in fact, "lux" and "lux " to Stata.

                String variables can be treacherous that way. When I work with them, I usually, as part of routine data cleaning, strip leading, trailing, and surplus interior blanks to avoid this kind of problem.
                Code:
                ds, has(type string)
                local str_vars `r(varlist)'
                
                foreach s of local str_vars {
                    replace `v' = trim(itrim(`v'))
                }
                I will do this for any string variable unless the presence of additional blanks is not accidental but actually has been used to designate a separate entity (which is both unusual and a terrible data practice).

                Comment


                • #9
                  The issue I encounter is that Luxembourg appears with two observations. The command has correctly performed for all other countries bar Luxembourg

                  Comment


                  • #10
                    Thank you Clyde, I appreciate the help - Problem solved!

                    Comment

                    Working...
                    X