Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing data entries after collapse-command

    Hello members,

    I have tried to aggregate my data to one observation per establishment per year. I used the following command to do this:
    sort betnr year
    collapse (mean) dummy w08_1 div wage_imp, by(betnr year)

    Betnr is the establishment number.

    However, with this command one or two establishments are lost for each year.

    What is the reason for this and how can missing companies be identified?

    Thank you for your support!

  • #2
    Missing categories -- whether coded numerically or as missing strings -- don't disappear on a collapse, as this experiment shows.

    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . gen frog = "frog" in 1/37
    (37 missing values generated)
    
    . collapse mpg, by(frog foreign rep78)
    
    . l
    
         +-----------------------------------+
         | rep78    foreign   frog       mpg |
         |-----------------------------------|
      1. |     1   Domestic               21 |
      2. |     2   Domestic               21 |
      3. |     3   Domestic               21 |
      4. |     4   Domestic               18 |
      5. |     5   Domestic               34 |
         |-----------------------------------|
      6. |     .   Domestic             22.5 |
      7. |     3    Foreign          23.3333 |
      8. |     4    Foreign          24.8889 |
      9. |     5    Foreign          26.3333 |
     10. |     .    Foreign               14 |
         |-----------------------------------|
     11. |     2   Domestic   frog      18.5 |
     12. |     3   Domestic   frog   18.4286 |
     13. |     4   Domestic   frog   18.5714 |
     14. |     5   Domestic   frog        30 |
     15. |     .   Domestic   frog        24 |
         +-----------------------------------+
    
    .
    So, I think we need to see some evidence that "one or two establishments are lost for each year".

    Comment


    • #3
      Hello Nick,
      maybe you remember my post from Saturday, 09 Dec 2023, 04:42, where I asked how to count the number of different values for the variable betnr per year.

      With your help I generated a variable that reflects the number of different values of betnr.

      bysort year (betnr): gen count_betnr = sum(betnr != betnr[_n-1])

      . by year: replace count_betnr = count_betnr[_N]
      (529,324 real changes made)
      .
      . sum count_betnr if year==2000

      Variable | Obs Mean Std. dev. Min Max
      -------------+---------------------------------------------------------
      count_betnr | 22,882 4564 0 4564 4564

      After that I collapsed as described above.
      sort betnr year
      collapse (mean) dummy w08_1 div wage_imp, by(betnr year)

      Then I compared the sum statistic for the aggregated variable.

      .sum betnr if year==2000

      Variable | Obs Mean Std. dev. Min Max
      -------------+---------------------------------------------------------
      betnr | 4,562 4.97e+07 97959.63 4.96e+07 4.99e+07


      That's how I recognized that 2 establishments are "lost", see bold and underlined values.

      Comment


      • #4
        The key to this may lie in the fact that your original betnr are typically very big. What variable type are you using to hold them?

        If you merge back with your original data you should see which establishments have been omitted.

        Comment


        • #5
          Thank you Nick.
          The variable type of betnr is "long".

          Comment


          • #6
            Do you what can be done, to avoid this problem?

            Comment


            • #7
              long should be fine. Thanks for that detail.

              Clearly your dataset is far too large to post here and you've not provided a data example showing the problem.

              In #4 I suggested

              If you merge back with your original data you should see which establishments have been omitted.
              That's my only remaining idea, sorry, but you don't seem to have tried it.

              Comment


              • #8
                I tried your idea with the merge-command, the problem is that I get the following error message: variable betnr does not uniquely identify observations in the master data

                So merging back with my original data seems not possible

                Comment


                • #9
                  Not so. You've got overlapping variables; the only details are that the merge should be on betnr year and is emphatically not 1:1.

                  Consider this example:

                  Code:
                  . sysuse auto, clear
                  (1978 automobile data)
                  
                  . collapse mpg, by(foreign rep78)
                  
                  . l
                  
                       +----------------------------+
                       | rep78    foreign       mpg |
                       |----------------------------|
                    1. |     1   Domestic        21 |
                    2. |     2   Domestic    19.125 |
                    3. |     3   Domestic        19 |
                    4. |     4   Domestic   18.4444 |
                    5. |     5   Domestic        32 |
                       |----------------------------|
                    6. |     .   Domestic     23.25 |
                    7. |     3    Foreign   23.3333 |
                    8. |     4    Foreign   24.8889 |
                    9. |     5    Foreign   26.3333 |
                   10. |     .    Foreign        14 |
                       +----------------------------+
                  
                  . set obs 11
                  Number of observations (_N) was 10, now 11.
                  
                  . replace rep78 = 6 in L
                  (1 real change made)
                  
                  . drop mpg
                  
                  . merge 1:m foreign rep78 using auto
                  (label origin already defined)
                  
                      Result                      Number of obs
                      -----------------------------------------
                      Not matched                             1
                          from master                         1  (_merge==1)
                          from using                          0  (_merge==2)
                  
                      Matched                                74  (_merge==3)
                      -----------------------------------------
                  Now here I know what the problem is, because I created it. But merge flags the problem, an observation -- for which _merge is 1 -- which I created deliberately in the reduced dataset that wasn't in the original.

                  Your problem is, if I understand it correctly, that observations in the original don't show up even in summary form in the reduced, and for those _merge should be 2.

                  Comment


                  • #10
                    Thank you Nick, I adjusted the merge-command to

                    merge 1:m betnr jahr using "$data/orig_data", nogen keep(match) gen(_merge2)

                    but with this I get the error message "err not found"

                    unfortunately I don't understand to what this error message refers to

                    Comment


                    • #11
                      Neither do I, unless it's a side-effect of the global macro you're using.

                      Comment


                      • #12
                        I also tried it without the global macro and with the exact file path, but the same error message appears again.

                        A big thank you Nick for your help!

                        Comment


                        • #13
                          So, we now have results that look puzzling for both collapse and merge, both long-standing commands that have been banged on thousands, indeed I guess millions, of times, such that new bugs in either might be surprising. I am wondering if there are bizarre details in your dataset.

                          I still have hopes of ways forward.

                          1. The first puzzle is implied by the variables

                          Code:
                          dummy w08_1 div wage_imp betnr
                          for the subset jahr or year 2000. How many observations is that in the original dataset? Can you show the problem in a dataset small enough to post here using dataex?

                          2. Neither command has short or simple code. But setting

                          Code:
                          set trace on 
                          set traced 1
                          before the merge call might show where the error is coming from.

                          Comment

                          Working...
                          X