Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop missing observations at the end of each group

    Dear Statalist members,

    I would like to drop missing observations based on their position in the dataset, i.e. whenever they are at the end of a group. I used to have a dataset where each variable was a city and observations were individuals, so for cities with a few individuals only, the rest of the "column" (because it comes from Excel) would be missing because individuals were filled in other cities for this observation.

    Now that I reshaped this data into an appropriate, format, I would like to drop all the missing variables that are located at the end of each city, i.e. from the last non missing observation to the end of the dataset. Now there are missing individuals that I want to keep, and those are the ones who just didn't fill their names and should be between two non-missing observations.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 city str2 individual
    "city1" "a"
    "city1" "b"
    "city1" "c"
    "city1" "" 
    "city1" "d"
    "city1" "" 
    "city1" "" 
    "city1" "" 
    "city1" "" 
    "city2" "a"
    end

    The idea here is to drop all the observations from the last nonmissing value i.e. city1, d to the first observation of the next group i.e. city2, a. But the missing value between city1, c and city2, d should be kept. How can I do something like this?

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 city str2 individual
    "city1" "a"
    "city1" "b"
    "city1" "c"
    "city1" "" 
    "city1" "d"
    "city1" "" 
    "city1" "" 
    "city1" "" 
    "city1" "" 
    "city2" "a"
    end
    gen order=_n
    bys city (order): gen sum= sum(!missing(individual))
    by city: egen max= max(sum)
    drop if sum==max & missing(individual)
    Res.:

    Code:
    . l, sep(0)
    
         +--------------------------------------+
         |  city   indivi~l   order   sum   max |
         |--------------------------------------|
      1. | city1          a       1     1     4 |
      2. | city1          b       2     2     4 |
      3. | city1          c       3     3     4 |
      4. | city1                  4     3     4 |
      5. | city1          d       5     4     4 |
      6. | city2          a      10     1     1 |
         +--------------------------------------+

    Comment


    • #3
      There is some variation in terminology here, but the problem is clearly that of dropping observations with missing values beyond the last non-missing value for each city (and not one of dropping variables, and not quite one of dropping missing observations).

      So you find the last non-missing value, and then proceed. See also https://www.stata.com/support/faqs/d...t-occurrences/ and https://www.stata-journal.com/articl...article=dm0055 (especially Section 9). You may already have a variable that would serve instead of obsno created here.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str5 city str2 individual
      "city1" "a"
      "city1" "b"
      "city1" "c"
      "city1" ""
      "city1" "d"
      "city1" ""
      "city1" ""
      "city1" ""
      "city1" ""
      "city2" "a"
      "city2" ""
      end
      
      gen long obsno = _n
      
      bysort city: egen lastnm = max(cond(individual != "", obsno, .))
      
      list, sepby(city)
      
      drop if obsno > lastnm
      
      list
      Results:


      Code:
      . list, sepby(city)
      
           +-----------------------------------+
           |  city   indivi~l   obsno   lastnm |
           |-----------------------------------|
        1. | city1          a       1        5 |
        2. | city1          b       2        5 |
        3. | city1          c       3        5 |
        4. | city1                  4        5 |
        5. | city1          d       5        5 |
        6. | city1                  6        5 |
        7. | city1                  7        5 |
        8. | city1                  8        5 |
        9. | city1                  9        5 |
           |-----------------------------------|
       10. | city2          a      10       10 |
       11. | city2                 11       10 |
           +-----------------------------------+
      
      .
      . drop if obsno > lastnm
      (5 observations deleted)
      
      .
      . list
      
           +-----------------------------------+
           |  city   indivi~l   obsno   lastnm |
           |-----------------------------------|
        1. | city1          a       1        5 |
        2. | city1          b       2        5 |
        3. | city1          c       3        5 |
        4. | city1                  4        5 |
        5. | city1          d       5        5 |
           |-----------------------------------|
        6. | city2          a      10       10 |
           +-----------------------------------+
      Last edited by Nick Cox; 19 Jan 2023, 01:30.

      Comment


      • #4
        So the order of observations seems to be very important, so my first step is to create a variable called order that stores that order. The ! is the not operator, so !missing() is true (1) when the argument is not missing and false (0) otherwise. The sum() function returns the running sum, so for the first observation it is just notmiss for that observation, for the second observation is notmiss for the previous observation plus notmiss for the current observation and so on. With the by prefix I make sure to do that for each city separately, and with (order) I sort within each city by the variable that contains the original order, i.e. I maintain the order in the dataset. I create a variable todrop to indicate which observations are to be dropped. You could drop them directly, but this way you can first inspect those observations before dropping them. It is just to easy to make a mistake with such logic statements, that I don't trust myself and I want to check before I drop. So that is why I added that extra step. Doing a list like in the example for your complete dataset is likely to result in way to much output, but you can do something like list city individual notmiss todrop in 1/50, sepby(city) to limit the number of variables and observations you show on screen. After that it is just a matter of dropping the marked observations and dropping the variables you no longer need.

        Code:
        . clear
        
        . input str5 city str2 individual
        
                  city  individ~l
          1. "city1" "a"
          2. "city1" "b"
          3. "city1" "c"
          4. "city1" ""
          5. "city1" "d"
          6. "city1" ""
          7. "city1" ""
          8. "city1" ""
          9. "city1" ""
         10. "city2" "a"
         11. end
        
        .
        . gen order = _n
        
        . gen notmiss = !missing(individual)
        
        . bysort city (order): replace notmiss = sum(notmiss)
        (8 real changes made)
        
        . by city : gen byte todrop = (notmiss == notmiss[_N] & missing(individual))
        
        . list, sepby(city)
        
             +---------------------------------------------+
             |  city   indivi~l   order   notmiss   todrop |
             |---------------------------------------------|
          1. | city1          a       1         1        0 |
          2. | city1          b       2         2        0 |
          3. | city1          c       3         3        0 |
          4. | city1                  4         3        0 |
          5. | city1          d       5         4        0 |
          6. | city1                  6         4        1 |
          7. | city1                  7         4        1 |
          8. | city1                  8         4        1 |
          9. | city1                  9         4        1 |
             |---------------------------------------------|
         10. | city2          a      10         1        0 |
             +---------------------------------------------+
        
        . drop if todrop
        (4 observations deleted)
        
        . drop notmiss todrop order
        
        . list, sepby(city)
        
             +------------------+
             |  city   indivi~l |
             |------------------|
          1. | city1          a |
          2. | city1          b |
          3. | city1          c |
          4. | city1            |
          5. | city1          d |
             |------------------|
          6. | city2          a |
             +------------------+
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment

        Working...
        X