Drop missing observations at the end of each group

Hugo Denis

Join Date: Apr 2022

Posts: 27
#1

Drop missing observations at the end of each group

19 Jan 2023, 00:54

Dear Statalist members,

I would like to drop missing observations based on their position in the dataset, i.e. whenever they are at the end of a group. I used to have a dataset where each variable was a city and observations were individuals, so for cities with a few individuals only, the rest of the "column" (because it comes from Excel) would be missing because individuals were filled in other cities for this observation.

Now that I reshaped this data into an appropriate, format, I would like to drop all the missing variables that are located at the end of each city, i.e. from the last non missing observation to the end of the dataset. Now there are missing individuals that I want to keep, and those are the ones who just didn't fill their names and should be between two non-missing observations.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str5 city str2 individual "city1" "a" "city1" "b" "city1" "c" "city1" "" "city1" "d" "city1" "" "city1" "" "city1" "" "city1" "" "city2" "a" end

The idea here is to drop all the observations from the last nonmissing value i.e. city1, d to the first observation of the next group i.e. city2, a. But the missing value between city1, c and city2, d should be kept. How can I do something like this?
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10190

19 Jan 2023, 01:04

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 city str2 individual
"city1" "a"
"city1" "b"
"city1" "c"
"city1" "" 
"city1" "d"
"city1" "" 
"city1" "" 
"city1" "" 
"city1" "" 
"city2" "a"
end
gen order=_n
bys city (order): gen sum= sum(!missing(individual))
by city: egen max= max(sum)
drop if sum==max & missing(individual)

Res.:

Code:

. l, sep(0)

     +--------------------------------------+
     |  city   indivi~l   order   sum   max |
     |--------------------------------------|
  1. | city1          a       1     1     4 |
  2. | city1          b       2     2     4 |
  3. | city1          c       3     3     4 |
  4. | city1                  4     3     4 |
  5. | city1          d       5     4     4 |
  6. | city2          a      10     1     1 |
     +--------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35668

19 Jan 2023, 01:08

There is some variation in terminology here, but the problem is clearly that of dropping observations with missing values beyond the last non-missing value for each city (and not one of dropping variables, and not quite one of dropping missing observations).

So you find the last non-missing value, and then proceed. See also https://www.stata.com/support/faqs/d...t-occurrences/ and https://www.stata-journal.com/articl...article=dm0055 (especially Section 9). You may already have a variable that would serve instead of obsno created here.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 city str2 individual
"city1" "a"
"city1" "b"
"city1" "c"
"city1" ""
"city1" "d"
"city1" ""
"city1" ""
"city1" ""
"city1" ""
"city2" "a"
"city2" ""
end

gen long obsno = _n

bysort city: egen lastnm = max(cond(individual != "", obsno, .))

list, sepby(city)

drop if obsno > lastnm

list

Results:

Code:

. list, sepby(city)

     +-----------------------------------+
     |  city   indivi~l   obsno   lastnm |
     |-----------------------------------|
  1. | city1          a       1        5 |
  2. | city1          b       2        5 |
  3. | city1          c       3        5 |
  4. | city1                  4        5 |
  5. | city1          d       5        5 |
  6. | city1                  6        5 |
  7. | city1                  7        5 |
  8. | city1                  8        5 |
  9. | city1                  9        5 |
     |-----------------------------------|
 10. | city2          a      10       10 |
 11. | city2                 11       10 |
     +-----------------------------------+

.
. drop if obsno > lastnm
(5 observations deleted)

.
. list

     +-----------------------------------+
     |  city   indivi~l   obsno   lastnm |
     |-----------------------------------|
  1. | city1          a       1        5 |
  2. | city1          b       2        5 |
  3. | city1          c       3        5 |
  4. | city1                  4        5 |
  5. | city1          d       5        5 |
     |-----------------------------------|
  6. | city2          a      10       10 |
     +-----------------------------------+

Last edited by Nick Cox; 19 Jan 2023, 01:30.

Comment

Maarten Buis

Join Date: Mar 2014

Posts: 3449
#4

19 Jan 2023, 01:25

So the order of observations seems to be very important, so my first step is to create a variable called order that stores that order. The ! is the not operator, so !missing() is true (1) when the argument is not missing and false (0) otherwise. The sum() function returns the running sum, so for the first observation it is just notmiss for that observation, for the second observation is notmiss for the previous observation plus notmiss for the current observation and so on. With the by prefix I make sure to do that for each city separately, and with (order) I sort within each city by the variable that contains the original order, i.e. I maintain the order in the dataset. I create a variable todrop to indicate which observations are to be dropped. You could drop them directly, but this way you can first inspect those observations before dropping them. It is just to easy to make a mistake with such logic statements, that I don't trust myself and I want to check before I drop. So that is why I added that extra step. Doing a list like in the example for your complete dataset is likely to result in way to much output, but you can do something like list city individual notmiss todrop in 1/50, sepby(city) to limit the number of variables and observations you show on screen. After that it is just a matter of dropping the marked observations and dropping the variables you no longer need.

Code:

. clear . input str5 city str2 individual city individ~l 1. "city1" "a" 2. "city1" "b" 3. "city1" "c" 4. "city1" "" 5. "city1" "d" 6. "city1" "" 7. "city1" "" 8. "city1" "" 9. "city1" "" 10. "city2" "a" 11. end . . gen order = _n . gen notmiss = !missing(individual) . bysort city (order): replace notmiss = sum(notmiss) (8 real changes made) . by city : gen byte todrop = (notmiss == notmiss[_N] & missing(individual)) . list, sepby(city) +---------------------------------------------+ | city indivi~l order notmiss todrop | |---------------------------------------------| 1. | city1 a 1 1 0 | 2. | city1 b 2 2 0 | 3. | city1 c 3 3 0 | 4. | city1 4 3 0 | 5. | city1 d 5 4 0 | 6. | city1 6 4 1 | 7. | city1 7 4 1 | 8. | city1 8 4 1 | 9. | city1 9 4 1 | |---------------------------------------------| 10. | city2 a 10 1 0 | +---------------------------------------------+ . drop if todrop (4 observations deleted) . drop notmiss todrop order . list, sepby(city) +------------------+ | city indivi~l | |------------------| 1. | city1 a | 2. | city1 b | 3. | city1 c | 4. | city1 | 5. | city1 d | |------------------| 6. | city2 a | +------------------+

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
2 likes
Comment

Announcement