How to Distinctively Remove the Observations with Specific Missing Value Patterns in Stata?

smith Jason

Join Date: Sep 2020

Posts: 380
#1

23 Jul 2022, 22:39

Hello, I have a small dataset as follows:
clear
input byte (id gr)
1 1
1 2
1 2
1 .
1 .
1 .
2 1
2 2
2 3
2 .
2 .
2 .
2 5
2 6
3 1
3 .
3 3
3 4
3 .
3 6
4 1
4 1
4 .
4 .
4 .
4 .
4 .
end

Within each id, I want to delete the observations that form an unbroken run of missing values of "gr" extending from some position through the last observation of that id, as in id==1 and id==4.
The data for id==2 and id==3 also contain missing values, but there the missing values do not run without interruption through the last observation of the id.
So the data for id==2 and id==3 should be kept as they are.
Can someone help me with the Stata code?
Thank you!

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

23 Jul 2022, 22:50

Code:

//  MARK CURRENT ORDERING
gen long obs_no = _n

//  MARK RUNS OF MISSING VALUES/ NON-MISSING VALUES
by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))

//  REMOVE A FINAL RUN OF ALL MISSING VALUES
by id (obs_no): drop if run_num == run_num[_N] & missing(gr[_N])

smith Jason

Join Date: Sep 2020
Posts: 380

24 Jul 2022, 15:19

Could you explain the second line of your code? I don't fully understand what it does.
Thank you!

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

24 Jul 2022, 16:15

Within each id, with the observations in their original order, we create a running count of runs of consecutive missing or consecutive non-missing values of gr. But I suppose that is just a slight rephrasing of the comment in the code, so perhaps not helpful. I think that it is hard to put in words, but easy to see. Just run the first two lines of code and then -browse- the data. I think it will be easy to spot what has happened.
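For readers who would rather see the result than browse it, here is a minimal sketch that runs the first two commands on a subset of the example data (ids 1 and 3 only, chosen purely for illustration) and lists the outcome.

```stata
clear
input byte(id gr)
1 1
1 2
1 2
1 .
1 .
1 .
3 1
3 .
3 3
3 4
3 .
3 6
end

// record the original order, then number the runs within each id
gen long obs_no = _n
by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))

// run_num stays constant within a run and goes up by 1 when a new run starts
list id gr run_num, sepby(id) noobs
```

For id 1, run_num is 1, 1, 1, 2, 2, 2: one run of non-missing values followed by one run of missing values. For id 3 it is 1, 2, 3, 3, 4, 5, because the missingness of gr flips almost every observation.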
smith Jason

Join Date: Sep 2020

Posts: 380
#5

24 Jul 2022, 16:24

Thank you, professor!
smith Jason

Join Date: Sep 2020

Posts: 380
#6

24 Jul 2022, 18:39

Professor, I saw what happened to the original data after running the first two lines of code, but I don't understand the syntax inside sum(missing(gr) != missing(gr[_n-1])).
For example, what does != missing(gr[_n-1]) mean? Thank you for your help!
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#7

24 Jul 2022, 18:55

-!= missing(gr[_n-1])- by itself means nothing.

-missing(gr) != missing(gr[_n-1])- is a logical expression. Let's unpack it. missing(x) is true if the value of x is missing and false otherwise. gr[_n-1] means, in any observation, the value of gr in the preceding observation. And != is the "is not equal to" operator. So -missing(gr) != missing(gr[_n-1])- is true if and only if either gr is missing in the current observation and not missing in the preceding one, or gr is not missing in the current observation but is missing in the preceding one. Another way of putting it is that the missingness of gr differs between the present and preceding observations.

The -sum()- function in Stata calculates a running sum. Now, how do we sum logical expressions? In Stata, when a logical expression is calculated, false is represented by 0 and true is represented by 1.

So -sum(missing(gr) != missing(gr[_n-1]))- gives a running count of how many times gr changed from missing to non-missing, or from non-missing to missing, in the data so far.
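The same calculation can be split into two steps, which may be easier to follow. This is only a sketch: `change` and `run_check` are hypothetical names introduced for illustration, and the data are just id 3 from the example. One detail worth noting is that in the first observation of each id, gr[_n-1] refers to observation 0, which Stata evaluates as missing, so an id that starts with a non-missing value begins with a count of 1.

```stata
clear
input byte(id gr)
3 1
3 .
3 3
3 4
3 .
3 6
end
gen long obs_no = _n

// step 1: 1 where missingness differs from the preceding observation, else 0
by id (obs_no), sort: gen byte change = missing(gr) != missing(gr[_n-1])

// step 2: running sum of those 0/1 values, which reproduces run_num
by id (obs_no): gen run_check = sum(change)

list gr change run_check, noobs
```

Here change is 1, 1, 1, 0, 1, 1 and run_check is 1, 2, 3, 3, 4, 5. The only 0 is at gr == 4, where both the current and the preceding value are non-missing.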
smith Jason

Join Date: Sep 2020

Posts: 380
#8

24 Jul 2022, 19:40

Thank you!
smith Jason

Join Date: Sep 2020

Posts: 380
#9

24 Jul 2022, 20:07

Professor, sorry to trouble you again, but I still don't understand the last line of your code:
by id (obs_no): drop if run_num == run_num[_N] & missing(gr[_N])
Could you please explain the syntax of this line? Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#10

24 Jul 2022, 20:21

Within each group of observations for an id, in their original sort order, delete those where two things happen: the value of gr in the last observation is missing, and the value of run_num is the last value. In simpler words, delete the last group of consecutive missing observations.
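Before dropping anything, it can be reassuring to flag the affected observations and inspect them first. The sketch below does that on ids 2 and 4 from the example; `to_drop` is a hypothetical name used for illustration and is not part of the code posted above.

```stata
clear
input byte(id gr)
2 1
2 2
2 3
2 .
2 .
2 .
2 5
2 6
4 1
4 1
4 .
4 .
4 .
4 .
4 .
end
gen long obs_no = _n
by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))

// flag the last run of an id, but only when that id ends in a missing value
by id (obs_no): gen byte to_drop = run_num == run_num[_N] & missing(gr[_N])

list id gr run_num to_drop, sepby(id) noobs
drop if to_drop
```

Nothing is flagged for id 2, because its last value of gr (6) is not missing, so its interior missing values are kept. For id 4 the five trailing missing values are flagged and dropped, leaving its two non-missing observations.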
smith Jason

Join Date: Sep 2020

Posts: 380
#11

24 Jul 2022, 21:05

Professor, thank you very much! I understand it now.
