Combining keep and if

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

13 Jul 2020, 13:42

Hi All,

I have a super basic question. Consider the following dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2 var3)
  2 1 2
  3 1 3
  2 1 2
  1 1 3
123 0 2
  2 0 1
end

What I wish to do is keep only those observations of var1 where var2==1. The dataset I want, once this condition is applied (var2==1), contains only the first 4 rows of var1, with var1 as the only column.
This is achieved by:

Code:

keep if var2==1
keep var1

However, if I am to type:

Code:

keep var1 if var2==1

I get an invalid syntax error. Is there any obvious reason for this? If the command keep drops everything except what follows it, and what follows is only var1 under a specific condition, I do not see why this should not work.

Many thanks,
CS
Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#2

13 Jul 2020, 15:04

One obvious answer to your question is that it does not work because the programmers of Stata decided that this should be the desired behavior of keep. Maybe it is easier that way from the perspective of the programmers. But I understand your confusion because almost all other commands would allow a combination of variables and if-restriction as inputs.
1 like
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#3

13 Jul 2020, 15:17

Thanks a lot!
daniel klein

Join Date: Mar 2014

Posts: 3862
#4

14 Jul 2020, 02:31

I find Stata's language highly consistent here. In general, the if qualifier restricts the execution of a command to a specific subset of observations. From this perspective,

Code:

keep var1 if var2==1

would tell Stata to keep var1 but only for those observations for which var2==1. What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier? The result would probably be some sort of sparse data matrix that cannot be held as a dataset.
2 likes
Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#5

14 Jul 2020, 08:53

What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier?

My naive expectation would be that Stata drops the excluded observations. So from the outside, Stata's behavior for keep and drop is somewhat inconsistent, because other Stata commands don't differentiate between varlist and the if-qualifier in the way that these two commands do.
You can achieve the desired result (from a user perspective) by combining two commands instead of using one. So you could write a keep2 program which does precisely that.
So I still don't understand the exact reason why using varlist and the if-qualifier in the same keep command should not be possible.
Granted, writing two commands instead of one is not a big deal. It just deviates from the expected behavior.

The result would probably be some sort of sparse data matrix that cannot be held as a dataset.

You should get a sparse matrix if the dropping of observations happens in-place. But if you first write the rows and then the columns which fulfil the criteria set by the keep command, then you should end up with a normal matrix/dataset.
That's why my guess for this seemingly inconsistent behavior lies in behind-the-scenes considerations, maybe speed disadvantages, memory considerations, etc.
1 like
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#6

14 Jul 2020, 08:56

Originally posted by daniel klein

I find Stata's language highly consistent here. In general, the if qualifier restricts the execution of a command to a specific subset of observations. From this perspective,

Code:

keep var1 if var2==1

would tell Stata to keep var1 but only for those observations for which var2==1. What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier? The result would probably be some sort of sparse data matrix that cannot be held as a dataset.

Hi Daniel,

Perhaps I misunderstand your point. If the combination of commands:

Code:

keep if var2==1
keep var1

is able to hold the sparse data matrix that you mention, then there should be no reason for the single combined command to not be able to hold it as well.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

14 Jul 2020, 10:19

My naive expectation would be that Stata drops the excluded observations.

In no other case that I can think of does the if clause have any effect on observations excluded from the if clause, other than to fill in missing values for newly-created variables.

I agree with Daniel on this. The benefits of reducing two commands to one are outweighed by the unintuitive interpretation of what the expected results are. On Statalist we often see questions from users who think that, given a dataset of 100 observations of 10 variables, they can somehow "drop" 50 observations of one of the variables without affecting anything else. What they need to do, of course, is to replace the values in those observations with missing values. Supporting the combined syntax described, especially given the tendency of new users to ignore the documentation, would lead them to expect that drop could be told to do what they expect, and then they'd be confused as to why they lost 50 observations.
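
A small sketch of that distinction, using the example data from #1. The variable name twice is made up purely for illustration.

```stata
clear
input float(var1 var2 var3)
2 1 2
3 1 3
2 1 2
1 1 3
123 0 2
2 0 1
end

* if restricts where the command acts: twice is computed where var2==1
* and is filled in as missing elsewhere; no observation is removed.
generate twice = 2 * var1 if var2 == 1

* "Dropping" values of a single variable means replacing them with missing.
replace var1 = . if var2 != 1

* All 6 observations are still in memory.
assert _N == 6
list
```

Neither command changes the number of observations, which is the usual meaning of if that a combined keep varlist if expr would have to depart from.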

If Stata were to do it over again, I expect they would build on the understanding that led to commands that have subcommands (label, import, and export, for example) and implement

Code:

keep variables varlist
keep observations if expr

rather than try to infer which version is needed. They could even allow convenience abbreviations (like bysort is):

Code:

keepv varlist
keepo if expr

Of course, nothing stops the user from writing their own mykeep command that runs two keep commands in succession from one set of arguments.
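
For what it's worth, here is a minimal sketch of what such a wrapper might look like. The name mykeep and its behavior (observations first, then variables) are assumptions for illustration only; it is not an existing Stata command, and it has not been hardened against edge cases.

```stata
* Hypothetical wrapper, not an official command.
capture program drop mykeep
program define mykeep
    version 14
    syntax varlist [if] [in]
    * Mark the observations selected by if/in; novarlist means missing
    * values in varlist do not exclude an observation.
    marksample touse, novarlist
    * Step 1: reduce the observations.
    keep if `touse'
    * Step 2: reduce the variables.
    keep `varlist'
end

* Usage with the example data from #1
clear
input float(var1 var2 var3)
2 1 2
3 1 3
2 1 2
1 1 3
123 0 2
2 0 1
end

mykeep var1 if var2 == 1

* Four observations of a single variable remain.
assert _N == 4
assert c(k) == 1
```

Note that this gives if a meaning here ("also drop the rows that fail the condition") that differs from its meaning in other commands, which is exactly the ambiguity discussed above.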
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#8

14 Jul 2020, 10:20

Code:

keep varlist if condition

The issue that Daniel highlights in #4 is that the subset of variables being specified in -varlist- need not be the same as those used in -condition-. What you expect to happen is to reduce the dataset twice: first by -condition-, and then by keeping/dropping variables in -varlist-. However, the customary syntax of -if- everywhere else in Stata suggests a subtly different meaning, which is why it would be confusing for those familiar with Stata's syntax.

I don't particularly find the desire to combine these two lines very compelling, as it's already very clear what the intention of separate -keep- operations is, while adding only 4 extra letters. On the other hand, someone might need to spend a few seconds understanding the combined command.
1 like
Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#9

14 Jul 2020, 12:39

I think I missed the details that Leonardo and William are pointing out. I thought in terms of double reduction and not in terms of subsetting. I have not yet encountered a situation in which these differences in the interpretation of the if-qualifier mattered.
So thank you, Daniel, Leonardo and William for the explanations.