  • Combining keep and if

    Hi All,

    I have a super basic question. Consider the following dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(var1 var2 var3)
      2 1 2
      3 1 3
      2 1 2
      1 1 3
    123 0 2
      2 0 1
    end
    What I wish to do is keep only those observations of var1 where var2==1. That is, the dataset I want contains only the column var1, and only the first 4 rows, for which var2==1.
    This is achieved by:

    Code:
    keep if var2==1
    keep var1
    However, if I am to type:

    Code:
    keep var1 if var2==1
    I get an invalid syntax error. Is there any obvious reason for this? If the command keep drops everything except what follows it, and what follows is var1 restricted by a specific condition, I do not see why this should not work.

    Many thanks,
    CS

  • #2
    One obvious answer to your question is that it does not work because the programmers of Stata decided that this should be the desired behavior of keep. Maybe it is easier that way from the perspective of the programmers. But I understand your confusion because almost all other commands would allow a combination of variables and if-restriction as inputs.

    • #3
      Thanks a lot!

      • #4
        I find Stata's language highly consistent here. In general, the if qualifier restricts the execution of a command to a specific subset of observations. From this perspective,

        Code:
        keep var1 if var2==1
        would tell Stata to keep var1 but only for those observations for which var2==1. What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier? The result would probably be some sort of sparse data matrix that cannot be held as a dataset.
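
        A small illustration of that general rule, as a sketch using the example data from #1 (the variable name var4 is made up for this example):

        ```stata
        * Example data from #1
        clear
        input float(var1 var2 var3)
          2 1 2
          3 1 3
          2 1 2
          1 1 3
        123 0 2
          2 0 1
        end

        * -if- restricts the command to observations with var2==1;
        * the other observations are simply left out of the calculation.
        summarize var1 if var2==1

        * For a newly created variable, the excluded observations
        * are filled in with missing values; nothing is dropped.
        generate var4 = 2*var1 if var2==1
        list
        ```

        In both commands the excluded observations stay in the dataset, which is why the same reading applied to keep has no obvious result.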

        • #5
          What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier?
          My naive expectation would be that Stata drops the excluded observations. So from the outside, Stata's behavior for keep and drop is somewhat inconsistent, because other Stata commands don't differentiate between varlist and the if-qualifier in the way that those two commands do.
          You can achieve the desired result (from a user perspective) by combining two commands instead of using one. So you could write a keep2 program which does precisely that.
          So I still don't understand the exact reason why using a varlist and the if-qualifier in the same keep command should not be possible.
          Granted, writing two commands instead of one is not a big deal. It just deviates from the expected behavior.

          The result would probably be some sort of sparse data matrix that cannot be held as a dataset.

          You should get a sparse matrix if the dropping of observations happens in-place. But if you first write the rows and then the columns which fulfil the criteria set by the keep command, then you should end up with a normal matrix/dataset.
          That's why my guess is that this seemingly inconsistent behavior stems from behind-the-scenes considerations, maybe speed disadvantages, memory considerations, etc.

          • #6
            Originally posted by daniel klein
            I find Stata's language highly consistent here. In general, the if qualifier restricts the execution of a command to a specific subset of observations. From this perspective,

            Code:
            keep var1 if var2==1
            would tell Stata to keep var1 but only for those observations for which var2==1. What is supposed to happen to var1 (and all other variables) for observations that are excluded by the if qualifier? The result would probably be some sort of sparse data matrix that cannot be held as a dataset.
            Hi Daniel,

            Perhaps I misunderstand your point. If the combination of commands:

            Code:
            keep if var2==1
            keep var1
            are able to hold the sparse data matrix that you mention, then there should be no reason for a single combined command not to be able to hold it as well.

            • #7
              My naive expectation would be that Stata drops the excluded observations.
              In no other case that I can think of does the if clause have any effect on observations excluded from the if clause, other than to fill in missing values for newly-created variables.

              I agree with Daniel on this. The benefits of reducing two commands to one are outweighed by the unintuitive interpretation of what the expected results are. On Statalist we often see questions from users who think that, given a dataset of 100 observations of 10 variables, they can somehow "drop" 50 observations of one of the variables without affecting anything else. What they need to do, of course, is to replace the values in those observations with missing values. Supporting the combined syntax described, especially given the tendency of new users to ignore the documentation, would lead them to expect that drop could be told to do what they expect, and then they'd be confused as to why they lost 50 observations.
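
              A minimal sketch of that distinction, using the example data from #1:

              ```stata
              * Example data from #1
              clear
              input float(var1 var2 var3)
                2 1 2
                3 1 3
                2 1 2
                1 1 3
              123 0 2
                2 0 1
              end

              * Set var1 to missing where var2 is not 1;
              * all six observations remain in the dataset.
              replace var1 = . if var2 != 1

              * By contrast, this removes the observations themselves,
              * for every variable.
              keep if var2 == 1
              ```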

              If Stata were to do it over again, I expect they would build on the understanding that led to commands that have subcommands (label, import, and export, for example) and implement
              Code:
              keep variables varlist
              keep observations if expr
              rather than try to infer which version is needed. They could even allow convenience abbreviations (as bysort is):
              Code:
              keepv varlist
              keepo if expr
              Of course, nothing stops the user from writing their own mykeep command that runs two keep commands in succession from one set of arguments.
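
              A minimal sketch of such a wrapper. The name mykeep and its exact behavior are assumptions for illustration; it is not an existing Stata command:

              ```stata
              * Hypothetical user-written wrapper: one set of arguments,
              * two -keep- commands run in succession.
              capture program drop mykeep
              program define mykeep
                  syntax varlist [if] [in]

                  * Mark the observations selected by -if- and -in-.
                  * -novarlist- stops missing values in varlist from
                  * excluding observations.
                  marksample touse, novarlist

                  * First reduce the observations ...
                  keep if `touse'

                  * ... then reduce the variables.
                  keep `varlist'
              end
              ```

              With the example data from #1, typing mykeep var1 if var2==1 would then leave only var1 for the four observations where var2==1.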

              • #8
                Code:
                keep varlist if condition
                The issue that Daniel highlights in #4 is that the subset of variables being specified in -varlist- need not be the same as those used in -condition-. What you expect to happen is to reduce the dataset twice: first by -condition-, and then by keeping/dropping variables in -varlist-. However, the customary syntax of -if- everywhere else in Stata suggests a subtly different meaning, which is why it would be confusing for those familiar with Stata's syntax.

                I don't particularly find the desire to combine these two lines very compelling, as it's already very clear what the intention of separate -keep- operations is, while adding only 4 extra letters. On the other hand, someone might need to spend a few seconds understanding the combined command.

                • #9
                  I think I missed the details that Leonardo and William are pointing out. I thought in terms of double reduction and not in terms of subsetting. I have not yet encountered a situation in which these differences in the interpretation of the if-qualifier mattered.
                  So thank you, Daniel, Leonardo and William for the explanations.
