  • How to Distinctively Remove the Observations with Specific Missing Value Patterns in Stata?

    Hello, I have a small dataset as follows,
    clear
    input byte (id gr)
    1 1
    1 2
    1 2
    1 .
    1 .
    1 .
    2 1
    2 2
    2 3
    2 .
    2 .
    2 .
    2 5
    2 6
    3 1
    3 .
    3 3
    3 4
    3 .
    3 6
    4 1
    4 1
    4 .
    4 .
    4 .
    4 .
    4 .
    end

    Within each id, I want to delete the observations that form an unbroken run of missing values of "gr" extending from some position through the last observation, as happens for id==1 and id==4.
    For id==2 and id==3, "gr" also has missing values, but they do not form an unbroken run that reaches the last observation within the id.
    So, the data with id==2 and id==3 should be kept as they are.
    Can someone help me with Stata code?
    Thank you!

  • #2
    Code:
    //  MARK CURRENT ORDERING
    gen long obs_no = _n
    
    //  MARK RUNS OF MISSING VALUES/ NON-MISSING VALUES
    by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))
    
    //  REMOVE A FINAL RUN OF ALL MISSING VALUES
    by id (obs_no): drop if run_num == run_num[_N] & missing(gr[_N])
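
    A minimal sketch for checking the result, assuming only the id==1 and id==2 rows from #1 are loaded. The shortened data and the two -assert- lines are illustrative additions, not part of the original answer:

    ```stata
    clear
    input byte (id gr)
    1 1
    1 2
    1 2
    1 .
    1 .
    1 .
    2 1
    2 2
    2 3
    2 .
    2 .
    2 .
    2 5
    2 6
    end

    gen long obs_no = _n
    by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))
    by id (obs_no): drop if run_num == run_num[_N] & missing(gr[_N])

    // id 1 loses its three trailing missing values; id 2 is untouched
    by id (obs_no): assert !missing(gr[_N])
    assert _N == 11
    list id gr, sepby(id) noobs
    ```

    Note that an id whose gr is missing in every observation would be removed entirely, since all of its observations form a single final run of missing values.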


    • #3
      Replying to Clyde Schechter's code in #2:
      Could you explain the second line of your code? Thank you!
      I don't fully understand the meaning of it.


      • #4
        Within each id, with the observations in their original order, we create a running count of runs of consecutive missing or consecutive non-missing values of gr. But I suppose that is just a slight rephrasing of the comment in the code, so perhaps not helpful. I think that it is hard to put in words, but easy to see. Just run the first two lines of code and then -browse- the data. I think it will be easy to spot what has happened.
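
        A sketch of that experiment, using -list- in place of -browse- and only the id==3 and id==4 rows from #1 (a shortened data set chosen here for illustration):

        ```stata
        clear
        input byte (id gr)
        3 1
        3 .
        3 3
        3 4
        3 .
        3 6
        4 1
        4 1
        4 .
        4 .
        4 .
        4 .
        4 .
        end

        gen long obs_no = _n
        by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))

        // run_num should read 1 2 3 3 4 5 for id 3 and 1 1 2 2 2 2 2 for id 4
        list id gr run_num, sepby(id) noobs
        ```

        Each time gr switches between missing and non-missing, run_num goes up by one, so every run of like observations shares a single value of run_num.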


        • #5
          Thank you, professor!


          • #6
            Replying to Clyde Schechter's explanation in #4:
            Professor, I saw what happened to the original data after running the first two lines of code, but I don't understand the syntax inside sum(missing(gr) != missing(gr[_n-1])).
            For example, what does != missing(gr[_n-1]) mean? Thank you for your help!


            • #7
              -!= missing(gr[_n-1])- by itself means nothing.

              -missing(gr) != missing(gr[_n-1])- is a logical expression. Let's unpack it. missing(x) is true if the value of x is missing and false otherwise. gr[_n-1] means, in any observation, the value of gr in the preceding observation. And != is the "is not equal" operator. So -missing(gr) != missing(gr[_n-1])- is true if and only if either gr is missing in the current observation and not missing in the preceding one, or gr is not missing in the current observation but is missing in the preceding one. Another way of putting it is that the missingness of gr differs between the present and preceding observations.

              The -sum()- function in Stata calculates a running sum. Now, how do we sum logical expressions? In Stata, when a logical expression is calculated, false is represented by 0 and true is represented by 1.

              So -sum(missing(gr) != missing(gr[_n-1]))- gives a running count of how many times gr changed from missing to non-missing, or from non-missing to missing, in the data so far.
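
              The pieces can also be built one at a time. The following is a minimal sketch on a single made-up series with no -by- grouping; the variable name changed is hypothetical and used only for illustration:

              ```stata
              // a logical expression evaluates to 1 (true) or 0 (false)
              display (3 > 2) + (1 > 2)        // shows 1

              clear
              input byte gr
              1
              .
              3
              4
              .
              6
              end

              // in observation 1, gr[_n-1] is gr[0], which Stata evaluates as missing
              gen byte changed = missing(gr) != missing(gr[_n-1])
              gen run_num = sum(changed)

              // changed: 1 1 1 0 1 1    run_num: 1 2 3 3 4 5
              list gr changed run_num, noobs
              ```

              Because the first observation here is non-missing, it counts as a change and run_num starts at 1; had it been missing, run_num would start at 0.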


              • #8
                Thank you!


                • #9
                  Replying to Clyde Schechter's explanation in #7:
                  Professor, sorry to bother you again, but I still don't understand the last line of your code:
                  by id (obs_no): drop if run_num == run_num[_N] & missing(gr[_N])
                  Could you please explain the syntax of this line? Thank you!


                  • #10
                    Within each group of observations for an id, in their original sort order, delete those for which two things are true: the value of gr in the id's last observation is missing, and the value of run_num equals its value in that last observation. In simpler words, delete the final run of consecutive missing observations.
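
                    The two conditions can be inspected separately before anything is deleted. This is a sketch only, using the id==1 and id==3 rows from #1; the flag names in_last_run and ends_missing are hypothetical and do not appear in the original code:

                    ```stata
                    clear
                    input byte (id gr)
                    1 1
                    1 2
                    1 2
                    1 .
                    1 .
                    1 .
                    3 1
                    3 .
                    3 3
                    3 4
                    3 .
                    3 6
                    end

                    gen long obs_no = _n
                    by id (obs_no), sort: gen run_num = sum(missing(gr) != missing(gr[_n-1]))

                    // condition 1: the observation belongs to the id's last run
                    by id (obs_no): gen byte in_last_run = run_num == run_num[_N]
                    // condition 2: the id's last observation has missing gr
                    by id (obs_no): gen byte ends_missing = missing(gr[_N])

                    list id gr run_num in_last_run ends_missing, sepby(id) noobs
                    drop if in_last_run & ends_missing
                    ```

                    For id 3 the last observation is flagged by in_last_run, but ends_missing is 0, so nothing is dropped there; both conditions are needed to avoid deleting a final run of non-missing values.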


                    • #11
                      Professor, Thank you very much! I understand it now.
