  • Identifying if an observation in one variable appears in another variable

    Dear all,
    I am working on a project and have run into a problem with my dataset.
    I have a dataset with two variables labeled ID1 and ID2. I want to generate a dummy variable that is equal to 1 if the value of ID1 in an observation is found anywhere among the values of ID2, and 0 otherwise. Do you have any advice on how to do this?

  • #2
    countmatch (SSC) counts matches. Here is a toy example:

    Code:
     clear 
     
    input id1 id2 
    1    5      
    2    6
    3    7
    4    8 
    5. 5    9 
    end 
    
    . 
    countmatch id1 id2, gen(count) 
    
    . 
    list 
    
         +-------------------+
         | id1   id2   count |
         |-------------------|
      1. |   1     5       0 |
      2. |   2     6       0 |
      3. |   3     7       0 |
      4. |   4     8       0 |
      5. |   5     9       1 |
         +-------------------+
    In this case by construction there is just one match and the count could be used as an indicator variable. In other situations, something like

    Code:
    gen indicator = count > 0
    will be needed.
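
    For readers who cannot install from SSC, the same indicator can be sketched in official Stata alone. This is only a minimal sketch, assuming numeric identifiers and a modest number of distinct values in id2; the variable name indicator and the local macro name levels are illustrative choices, not part of countmatch.

    ```stata
    clear
    input id1 id2
    1    5
    2    6
    3    7
    4    8
    5    9
    end

    * start from "no match" everywhere
    gen byte indicator = 0

    * collect the distinct nonmissing values of id2 in a local macro
    levelsof id2, local(levels)

    * flag every observation whose id1 equals one of those values
    foreach v of local levels {
        replace indicator = 1 if id1 == `v'
    }

    list
    ```

    With many distinct values the loop becomes slow, which is one reason to prefer a dedicated command.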



    • #3
      In #2 the last data line of the input block picked up a stray "5." when it was pasted. Anyone wishing to repeat the code should please note this revision.

      Code:
      input id1 id2  
      1    5      
      2    6
      3    7
      4    8  
      5    9  
      end



      • #4
        Thank you! This was exactly what I wanted to do and really helpful.



        • #5
          If the variables are numeric, you can also do this with rangestat (from SSC). In this case, you define an interval to count the number of observations where the value of id2 is the same as id1 for the current observation:

          Code:
          clear
          input id1 id2  
          1    5      
          2    6
          3    7
          4    8  
          5    9  
          end
          
          rangestat (count) n=id2, interval(id2 id1 id1)
          gen wanted = cond(mi(n), 0, 1)
          list
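
          To see what the interval does when id2 holds duplicates, or when there is no match at all, a small made-up dataset may help. The data below are hypothetical and chosen only for illustration; the sketch assumes rangestat is installed from SSC and that, as in the listing shown later in this thread, the count is left missing when nothing falls inside the interval.

          ```stata
          * assumes: ssc install rangestat
          clear
          input id1 id2
          1    3
          2    3
          3    7
          4    1
          end

          * for each observation, count the values of id2 lying in [id1, id1]
          rangestat (count) n=id2, interval(id2 id1 id1)

          * n is a count when at least one match exists and missing otherwise
          gen byte wanted = !missing(n)

          list
          ```

          Here id1 values 1 and 3 occur in id2 (3 occurs twice), so those two observations should be flagged, while 2 and 4 should not.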



          • #6
            Yes; rangestat (Picard, Cox, Ferrer) makes countmatch (Cox) pretty much redundant.



            • #7
              Robert or Nick,
              I found this post as a potentially simple solution to a very similar problem. I have the observation numbers of matched firms (from teffects nnmatch) as a variable, and I need to flag the observations listed there as being part of a control group. I have created a running index variable (gen obs = _n) to compare with the variable containing the observation numbers of matched firms. This post appears to do what I want, but it is not working as desired. The rangestat approach works as described for the id1 and id2 example posted, but not with other data. Please see the additional data for comparison:
              Code:
               clear
              input id1 id2 id3
              1    5    2
              2    6    . 
              3    7    .
              4    8    3
              5    9    3
              end
              
               
              rangestat (count) n2=id2, interval(id2 id1 id1)
              rangestat (count) n3=id3, interval(id3 id1 id1)

              n2 shows a 1 for the fifth observation, which indicates that there was exactly one instance of the number 5 in id2. However, n3 is empty. I was expecting to see a 1 for the second observation and a 2 for the third.

              Is there something I can do to get rangestat to perform in this manner?
              Thank you in advance!



              • #8
                In general in Stata, missings will be ignored -- so

                Code:
                summarize x
                ignores missings -- unless they are the focus of attention -- so

                Code:
                count if missing(x)
                does what it is instructed to do.

                Does this do what you want?

                Code:
                 clear
                input id1 id2 id3
                1    5    2
                2    6    .
                3    7    .
                4    8    3
                5    9    3
                end
                
                
                mvencode id?, mv(0)
                
                rangestat (count) n2=id2, interval(id2 id1 id1)
                rangestat (count) n3=id3, interval(id3 id1 id1)
                
                
                
                 list
                
                     +---------------------------+
                     | id1   id2   id3   n2   n3 |
                     |---------------------------|
                  1. |   1     5     2    .    . |
                  2. |   2     6     0    .    1 |
                  3. |   3     7     0    .    2 |
                  4. |   4     8     3    .    . |
                  5. |   5     9     3    1    . |
                     +---------------------------+
                
                .
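
                Because mvencode overwrites the missing values in the original variables, a variant that works on a temporary copy may sometimes be preferable. The following is only a sketch under two assumptions: rangestat is installed from SSC, and 0 never occurs as a genuine identifier (otherwise the recoded zeros would count as matches). The name id3key is hypothetical.

                ```stata
                * assumes: ssc install rangestat
                clear
                input id1 id2 id3
                1    5    2
                2    6    .
                3    7    .
                4    8    3
                5    9    3
                end

                * copy id3 with missings recoded to 0; id3 itself stays untouched
                gen id3key = cond(missing(id3), 0, id3)

                * count the values of the copy that equal id1 in each observation
                rangestat (count) n3=id3key, interval(id3key id1 id1)

                * the helper is no longer needed once n3 exists
                drop id3key

                list
                ```

                The counts in n3 should agree with the listing above, while id3 keeps its original missing values.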



                • #9
                  Nick,
                  Thank you for the mvencode refinement. All working now as desired.
