Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to check if value of observation appears in other variables

    Hi,
    I would like to check if observations (the values of snl_institution_key, the first column in the example below) appear as observations in other variables (the subsequent columns below). In Excel it would done with vlookups, not sure how to to do it in Stata though. How would I go about doing this? Thanks!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long snl_institution_key float(matchdir1_snl_institution_key matchdir2_snl_institution_key matchdir3_snl_institution_key matchdir4_snl_institution_key)
    4144711       .       .       .       .
    4253984       .       .       .       .
    4145025       .       .       .       .
    4263041       .       .       .       .
    4253960       .       .       .       .
    4253961       .       .       .       .
    4253380       .       .       .       .
    4253986       .       .       .       .
    4263265       .       .       .       .
    4185986       .       .       .       .
    4253962       .       .       .       .
     113830 4085960 3002056 4571062 4198738
    4429684       .       .       .       .
    4253973       .       .       .       .
    4263263       .       .       .       .
    4155878       .       .       .       .
    4145330       .       .       .       .
    4253904       .       .       .       .
    4263273       .       .       .       .
    4253979       .       .       .       .
    4253903       .       .       .       .
    4182942       .       .       .       .
    4253966       .       .       .       .
    4253963       .       .       .       .
    4253968       .       .       .       .
     113985 4044461 4242195 4242165 4145330
    4621655       .       .       .       .
    4401733       .       .       .       .
    4140555       .       .       .       .
    4112994       .       .       .       .
    4156487       .       .       .       .
    4198738       .       .       .       .
    4259070       .       .       .       .
    4155881       .       .       .       .
    4403760       .       .       .       .
    4400794       .       .       .       .
    4253844       .       .       .       .
    4394314       .       .       .       .
    4263318       .       .       .       .
    4155880       .       .       .       .
    4210296       .       .       .       .
    4145332       .       .       .       .
    4219330       .       .       .       .
    4328959       .       .       .       .
    4253323       .       .       .       .
    4242263       .       .       .       .
    4621874       .       .       .       .
    4622044       .       .       .       .
    4621393       .       .       .       .
    4621855       .       .       .       .
    4621791       .       .       .       .
    4621517       .       .       .       .
    4538595       .       .       .       .
    4559164       .       .       .       .
    4263351       .       .       .       .
    4253679       .       .       .       .
    4620983       .       .       .       .
    4576215       .       .       .       .
    4621452       .       .       .       .
    4621431       .       .       .       .
    4620401       .       .       .       .
    4402001       .       .       .       .
    4620989       .       .       .       .
    4144343       .       .       .       .
    4622086       .       .       .       .
    4262988       .       .       .       .
    4253631       .       .       .       .
    4187359       .       .       .       .
    4622084       .       .       .       .
    4620797       .       .       .       .
    4620305       .       .       .       .
    4621461       .       .       .       .
    4621644       .       .       .       .
    4619674       .       .       .       .
    4620986       .       .       .       .
    4263176       .       .       .       .
    4620385       .       .       .       .
    4407379       .       .       .       .
    4213238       .       .       .       .
    4622043       .       .       .       .
    4577844       .       .       .       .
    4620368       .       .       .       .
    4547981       .       .       .       .
    4620765       .       .       .       .
    4622081       .       .       .       .
    4263247       .       .       .       .
    4621465       .       .       .       .
    4138555       .       .       .       .
    4620314       .       .       .       .
    4155882       .       .       .       .
    4089743 4155878 4144343 4145025 4145113
    4263902       .       .       .       .
    4387081       .       .       .       .
    4242197       .       .       .       .
    4253340       .       .       .       .
    4145042 4112994 4139624 4141303 4144343
    4328940       .       .       .       .
    4402747       .       .       .       .
    4628094       .       .       .       .
    4621615       .       .       .       .
    end
    Last edited by John Vourdas; 13 May 2018, 14:41.

  • #2
    You might try Kevin Crow's vlookup.

    Code:
    search vlookup
    to locate and download.

    Comment


    • #3
      Thanks for this. Actually I just want to check if the observation value appears in another variable. Thus far I tried looping over observations but this takes ages. Is there no inbuilt function for this fairly trivial and relatively useful task?

      Comment


      • #4
        Note that your match variables are stored as float, a data type that cannot store more than 7 decimal digits accurately; make sure that you do not have identifiers with more than 7 digits. I'll also assume that each observation is uniquely identified by the value of snl_institution_key. Your unwieldy variable names will make for unwieldy code so I took the liberty of simplifying your variable names. So if you process the data example presented in #1 with the following code:
        Code:
        * shorten unwieldy variable names
        rename matchdir*_snl_institution_key match*
        rename snl_institution_key snl_id
        isid snl_id, sort
        save "dataex.dta", replace
        list in 1/5
        the first few observations of the data, now ordered by snl_id will look like:
        Code:
        . list in 1/5
        
             +-------------------------------------------------+
             |  snl_id    match1    match2    match3    match4 |
             |-------------------------------------------------|
          1. |  113830   4085960   3002056   4571062   4198738 |
          2. |  113985   4044461   4242195   4242165   4145330 |
          3. | 4089743   4155878   4144343   4145025   4145113 |
          4. | 4112994         .         .         .         . |
          5. | 4138555         .         .         .         . |
             +-------------------------------------------------+
        The standard way to match values in Stata is to use merge. This is not particularly hard but the finer details depend on if you expect multiple matches and want to track these, including in which variables they are found. In the following, I keep it simple and assume that you want to find if the snl_id value is found in any of the match variables. The strategy is to make a list of matched values. Since these are in a wide layout, you need to use reshape to convert them to a long layout. With one observation per value to match, you then use merge to match these values with the initial snl_id codes.
        Code:
        * use merge strategy
        use "dataex.dta", clear
        keep snl_id match*
        reshape long match, i(snl_id)
        bysort match: keep if _n == 1
        keep if !mi(match)
        keep match
        list
        
        rename match snl_id
        merge 1:1 snl_id using "dataex.dta",
        drop if _merge == 1
        sort snl_id
        gen foundit = _merge == 3
        list snl_id if foundit
        The list of codes to match is (results from the first list command):
        Code:
        . list
        
             +---------+
             |   match |
             |---------|
          1. | 3002056 |
          2. | 4044461 |
          3. | 4085960 |
          4. | 4112994 |
          5. | 4139624 |
             |---------|
          6. | 4141303 |
          7. | 4144343 |
          8. | 4145025 |
          9. | 4145113 |
         10. | 4145330 |
             |---------|
         11. | 4155878 |
         12. | 4198738 |
         13. | 4242165 |
         14. | 4242195 |
         15. | 4571062 |
             +---------+
        and the final list:
        Code:
        . list snl_id if foundit
        
             +---------+
             |  snl_id |
             |---------|
          4. | 4112994 |
          7. | 4144343 |
          9. | 4145025 |
         11. | 4145330 |
         13. | 4155878 |
             |---------|
         21. | 4198738 |
             +---------+
        You can also use rangestat (from SSC) to perform the lookup. It's a bit tricky to use because with rangestat, the variable you want to look up (the key variable) cannot contain missing values. The workaround is simple, you replace missing values by a code that does not exist in snl_id (I used -999). In the example below, v2use is the key variable and, for each observation, rangestat will prepare a data subset with all observations where the value of v2use is within an interval that starts and ends with the value of snl_id for the current observation. The (count) statistic will return the number of observations in this subset where v2use is not missing.
        Code:
        use "dataex.dta", clear
        forvalues i=1/4 {
            gen long v2use = cond(mi(match`i'), -999, match`i')
            rangestat (count) v2use, interval(v2use snl_id snl_id)
            rename v2use_count count`i'
            drop v2use
        }
        egen overall = rowtotal(count*)
        list snl_id count* overall if overall > 0
        and the results
        Code:
        . list snl_id count* overall if overall > 0
        
             +-------------------------------------------------------+
             |  snl_id   count1   count2   count3   count4   overall |
             |-------------------------------------------------------|
          4. | 4112994        1        .        .        .         1 |
          7. | 4144343        .        1        .        1         2 |
          9. | 4145025        .        .        1        .         1 |
         11. | 4145330        .        .        .        1         1 |
         13. | 4155878        1        .        .        .         1 |
             |-------------------------------------------------------|
         21. | 4198738        .        .        .        1         1 |
             +-------------------------------------------------------+

        Comment

        Working...
        X