  • Using countmatch with nonunique variable

    Dear Community,

    I have the following two string variables in the dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str32 judgename str26 pco_pakistanallyears
    "A.N. Bhandari" ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  ""
    "Aalia Neelam"  "A.N. Bhandari"
    end

    Basically, I want to construct a dummy variable that satisfies three criteria:

    1) It takes the value 1 if the string in judgename is found anywhere in the pco_pakistanallyears variable, i.e. if the entry exists in both variables.

    2) It takes the value 0 if the entry exists in judgename but does NOT exist in pco_pakistanallyears (and is missing otherwise).

    3) Since the same judge is mentioned multiple times in judgename, the new variable should take the value 1 only once per judge, i.e. the first time that judge is mentioned.

    countmatch satisfies the first two conditions but not the third. The code I tried, based on the countmatch help file, is as follows:

    Code:
    countmatch judgename pco_pakistanallyears, gen(pcojudgesdummy)
    However, I want to count and construct the variable for unique matches of judgename with pco_pakistanallyears. From the countmatch help file, I think I probably have to use the tag() function, but nothing I have tried works, e.g. countmatch judgename pco_pakistanallyears, egen = tag(pcojudgesdummy) fails.

    Can anybody help me out here? Thank you in advance!

    Cheers,
    Sultan
    Last edited by Sultan Mehmood; 15 Feb 2017, 16:37.

  • #2
    I am not sure why you want to do this. Flagging a match for only the "first time" seems arbitrary. Any reordering of the data can render the flagging meaningless.

    That said, I think your conditions can be satisfied. See the example below. BTW, you should note that countmatch is a user-written command.
    Code:
    clear
    input id str10 fruit1 str10 fruit2
    1 "Apple" "Orange"
    2 "Apple" ""
    3 "Apple" "Pear"
    4 "Banana" "Apple"
    5 "Banana" ""
    6 "Banana" "Orange"
    7 "Orange" ""
    8 "Orange" "Apple"
    9 "Pear" ""
    10 "Pear" "Pear"
    end
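    sort fruit1 fruit2
    * count how often each fruit1 value occurs in fruit2
    countmatch fruit1 fruit2, gen(match) // ssc install countmatch
    * number the observations within each fruit1 group
    bysort fruit1 (fruit2): gen seq = _n
    * reduce the count to a 0/1 flag on the first observation of each group
    replace match = 1 if match > 0 & !missing(match) & seq == 1
    * set every later observation in the group to 0
    replace match = 0 if seq != 1
    drop seq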
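
    Since the question mentions tag(), here is a possible variant that uses the egen function tag() to pick one observation per name. It is only a sketch: it assumes countmatch returns, for each fruit1 value, the number of times that value occurs in fruit2, and the variable names nmatch, first and flag are made up for illustration.

    ```stata
    clear
    input str10 fruit1 str10 fruit2
    "Apple"  "Pear"
    "Apple"  ""
    "Banana" "Apple"
    "Banana" ""
    "Pear"   ""
    end

    * assumed: nmatch holds the number of times fruit1 occurs in fruit2
    countmatch fruit1 fruit2, gen(nmatch) // ssc install countmatch

    * tag() marks exactly one observation per distinct fruit1 value
    egen byte first = tag(fruit1)

    * 1 on the tagged observation of a matched name, 0 on all other
    * observations, missing where fruit1 is empty
    gen byte flag = (nmatch > 0 & !missing(nmatch)) & first if fruit1 != ""
    ```

    Which observation tag() marks within a group is not something to rely on, so this only guarantees one flagged row per matched name, not a particular row.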


    • #3
      Thank you, Aspen, for this. Perhaps this dummy variable is not what I am looking for, if it seems counter-intuitive to you.

      MY INTENT:

      The reason I wanted a dummy that takes the value 1 only the first time a judgename appears is that I want to compare the outcomes of judges who are in judgename but NOT in pco_pakistanallyears with the outcomes of judges who are in pco_pakistanallyears.

      That is, some judges did something special (pco_pakistan), so I want to compare the outcomes of judges who appear in the pco variable with the outcomes of judges in judgename who do not. Since countmatch would count the same judge multiple times across different cases, I thought of constructing the dummy I specified above.

      Should I take an alternative approach?

      Thank you again for your reply.

      Cheers,
      Sultan

      P.S.: In particular, I want to compare the caselag of judges in the judgename variable with the caselag of judges in the pco_pakistanallyears variable. Note that each caselag value is matched with the judgename entry on the same row, while pco_pakistanallyears was added afterwards as a subset of judgename. The dataset is presented below:


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str32 judgename float caselag str26 pco_pakistanallyears
      "Jawad S. Khawaja"           1 "Khawja Muhammad Sharif"  
      ""                           0 "Mian Saqib Nisar"        
      "Iftikhar Hussain Chaudhry"  1 "Asif Saeed Khan Khosa"  
      "Khalil Ul Rahman"           . "Muhammad Tahir Ali"      
      "Nisar Hussain Khan"        13 "Ijaz Ahmad Chaudhry"    
      "Nisar Hussain Khan"        13 "M. A. Shahid Siddiqui"  
      "Mazhar Alam Khan Miankhel" 13 "Muhammad Jehangir Arshad"
      "Mazhar Alam Khan Miankhel"  0 "Sheikh Azmat Saeed"      
      "Raja Muhammad Khurshid"     . "Umar Atta Bandial"      
      "Hamid Farooq"               5 "Iqbal Hameed-Ur-Rehman"  
      "Iftikhar Hussain"           3 "Rahmat Hussain Jafferi"  
      "Muhammad Najam -Uz- Zamzn"  3 "Khilji Arif Hussain"    
      end
      Last edited by Sultan Mehmood; 16 Feb 2017, 03:48.


      • #4
        If I am understanding your reasoning correctly, adding this line to my example would simplify the data and address the issue.
        Code:
        bysort fruit1: keep if _n==1
        Please give the command a try. I still don't understand the basic unit of the data. Is this for social network analysis? That's the only context under which I can think of a data structure with the same set of possible values placed in two variables.

        Another note: does your caselag variable here correspond to the first or the third variable? The caselag values are different for the two entries starting with "Mazhar Alam Khan Miankhel".
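
        If the goal is to compare caselag between judges who are on the PCO list and judges who are not, one alternative that avoids picking a "first" row is to flag every case by its judge's list membership. The following is a rough sketch, assuming caselag belongs to judgename on the same row and that pco_pakistanallyears is just a list of names stored alongside the data. The rows are invented for illustration (they are not the real data), and the variable name pco is hypothetical.

        ```stata
        clear
        * illustrative rows only, not the real data
        input str32 judgename float caselag str26 pco_pakistanallyears
        "Nisar Hussain Khan" 13 "Hamid Farooq"
        "Nisar Hussain Khan"  5 ""
        "Hamid Farooq"        5 "Hamid Farooq"
        "Iftikhar Hussain"    3 ""
        ""                    0 ""
        end

        * build a file of distinct PCO names, keyed on judgename
        preserve
        keep pco_pakistanallyears
        drop if pco_pakistanallyears == ""
        duplicates drop
        rename pco_pakistanallyears judgename
        tempfile pcolist
        save `pcolist'
        restore

        * flag every case by whether its judge is on the list
        merge m:1 judgename using `pcolist', keep(master match)
        gen byte pco = (_merge == 3) if judgename != ""
        drop _merge

        * compare caselag across the two groups
        tabstat caselag, by(pco) statistics(n mean)
        ```

        With this layout every case keeps its own caselag, so no information is thrown away; if a formal comparison is wanted, something like ttest caselag, by(pco) might be a starting point.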
