Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying matches between two string variables - with that may not be in the same row

    The following data is extracted from two Endnote searches. I want to compare the two searches (respectively named Var 1 and Var2) and identify values that are in Var1 but not Var 2 and values in Var2 that are not in Var1. As the data examples show, a value may be present in both variables but on different rows. For my purpose this is still a match. The variables have different numbers of rows.

    Advice on how this might be achieved would be appreciated. Would a sort be the first step?





    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str236 Var1
    `"Alexius, B., et al. (1995). "Involuntarily detained HIV-infected patients in Sweden: Reasons for referral and psychiatric diagnoses." Medicine and Law 14(7-8): 491-500."'                                                                   
    `"Alibekov, I. M. (1997). "Treatment of neurosensory hypoacusis of antibiotic and non-antibiotic etiology with mydocalm and nootropil: comparison of efficacy." Vestnik otorinolaringologii(4): 20-23."'                                       
    `"Alla, V., et al. (2010). "Aortic and coronary thrombosis following methamphetamine use and resuscitated SCD: Coincidence or shared pathogenesis?" Journal of General Internal Medicine 25: S493."'                                           
    `"Allott, K. and J. Redman (2007). "Are there sex differences associated with the effects of ecstasy/3,4-methylenedioxymethamphetamine (MDMA)?" Neuroscience And Biobehavioral Reviews 31(3): 327-347."'                                       
    `"Allott, K., et al. (2009). "Neuroendocrine and subjective responses to pharmacological challenge with citalopram: a controlled study in male and female ecstasy/MDMA users." Journal Of Psychopharmacology (Oxford, England) 23(7): 759-774."'
    `"Alsene, K. M., et al. (2005). "Effects of d-amphetamine and smoking abstinence on cue-induced cigarette craving." Experimental And Clinical Psychopharmacology 13(3): 209-218."'                                                             
    `"Dos Santos, R. G., et al. (2011). "Autonomic, neuroendocrine, and immunological effects of ayahuasca: a comparative study with d-amphetamine." Journal Of Clinical Psychopharmacology 31(6): 717-726."'                                      
    `"dos Santos, R. G., et al. (2014). "Salvinorin a and related compounds as therapeutic drugs for psychostimulant-related disorders." Current Drug Abuse Reviews 7(2): 128-132."'                                                               
    `"Downey, L. A., et al. (2015). "Reduced memory skills and increased hair cortisol levels in recent Ecstasy/MDMA users: significant but independent neurocognitive and neurohormonal deficits." Human Psychopharmacology 30(3): 199-207."'     
    `"Drabiak-Syed, K. (2011). "Reining In the pharmacological enhancement train: We Should remain vigilant about regulatory standards for prescribing controlled substances." Journal of Law, Medicine and Ethics 39(2): 272-279."'               
    `"Drevets, W. C., et al. (2001). "Amphetamine-induced dopamine release in human ventral striatum correlates with euphoria." Biological Psychiatry 49(2): 81-96."'                                                                              
    "D'Souza, M. S. and A. Markou (2010). Neural substrates of psychostimulant withdrawal-induced anhedonia. 3: 119-178."                                                                                                                           
    `"Ducours, J. L., et al. (1989). "[Craniofacial injuries and cerebral tomographic scintigraphy using N-isopropyl-iodo-amphetamine 123 I-AMP]." Revue De Stomatologie Et De Chirurgie Maxillo-Faciale 90(2): 89-94."'                           
    `"Duffau, B. E., et al. (2015). "High Performance Thin Layer Chromatography method for analysis of 3,4-methylenedioxymethamphetamine in seized tablets." Journal of Pharmacy and Pharmacognosy Research 3(6): 162-170."'                       
    `"Dumont, G. J. H. and R. J. Verkes (2006). "A review of acute effects of 3,4-methylenedioxymethamphetamine in healthy volunteers." Journal Of Psychopharmacology (Oxford, England) 20(2): 176-187."'                                          
    end
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str236 Var2
    `"Alessi, S. M., et al. (2003). "The prediction of individual differences in response to D-amphetamine in healthy adults." Behavioural Pharmacology 14(1): 19-32."'                                                                            
    `"Alexius, B., et al. (1995). "Involuntarily detained HIV-infected patients in Sweden: Reasons for referral and psychiatric diagnoses." Medicine and Law 14(7-8): 491-500."'                                                                   
    `"Alibekov, I. M. (1997). "Treatment of neurosensory hypoacusis of antibiotic and non-antibiotic etiology with mydocalm and nootropil: comparison of efficacy." Vestnik otorinolaringologii(4): 20-23."'                                       
    `"Alla, V., et al. (2010). "Aortic and coronary thrombosis following methamphetamine use and resuscitated SCD: Coincidence or shared pathogenesis?" Journal of General Internal Medicine 25: S493."'                                           
    `"Allott, K. and J. Redman (2007). "Are there sex differences associated with the effects of ecstasy/3,4-methylenedioxymethamphetamine (MDMA)?" Neuroscience And Biobehavioral Reviews 31(3): 327-347."'                                       
    `"Allott, K., et al. (2009). "Neuroendocrine and subjective responses to pharmacological challenge with citalopram: a controlled study in male and female ecstasy/MDMA users." Journal Of Psychopharmacology (Oxford, England) 23(7): 759-774."'
    `"Dlugos, A., et al. (2007). "Norepinephrine transporter gene variation modulates acute response to D-amphetamine." Biological Psychiatry 61(11): 1296-1305."'                                                                                 
    `"Docherty, J. R. (2008). "Pharmacology of stimulants prohibited by the World Anti-Doping Agency (WADA)." British Journal Of Pharmacology 154(3): 606-622."'                                                                                   
    `"Donaldson, C. D., et al. (2016). "Nonmedical use of prescription stimulants in college students: Attitudes, intentions, and vested interest." Addictive Behaviors 53: 101-107."'                                                             
    `"Dos Santos, R. G., et al. (2011). "Autonomic, neuroendocrine, and immunological effects of ayahuasca: a comparative study with d-amphetamine." Journal Of Clinical Psychopharmacology 31(6): 717-726."'                                      
    `"dos Santos, R. G., et al. (2014). "Salvinorin a and related compounds as therapeutic drugs for psychostimulant-related disorders." Current Drug Abuse Reviews 7(2): 128-132."'                                                               
    `"Downey, L. A., et al. (2015). "Reduced memory skills and increased hair cortisol levels in recent Ecstasy/MDMA users: significant but independent neurocognitive and neurohormonal deficits." Human Psychopharmacology 30(3): 199-207."'     
    `"Drabiak-Syed, K. (2011). "Reining In the pharmacological enhancement train: We Should remain vigilant about regulatory standards for prescribing controlled substances." Journal of Law, Medicine and Ethics 39(2): 272-279."'               
    `"Drevets, W. C., et al. (2001). "Amphetamine-induced dopamine release in human ventral striatum correlates with euphoria." Biological Psychiatry 49(2): 81-96."'                                                                               
    "D'Souza, M. S. and A. Markou (2010). Neural substrates of psychostimulant withdrawal-induced anhedonia. 3: 119-178."                                                                                                                           
    end

  • #2
    After running the code from your first dataex to read in Var1 I ran
    Code:
    tempfile file1
    save `file1'
    After then running the code from your second dataex to read in Var2 I ran
    Code:
    rename Var2 Var1
    merge 1:1 Var1 using `file1'
    generate start = substr(Var1,1,20)
    list start _merge, clean
    which gave me the following report
    Code:
                          start            _merge  
      1.   Alessi, S. M., et al   master only (1)  
      2.   Alexius, B., et al.        matched (3)  
      3.   Alibekov, I. M. (199       matched (3)  
      4.   Alla, V., et al. (20       matched (3)  
      5.   Allott, K. and J. Re       matched (3)  
      6.   Allott, K., et al. (       matched (3)  
      7.   D'Souza, M. S. and A       matched (3)  
      8.   Dlugos, A., et al. (   master only (1)  
      9.   Docherty, J. R. (200   master only (1)  
     10.   Donaldson, C. D., et   master only (1)  
     11.   Dos Santos, R. G., e       matched (3)  
     12.   Downey, L. A., et al       matched (3)  
     13.   Drabiak-Syed, K. (20       matched (3)  
     14.   Drevets, W. C., et a       matched (3)  
     15.   dos Santos, R. G., e       matched (3)  
     16.   Alsene, K. M., et al    using only (2)  
     17.   Ducours, J. L., et a    using only (2)  
     18.   Duffau, B. E., et al    using only (2)  
     19.   Dumont, G. J. H. and    using only (2)
    Last edited by William Lisowski; 22 Apr 2017, 06:35.

    Comment


    • #3
      William,

      Many thanks.This looks simple and elegant. However, I became lost at your first step: "After running the code from your first dataex"

      For ease I just named Var1, file1. I then tried to run the following code on two files,


      version 14
      cd "F:"
      set type double
      set more off
      clear

      use file1
      tempfile file1


      use Var2

      rename Var2 Var1
      merge 1:1 Var1 using `file1'
      generate start = substr(Var1,1,20)
      list start _merge, clean

      Here is the output received:

      use file1

      . tempfile file1

      .
      .
      . use Var2

      .
      . rename Var2 Var1

      . merge 1:1 Var1 using `file1'
      file C:\Users\Bob\AppData\Local\Temp\ST_00000001.tmp not found
      r(601);

      end of do-file

      r(601);


      I also tried, this without the temp file option

      use file1
      use Var2

      merge 1:1 Var2 using `file1'
      generate start = substr(Var2,1,20)
      list start _merge, clean

      use file1

      . use Var2

      .
      . merge 1:1 Var2 using `file1'
      invalid file specification
      r(198);

      end of do-file

      r(198);

      Any advice on what I am doing wrong would be helpful.In the real example, file 1 has 1509 rows and Var2 has 1469 (does substr(Var2,1,20) become substr(Var2, 1, 1469)?

      Thanks again.

      Bob




      Comment


      • #4
        When you issue the -tempfile file1- command, all that you are telling Stata to do is to assign the name file1 to a local macro that contains a name that could, eventually, be used for a temporary data file. It does not actually create the file. So later on when you try to -merge- with `file1', Stata finds no such file in existence. You have to -save `file1'- to actually create the file before you can make use of it.

        But it's not clear that you actually need any intermediary temporary files. You can just merge the data directly


        Code:
        use first_file, clear // CONTAINS VAR1
        rename var1 var2
        merge 1:1 var2 using second_file // SECOND FILE HAS VARIABLE NAMED var2
        I think William used a tempfile because he was starting not from an existing file but from the -dataex- output. So he had to create a file out of that. Notice that he does -save `file1'- immediately after declaring -tempfile file1-.

        Comment


        • #5
          Clyde,

          Thanks. I'm still struggling with this. Assuming first_file is file 1 above (Var1), why do I need to rename var1 to var2? If I merge using var1 renamed to var2, aren't I merging the same file, just renamed.

          use first_file, clear // CONTAINS VAR1 rename var1 var2 merge 1:1 var2 using second_file // SECOND FILE HAS VARIABLE NAMED var2




          Comment


          • #6
            I was finally able to perform the merge only after watching a Youtube video which employed the drop down menu. The key difference between the advice from William & Clyde above as well as what I have read, was that below I used the full path instead of just the file names.

            use "D:\stata\Data\first_file.dta"
            merge 1:1 _n using "D:\stata\Data\second_file.dta"
            generate start = substr(var1,1,20)
            list start _merge, clean

            I don't know why I needed the full path, but it was the only way I could get the job done.

            Comment


            • #7
              You would need the full path if you were working in a different directory or folder.

              Code:
              pwd
              tells which you are working in if it is not evident from the GUI.

              Comment


              • #8
                Following up on Nick's post #7, in post #3 you tell us you have issued the command
                Code:
                cd "F:"
                which sets your current (working) directory to F: and having done so,
                Code:
                use "first_file.dta"
                is then shorthand for
                Code:
                use "F:\first_file.dta"
                You needed
                Code:
                cd "D:\stata\Data"
                in order for
                Code:
                use "first_file.dta"
                to do what you wanted.

                More to the point, your command
                Code:
                merge 1:1 _n using "D:\stata\Data\second_file.dta"
                is not the command Clyde and I specified and did not give you the results you desire. Reading the output of
                Code:
                help merge
                will inform you that your command paired the first observation in first_file with the first observation in second_file, regardless of whether Var1 in first_file was the same as Var2 in second_file. So if you were to actually look at the results you obtain from this code, you will see that you have both Var1 and Var2 and in general they are not equal to each other. All the _merge variable tells you is that it found a first observation in each file, a second observation in each file, and so forth.

                My belief is that what you want to obtain eventually is a list of distinct citations from your Endnote data, with an indication for each of whether it came from the first search results in first_file or the second search results in second_file or from both. To use merge to accomplish this, the variables need to have the same name in both files, and you need to use them, not _n, as the key to be matched between the two files. The _merge variable which is automatically created will then tell you which file each appeared in.

                Comment


                • #9
                  Thanks Nick & William,

                  Due to frustration I created a new directory in D (to start afresh) but had been working from F. I agree it is confusing using the cd "F:" code, whch in some instances but not all may have caused problems.

                  Regarding the code "merge 1:1 _n" that William detected, this is a case of looking but not seeing the difference. However, when I tried the original code I'm back to errors:

                  use "D:\stata\Data\first_file.dta" , clear
                  merge 1:1 var1 using "D:\stata\Data\second_file.dta"
                  generate start = substr(var1,1,20)
                  list start _merge, clean

                  use "D:\stata\Data\first_file.dta"

                  . merge 1:1 var1 using "D:\stata\Data\second_file.dta"
                  variable var1 does not uniquely identify observations in the master data
                  r(459);

                  end of do-file

                  r(459);

                  . describe

                  Contains data from D:\stata\Data\first_file.dta
                  obs: 1,508
                  vars: 1 23 Apr 2017 18:06
                  size: 877,656
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  storage display value
                  variable name type format label variable label
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  var1 str582 %582s Original
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  Sorted by: var1


                  use second_file

                  . describe

                  Contains data from second_file.dta
                  obs: 1,469
                  vars: 1 23 Apr 2017 18:07
                  size: 854,958
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  storage display value
                  variable name type format label variable label
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  var1 str582 %582s New
                  --------------------------------------------------------------------------------------------------------------------------------------------
                  Sorted by:


                  With "merge 1:1 _n" there were 39 references listed as "using only (2)" so I planned to check these manually to see what was detected. A quick check reveals _n detected the last 39 references.

                  I went back to William's & Clyde's posts and see they included a rename command, I tried this :

                  use "D:\stata\Data\first_file.dta", clear
                  rename var1 var2
                  merge 1:1 var2 using "D:\stata\Data\second_file.dta"
                  generate start = substr(var2,1,20)
                  list start _merge, clean

                  and receive this error - merge 1:1 var2 using "D:\stata\Data\second_file.dta"
                  variable var2 not found

                  There is something that I am not getting with all this. I re-read the help files but I still can't see why this doesn't work.

                  Thanks again

                  Comment


                  • #10
                    In your example data, you the first file contained a variable called var1, and the second contained a variable called var2. Despite their different names, they represent the same things and you are trying to find which values occur in each data set separately, and which occur in both data sets. Stata is telling you that this description of affairs is not the reality of your data. It appears that your second data set does not contain a variable named var2. So the question is, what variable does it contain? (Note: Stata variable names are case sensitive, so if it contains VAR2, or Var2, etc., Stata will not find a match to var2.) The point is that to do a -merge- on data sets where you pair up on a common variable, the variable has to have the same name in both data sets. So you have to rename the variable in the first data set to have the same name as the corresponding variable in the second data set.

                    The other issue is that for a 1:1 merge to work, that same variable must uniquely identify observations in both data sets. So if that is not the case (and, apparently it is not based on the other error message you got), then you have to -duplicates drop var1- in the first data set, and perhaps -duplicates drop var2- in the second data set before trying to -merge- them.

                    Now let me anticipate a problem you may encounter with -duplicates drop ...- Stata may find that the observations that are duplicates on var1 may differ on other variables in the data set. (In your example data, there are no other variables, so this would not be an issue, but we've already established that your example data was an oversimplification.) If that is the case, Stata will not let you do -duplicates drop- because you would be losing data. So if you run into that problem, given the very limited purpose of this exercise, you can add the -force- option to -duplicates drop var1- and -duplicates drop var2-. I do not ordinarily recommend using -duplicates drop anything, force- because the loss of information can be problematic. But in this case, all we care about are the values of var1 and var2 themselves: for the present purpose every other variable is irrelevant. So losing the information about those other variables does no harm for this purpose. I'm emphasizing that so that you don't get into the habit of using the -force- option whenever you encounter a message that there is a problem. You should only use it in situations like this where a careful and thorough analysis leads to the conclusion that the loss of information is irrelevant.

                    So the overall approach assumes you have dataset1 with variable var1, and dataset2 with variable var2.

                    Code:
                    use dataset2, clear
                    duplicates drop var2 // POSSIBLY WITH -force- OPTION
                    tempfile two
                    save `two'
                    
                    use dataset1, clear
                    duplicates drop var1 // POSSIBLY WITH -force- OPTION
                    rename var1 var2
                    merge 1:1 var2 using `two'
                    The -merge- variable will now by 1 for values that came only from dataset1, 2 for values that came only from datset2, and 3 for values that are found in both data sets.

                    Comment


                    • #11
                      Clyde's done a yeoman's job on this in post #10.

                      I'd like to take a step back: depending on your objectives, using merge may be making more trouble for you than it's worth.

                      What is it you want from this process? In post #8 I speculated on what you wanted, but you haven't confirmed or denied that. Here are three possibilities:
                      1. A list of distinct citations (that is, at most one copy of each) from your Endnote queries, with an indication for each of whether it came from the first search results in first_file or the second search results in second_file or from both.
                      2. A list of distinct citations from your Endnote queries, and a count of how many times each citation occurs in each set of query results.
                      3. A list of distinct citations from your Endnote queries, and you don't care which query it came from, nor how many times it was found by each query. And perhaps you actually have a third and a fourth and a fifth Endnote query you haven't told us about yet, and you believe you can generalize what you get from this limited exercise to the larger problem.
                      In the first case, proceed along the lines Clyde indicated. In the second or third cases, tell us - because there are simpler approaches using the append command. Like the approach using merge, it will depend on having your Endnote query response variable being named identically in both datasets. I realize that seems unintuitive and confusing to you, but in another thread you established that you are an infrequent user of Stata whose expertise is limited to the tasks you need to do routinely, so your intuition in this new area is not to be trusted. If possible, you should consider rerunning the programs that created first_file and second_file from the Endnote query data, and change them to create new files with the variable name the same in both cases. If you want to be able to tell from the data which query the result came from, add
                      Code:
                      generate querynumber = 1
                      to the code for first_file and
                      Code:
                      generate querynumber = 2
                      to the code for the second file.

                      On another issue, to present data, code, and results readably, please copy them from the Results window or your log file into a code block in the Forum editor, as explained in the Statalist FAQ linked to at the top of the page. For example, the following:

                      [code]
                      // sample code
                      sysuse auto, clear
                      describe
                      [/code]

                      will be presented in the post as the following:
                      Code:
                      // sample code
                      sysuse auto, clear
                      describe
                      You will note that the syntax is just like what the dataex provided in the material used in your example in post #1.

                      Comment


                      • #12
                        William & Clyde,

                        Many thanks for your replies. I'll test out the code tonight. In relation to William's question, I want to ascertain what citations are in search 1 and not search 2; and vice versa. The number of discrepant citations is of value, however, it is the specific citations that I want to see.

                        Comment


                        • #13
                          Using one file, I trialed Clyde's suggestions to run drop duplicates.
                          Code:
                          clear
                          use "D:\stata\Data\first_file - Copy.dta"
                          duplicates drop var1, force
                          I had to use the force option, because I received a message: "varlist not allowed"

                          When
                          Code:
                           duplicates
                          is used without the
                          Code:
                           force
                          option, I read it it is possible to obtain the list of duplicates. When force is used is it possible to obtain a list of duplicates?

                          If this is possible, I'll attempt what William suggested: "If possible, you should consider rerunning the programs that created first_file and second_file from the Endnote query data, and change them to create new files with the variable name the same in both cases." Does this output below suggest otherwise?






                          Last edited by Bob Green; 24 Apr 2017, 07:55.

                          Comment


                          • #14
                            When force is used is it possible to obtain a list of duplicates?
                            Just use
                            Code:
                            duplicates list var1
                            to obtain the list of observations with duplicated values of var1. The
                            Code:
                            force
                            option does not affect the determination of which values are duplicated. As the output of
                            Code:
                            help duplicates
                            tells you, the
                            Code:
                            force
                            option is required to drop observations when you specify a varlist, as you do with
                            Code:
                            duplicates drop var1
                            but the purpose of the option is only to remind you that you may be losing information, because variables other than var1 may differ for observations with the same value of var1. Other than that, the
                            Code:
                            force
                            option has no effect on the command.

                            Added: because your var1 is a str582 variable, you may find using
                            Code:
                            duplicates list var1, string(20)
                            to limit the display of var1 to the first 20 characters to be helpful.
                            Last edited by William Lisowski; 24 Apr 2017, 08:47.

                            Comment


                            • #15
                              Thanks William & Clyde,

                              The following code now runs without error

                              Code:
                              version 14
                              cd "D:\stata\Data\"
                              set type double
                              set more off
                              clear
                              
                              use second_file, clear
                              duplicates drop var1, force
                              tempfile two
                              save `two'
                              describe
                              
                              use first_file, clear
                              duplicates drop var1, force
                              merge 1:1 var1 using `two'
                              describe
                              The results:

                              . use first_file, clear

                              . duplicates drop var1, force

                              Duplicates in terms of var1

                              (0 observations are duplicates)

                              . * rename var1 var2
                              . merge 1:1 var1 using `two'

                              Result # of obs.
                              -----------------------------------------
                              not matched 24
                              from master 8 (_merge==1)
                              from using 16 (_merge==2)

                              matched 1,453 (_merge==3)
                              -----------------------------------------

                              . describe

                              Contains data from first_file.dta
                              obs: 1,477
                              vars: 2 24 Apr 2017 23:00
                              size: 861,091
                              --------------------------------------------------------------------------------------------------------------------------------------------
                              storage display value
                              variable name type format label variable label
                              --------------------------------------------------------------------------------------------------------------------------------------------
                              var1 str582 %582s
                              _merge byte %23.0g _merge
                              --------------------------------------------------------------------------------------------------------------------------------------------
                              Sorted by:
                              Note: Dataset has changed since last saved.

                              .
                              end of do-file


                              I will now go back to the searches and double check.The issue of variables and files confounded me. Previously, I have done something similar in R, where I simply had the two variables in a single file and compared the two variables (via setdiff).

                              Thanks again.


                              Comment

                              Working...
                              X