Identifying matches between two string variables - with that may not be in the same row

Bob Green

Join Date: Oct 2015
Posts: 85

Identifying matches between two string variables - with that may not be in the same row

22 Apr 2017, 05:56

The following data is extracted from two Endnote searches. I want to compare the two searches (respectively named Var 1 and Var2) and identify values that are in Var1 but not Var 2 and values in Var2 that are not in Var1. As the data examples show, a value may be present in both variables but on different rows. For my purpose this is still a match. The variables have different numbers of rows.

Advice on how this might be achieved would be appreciated. Would a sort be the first step?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str236 Var1
`"Alexius, B., et al. (1995). "Involuntarily detained HIV-infected patients in Sweden: Reasons for referral and psychiatric diagnoses." Medicine and Law 14(7-8): 491-500."'                                                                   
`"Alibekov, I. M. (1997). "Treatment of neurosensory hypoacusis of antibiotic and non-antibiotic etiology with mydocalm and nootropil: comparison of efficacy." Vestnik otorinolaringologii(4): 20-23."'                                       
`"Alla, V., et al. (2010). "Aortic and coronary thrombosis following methamphetamine use and resuscitated SCD: Coincidence or shared pathogenesis?" Journal of General Internal Medicine 25: S493."'                                           
`"Allott, K. and J. Redman (2007). "Are there sex differences associated with the effects of ecstasy/3,4-methylenedioxymethamphetamine (MDMA)?" Neuroscience And Biobehavioral Reviews 31(3): 327-347."'                                       
`"Allott, K., et al. (2009). "Neuroendocrine and subjective responses to pharmacological challenge with citalopram: a controlled study in male and female ecstasy/MDMA users." Journal Of Psychopharmacology (Oxford, England) 23(7): 759-774."'
`"Alsene, K. M., et al. (2005). "Effects of d-amphetamine and smoking abstinence on cue-induced cigarette craving." Experimental And Clinical Psychopharmacology 13(3): 209-218."'                                                             
`"Dos Santos, R. G., et al. (2011). "Autonomic, neuroendocrine, and immunological effects of ayahuasca: a comparative study with d-amphetamine." Journal Of Clinical Psychopharmacology 31(6): 717-726."'                                      
`"dos Santos, R. G., et al. (2014). "Salvinorin a and related compounds as therapeutic drugs for psychostimulant-related disorders." Current Drug Abuse Reviews 7(2): 128-132."'                                                               
`"Downey, L. A., et al. (2015). "Reduced memory skills and increased hair cortisol levels in recent Ecstasy/MDMA users: significant but independent neurocognitive and neurohormonal deficits." Human Psychopharmacology 30(3): 199-207."'     
`"Drabiak-Syed, K. (2011). "Reining In the pharmacological enhancement train: We Should remain vigilant about regulatory standards for prescribing controlled substances." Journal of Law, Medicine and Ethics 39(2): 272-279."'               
`"Drevets, W. C., et al. (2001). "Amphetamine-induced dopamine release in human ventral striatum correlates with euphoria." Biological Psychiatry 49(2): 81-96."'                                                                              
"D'Souza, M. S. and A. Markou (2010). Neural substrates of psychostimulant withdrawal-induced anhedonia. 3: 119-178."                                                                                                                           
`"Ducours, J. L., et al. (1989). "[Craniofacial injuries and cerebral tomographic scintigraphy using N-isopropyl-iodo-amphetamine 123 I-AMP]." Revue De Stomatologie Et De Chirurgie Maxillo-Faciale 90(2): 89-94."'                           
`"Duffau, B. E., et al. (2015). "High Performance Thin Layer Chromatography method for analysis of 3,4-methylenedioxymethamphetamine in seized tablets." Journal of Pharmacy and Pharmacognosy Research 3(6): 162-170."'                       
`"Dumont, G. J. H. and R. J. Verkes (2006). "A review of acute effects of 3,4-methylenedioxymethamphetamine in healthy volunteers." Journal Of Psychopharmacology (Oxford, England) 20(2): 176-187."'                                          
end

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str236 Var2
`"Alessi, S. M., et al. (2003). "The prediction of individual differences in response to D-amphetamine in healthy adults." Behavioural Pharmacology 14(1): 19-32."'                                                                            
`"Alexius, B., et al. (1995). "Involuntarily detained HIV-infected patients in Sweden: Reasons for referral and psychiatric diagnoses." Medicine and Law 14(7-8): 491-500."'                                                                   
`"Alibekov, I. M. (1997). "Treatment of neurosensory hypoacusis of antibiotic and non-antibiotic etiology with mydocalm and nootropil: comparison of efficacy." Vestnik otorinolaringologii(4): 20-23."'                                       
`"Alla, V., et al. (2010). "Aortic and coronary thrombosis following methamphetamine use and resuscitated SCD: Coincidence or shared pathogenesis?" Journal of General Internal Medicine 25: S493."'                                           
`"Allott, K. and J. Redman (2007). "Are there sex differences associated with the effects of ecstasy/3,4-methylenedioxymethamphetamine (MDMA)?" Neuroscience And Biobehavioral Reviews 31(3): 327-347."'                                       
`"Allott, K., et al. (2009). "Neuroendocrine and subjective responses to pharmacological challenge with citalopram: a controlled study in male and female ecstasy/MDMA users." Journal Of Psychopharmacology (Oxford, England) 23(7): 759-774."'
`"Dlugos, A., et al. (2007). "Norepinephrine transporter gene variation modulates acute response to D-amphetamine." Biological Psychiatry 61(11): 1296-1305."'                                                                                 
`"Docherty, J. R. (2008). "Pharmacology of stimulants prohibited by the World Anti-Doping Agency (WADA)." British Journal Of Pharmacology 154(3): 606-622."'                                                                                   
`"Donaldson, C. D., et al. (2016). "Nonmedical use of prescription stimulants in college students: Attitudes, intentions, and vested interest." Addictive Behaviors 53: 101-107."'                                                             
`"Dos Santos, R. G., et al. (2011). "Autonomic, neuroendocrine, and immunological effects of ayahuasca: a comparative study with d-amphetamine." Journal Of Clinical Psychopharmacology 31(6): 717-726."'                                      
`"dos Santos, R. G., et al. (2014). "Salvinorin a and related compounds as therapeutic drugs for psychostimulant-related disorders." Current Drug Abuse Reviews 7(2): 128-132."'                                                               
`"Downey, L. A., et al. (2015). "Reduced memory skills and increased hair cortisol levels in recent Ecstasy/MDMA users: significant but independent neurocognitive and neurohormonal deficits." Human Psychopharmacology 30(3): 199-207."'     
`"Drabiak-Syed, K. (2011). "Reining In the pharmacological enhancement train: We Should remain vigilant about regulatory standards for prescribing controlled substances." Journal of Law, Medicine and Ethics 39(2): 272-279."'               
`"Drevets, W. C., et al. (2001). "Amphetamine-induced dopamine release in human ventral striatum correlates with euphoria." Biological Psychiatry 49(2): 81-96."'                                                                               
"D'Souza, M. S. and A. Markou (2010). Neural substrates of psychostimulant withdrawal-induced anhedonia. 3: 119-178."                                                                                                                           
end

Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

22 Apr 2017, 06:32

After running the code from your first dataex to read in Var1 I ran

Code:

tempfile file1
save `file1'

After then running the code from your second dataex to read in Var2 I ran

Code:

rename Var2 Var1
merge 1:1 Var1 using `file1'
generate start = substr(Var1,1,20)
list start _merge, clean

which gave me the following report

Code:

                      start            _merge  
  1.   Alessi, S. M., et al   master only (1)  
  2.   Alexius, B., et al.        matched (3)  
  3.   Alibekov, I. M. (199       matched (3)  
  4.   Alla, V., et al. (20       matched (3)  
  5.   Allott, K. and J. Re       matched (3)  
  6.   Allott, K., et al. (       matched (3)  
  7.   D'Souza, M. S. and A       matched (3)  
  8.   Dlugos, A., et al. (   master only (1)  
  9.   Docherty, J. R. (200   master only (1)  
 10.   Donaldson, C. D., et   master only (1)  
 11.   Dos Santos, R. G., e       matched (3)  
 12.   Downey, L. A., et al       matched (3)  
 13.   Drabiak-Syed, K. (20       matched (3)  
 14.   Drevets, W. C., et a       matched (3)  
 15.   dos Santos, R. G., e       matched (3)  
 16.   Alsene, K. M., et al    using only (2)  
 17.   Ducours, J. L., et a    using only (2)  
 18.   Duffau, B. E., et al    using only (2)  
 19.   Dumont, G. J. H. and    using only (2)

Last edited by William Lisowski; 22 Apr 2017, 06:35.

Comment

Bob Green

Join Date: Oct 2015

Posts: 85
#3

22 Apr 2017, 16:02

William,

Many thanks.This looks simple and elegant. However, I became lost at your first step: "After running the code from your first dataex"

For ease I just named Var1, file1. I then tried to run the following code on two files,

version 14
cd "F:"
set type double
set more off
clear

use file1
tempfile file1

use Var2

rename Var2 Var1
merge 1:1 Var1 using `file1'
generate start = substr(Var1,1,20)
list start _merge, clean

Here is the output received:

use file1

. tempfile file1

.
.
. use Var2

.
. rename Var2 Var1

. merge 1:1 Var1 using `file1'
file C:\Users\Bob\AppData\Local\Temp\ST_00000001.tmp not found
r(601);

end of do-file

r(601);

I also tried, this without the temp file option

use file1
use Var2

merge 1:1 Var2 using `file1'
generate start = substr(Var2,1,20)
list start _merge, clean

use file1

. use Var2

.
. merge 1:1 Var2 using `file1'
invalid file specification
r(198);

end of do-file

r(198);

Any advice on what I am doing wrong would be helpful.In the real example, file 1 has 1509 rows and Var2 has 1469 (does substr(Var2,1,20) become substr(Var2, 1, 1469)?

Thanks again.

Bob
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29969
#4

22 Apr 2017, 16:49

When you issue the -tempfile file1- command, all that you are telling Stata to do is to assign the name file1 to a local macro that contains a name that could, eventually, be used for a temporary data file. It does not actually create the file. So later on when you try to -merge- with `file1', Stata finds no such file in existence. You have to -save `file1'- to actually create the file before you can make use of it.

But it's not clear that you actually need any intermediary temporary files. You can just merge the data directly

Code:

use first_file, clear // CONTAINS VAR1 rename var1 var2 merge 1:1 var2 using second_file // SECOND FILE HAS VARIABLE NAMED var2

I think William used a tempfile because he was starting not from an existing file but from the -dataex- output. So he had to create a file out of that. Notice that he does -save `file1'- immediately after declaring -tempfile file1-.
1 like
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#5

22 Apr 2017, 19:28

Clyde,

Thanks. I'm still struggling with this. Assuming first_file is file 1 above (Var1), why do I need to rename var1 to var2? If I merge using var1 renamed to var2, aren't I merging the same file, just renamed.

use first_file, clear // CONTAINS VAR1 rename var1 var2 merge 1:1 var2 using second_file // SECOND FILE HAS VARIABLE NAMED var2
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#6

23 Apr 2017, 04:34

I was finally able to perform the merge only after watching a Youtube video which employed the drop down menu. The key difference between the advice from William & Clyde above as well as what I have read, was that below I used the full path instead of just the file names.

use "D:\stata\Data\first_file.dta"
merge 1:1 _n using "D:\stata\Data\second_file.dta"
generate start = substr(var1,1,20)
list start _merge, clean

I don't know why I needed the full path, but it was the only way I could get the job done.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35468
#7

23 Apr 2017, 05:10

You would need the full path if you were working in a different directory or folder.

Code:

pwd

tells which you are working in if it is not evident from the GUI.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

23 Apr 2017, 05:33

Following up on Nick's post #7, in post #3 you tell us you have issued the command

Code:

cd "F:"

which sets your current (working) directory to F: and having done so,

Code:

use "first_file.dta"

is then shorthand for

Code:

use "F:\first_file.dta"

You needed

Code:

cd "D:\stata\Data"

in order for

Code:

use "first_file.dta"

to do what you wanted.

More to the point, your command

Code:

merge 1:1 _n using "D:\stata\Data\second_file.dta"

is not the command Clyde and I specified and did not give you the results you desire. Reading the output of

Code:

help merge

will inform you that your command paired the first observation in first_file with the first observation in second_file, regardless of whether Var1 in first_file was the same as Var2 in second_file. So if you were to actually look at the results you obtain from this code, you will see that you have both Var1 and Var2 and in general they are not equal to each other. All the _merge variable tells you is that it found a first observation in each file, a second observation in each file, and so forth.

My belief is that what you want to obtain eventually is a list of distinct citations from your Endnote data, with an indication for each of whether it came from the first search results in first_file or the second search results in second_file or from both. To use merge to accomplish this, the variables need to have the same name in both files, and you need to use them, not _n, as the key to be matched between the two files. The _merge variable which is automatically created will then tell you which file each appeared in.
1 like
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#9

23 Apr 2017, 15:06

Thanks Nick & William,

Due to frustration I created a new directory in D (to start afresh) but had been working from F. I agree it is confusing using the cd "F:" code, whch in some instances but not all may have caused problems.

Regarding the code "merge 1:1 _n" that William detected, this is a case of looking but not seeing the difference. However, when I tried the original code I'm back to errors:

use "D:\stata\Data\first_file.dta" , clear
merge 1:1 var1 using "D:\stata\Data\second_file.dta"
generate start = substr(var1,1,20)
list start _merge, clean

use "D:\stata\Data\first_file.dta"

. merge 1:1 var1 using "D:\stata\Data\second_file.dta"
variable var1 does not uniquely identify observations in the master data
r(459);

end of do-file

r(459);

. describe

Contains data from D:\stata\Data\first_file.dta
obs: 1,508
vars: 1 23 Apr 2017 18:06
size: 877,656
--------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------------------------------------------
var1 str582 %582s Original
--------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: var1

use second_file

. describe

Contains data from second_file.dta
obs: 1,469
vars: 1 23 Apr 2017 18:07
size: 854,958
--------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------------------------------------------
var1 str582 %582s New
--------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:

With "merge 1:1 _n" there were 39 references listed as "using only (2)" so I planned to check these manually to see what was detected. A quick check reveals _n detected the last 39 references.

I went back to William's & Clyde's posts and see they included a rename command, I tried this :

use "D:\stata\Data\first_file.dta", clear
rename var1 var2
merge 1:1 var2 using "D:\stata\Data\second_file.dta"
generate start = substr(var2,1,20)
list start _merge, clean

and receive this error - merge 1:1 var2 using "D:\stata\Data\second_file.dta"
variable var2 not found

There is something that I am not getting with all this. I re-read the help files but I still can't see why this doesn't work.

Thanks again
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29969
#10

23 Apr 2017, 16:29

In your example data, you the first file contained a variable called var1, and the second contained a variable called var2. Despite their different names, they represent the same things and you are trying to find which values occur in each data set separately, and which occur in both data sets. Stata is telling you that this description of affairs is not the reality of your data. It appears that your second data set does not contain a variable named var2. So the question is, what variable does it contain? (Note: Stata variable names are case sensitive, so if it contains VAR2, or Var2, etc., Stata will not find a match to var2.) The point is that to do a -merge- on data sets where you pair up on a common variable, the variable has to have the same name in both data sets. So you have to rename the variable in the first data set to have the same name as the corresponding variable in the second data set.

The other issue is that for a 1:1 merge to work, that same variable must uniquely identify observations in both data sets. So if that is not the case (and, apparently it is not based on the other error message you got), then you have to -duplicates drop var1- in the first data set, and perhaps -duplicates drop var2- in the second data set before trying to -merge- them.

Now let me anticipate a problem you may encounter with -duplicates drop ...- Stata may find that the observations that are duplicates on var1 may differ on other variables in the data set. (In your example data, there are no other variables, so this would not be an issue, but we've already established that your example data was an oversimplification.) If that is the case, Stata will not let you do -duplicates drop- because you would be losing data. So if you run into that problem, given the very limited purpose of this exercise, you can add the -force- option to -duplicates drop var1- and -duplicates drop var2-. I do not ordinarily recommend using -duplicates drop anything, force- because the loss of information can be problematic. But in this case, all we care about are the values of var1 and var2 themselves: for the present purpose every other variable is irrelevant. So losing the information about those other variables does no harm for this purpose. I'm emphasizing that so that you don't get into the habit of using the -force- option whenever you encounter a message that there is a problem. You should only use it in situations like this where a careful and thorough analysis leads to the conclusion that the loss of information is irrelevant.

So the overall approach assumes you have dataset1 with variable var1, and dataset2 with variable var2.

Code:

use dataset2, clear duplicates drop var2 // POSSIBLY WITH -force- OPTION tempfile two save `two' use dataset1, clear duplicates drop var1 // POSSIBLY WITH -force- OPTION rename var1 var2 merge 1:1 var2 using `two'

The -merge- variable will now by 1 for values that came only from dataset1, 2 for values that came only from datset2, and 3 for values that are found in both data sets.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

23 Apr 2017, 17:42

Clyde's done a yeoman's job on this in post #10.

I'd like to take a step back: depending on your objectives, using merge may be making more trouble for you than it's worth.

What is it you want from this process? In post #8 I speculated on what you wanted, but you haven't confirmed or denied that. Here are three possibilities:
A list of distinct citations (that is, at most one copy of each) from your Endnote queries, with an indication for each of whether it came from the first search results in first_file or the second search results in second_file or from both.

A list of distinct citations from your Endnote queries, and a count of how many times each citation occurs in each set of query results.

A list of distinct citations from your Endnote queries, and you don't care which query it came from, nor how many times it was found by each query. And perhaps you actually have a third and a fourth and a fifth Endnote query you haven't told us about yet, and you believe you can generalize what you get from this limited exercise to the larger problem.

In the first case, proceed along the lines Clyde indicated. In the second or third cases, tell us - because there are simpler approaches using the append command. Like the approach using merge, it will depend on having your Endnote query response variable being named identically in both datasets. I realize that seems unintuitive and confusing to you, but in another thread you established that you are an infrequent user of Stata whose expertise is limited to the tasks you need to do routinely, so your intuition in this new area is not to be trusted. If possible, you should consider rerunning the programs that created first_file and second_file from the Endnote query data, and change them to create new files with the variable name the same in both cases. If you want to be able to tell from the data which query the result came from, add

Code:

generate querynumber = 1

to the code for first_file and

Code:

generate querynumber = 2

to the code for the second file.

On another issue, to present data, code, and results readably, please copy them from the Results window or your log file into a code block in the Forum editor, as explained in the Statalist FAQ linked to at the top of the page. For example, the following:

[code]
// sample code
sysuse auto, clear
describe
[/code]

will be presented in the post as the following:

Code:

// sample code sysuse auto, clear describe

You will note that the syntax is just like what the dataex provided in the material used in your example in post #1.
1 like
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#12

23 Apr 2017, 21:52

William & Clyde,

Many thanks for your replies. I'll test out the code tonight. In relation to William's question, I want to ascertain what citations are in search 1 and not search 2; and vice versa. The number of discrepant citations is of value, however, it is the specific citations that I want to see.
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#13

24 Apr 2017, 07:52

Using one file, I trialed Clyde's suggestions to run drop duplicates.

Code:

clear use "D:\stata\Data\first_file - Copy.dta" duplicates drop var1, force

I had to use the force option, because I received a message: "varlist not allowed"

When

Code:

duplicates

is used without the

Code:

force

option, I read it it is possible to obtain the list of duplicates. When force is used is it possible to obtain a list of duplicates?

If this is possible, I'll attempt what William suggested: "If possible, you should consider rerunning the programs that created first_file and second_file from the Endnote query data, and change them to create new files with the variable name the same in both cases." Does this output below suggest otherwise?

Last edited by Bob Green; 24 Apr 2017, 07:55.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#14

24 Apr 2017, 08:38

When force is used is it possible to obtain a list of duplicates?

Just use

Code:

duplicates list var1

to obtain the list of observations with duplicated values of var1. The

Code:

force

option does not affect the determination of which values are duplicated. As the output of

Code:

help duplicates

tells you, the

Code:

force

option is required to drop observations when you specify a varlist, as you do with

Code:

duplicates drop var1

but the purpose of the option is only to remind you that you may be losing information, because variables other than var1 may differ for observations with the same value of var1. Other than that, the

Code:

force

option has no effect on the command.

Added: because your var1 is a str582 variable, you may find using

Code:

duplicates list var1, string(20)

to limit the display of var1 to the first 20 characters to be helpful.

Last edited by William Lisowski; 24 Apr 2017, 08:47.
1 like
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#15

24 Apr 2017, 15:57

Thanks William & Clyde,

The following code now runs without error

Code:

version 14 cd "D:\stata\Data\" set type double set more off clear use second_file, clear duplicates drop var1, force tempfile two save `two' describe use first_file, clear duplicates drop var1, force merge 1:1 var1 using `two' describe

The results:

. use first_file, clear

. duplicates drop var1, force

Duplicates in terms of var1

(0 observations are duplicates)

. * rename var1 var2
. merge 1:1 var1 using `two'

Result # of obs.
-----------------------------------------
not matched 24
from master 8 (_merge==1)
from using 16 (_merge==2)

matched 1,453 (_merge==3)
-----------------------------------------

. describe

Contains data from first_file.dta
obs: 1,477
vars: 2 24 Apr 2017 23:00
size: 861,091
--------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------------------------------------------
var1 str582 %582s
_merge byte %23.0g _merge
--------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.

.
end of do-file

I will now go back to the searches and double check.The issue of variables and files confounded me. Previously, I have done something similar in R, where I simply had the two variables in a single file and compared the two variables (via setdiff).

Thanks again.
Comment

Announcement

Identifying matches between two string variables - with that may not be in the same row

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment