Checking whether two variables share characteristics across datasets

Valentine Laurent

Join Date: Oct 2023

Posts: 7
#1

11 Oct 2023, 06:21

Hello Statalisters,

I am looking for a particular program (let's call it searchvar) that may help resolve some data management issues. Basically what I would like to do is something like:

searchvar (varname) using "filename",

and Stata would search for a variable called varname in both the dataset in memory and the using dataset and report whether the two are the same or whether they differ in terms of metadata (# of categories, values, format, labels...).

I would also like to do the opposite, i.e., find variables that may not share the same name but have the exact same characteristics. For example, if two variables in two datasets, say abc1 and xyz2, both have the same categories 1 = "Apple", 2 = "Orange", then Stata would list abc1 and xyz2.

I don't know if I am making sense, but if such a program exists, it would save me countless hours of work. If it doesn't exist, do you think such a program could be easily coded? What would be the potential challenges for such a program?

Best regards,
daniel klein

Join Date: Mar 2014

Posts: 3850
#2

11 Oct 2023, 11:09

Originally posted by Valentine Laurent

I am looking for a particular program [...] that [...] would search for a variable called varname in both the dataset in memory and the using dataset and report whether the two are the same or whether they differ in terms of metadata (# of categories, values, format, labels...).

I would also like to do the opposite, i.e., find variables that may not share the same name but have the exact same characteristics. For example, if two variables in two datasets, say abc1 and xyz2, both have the same categories 1 = "Apple", 2 = "Orange", then Stata would list abc1 and xyz2.

If it doesn't exist, do you think such a program could be easily coded?

Such a program probably does not exist yet. Similar programs, e.g., cf or compare, typically focus on the data, i.e., values and their distribution, not on metadata. Some programs, e.g., merge and append, do report mismatching storage types.

Could such a program be written? Most certainly yes. Easily? That depends on various factors, fluency in Stata (and perhaps Mata) being one of them.

Originally posted by Valentine Laurent

What could be potential challenges to such a program?

First and foremost, you need a clear and exhaustive definition of what exactly you want the program to do. The description in #1 is not even close.

daniel klein

Join Date: Mar 2014

Posts: 3850
#3

11 Oct 2023, 13:22

Here is a very basic program that compares the metadata (storage type, display format, value label name, and variable label) of two variables with the same name:

Code:

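program compare_variable

    version 18

    // one variable in memory; `using' holds "using filename"
    syntax varname using

    // the original data are restored when the program ends
    preserve

    // replace the data with a description of the variable in memory
    describe `varlist' , replace clear

    list name type format vallab varlab , noobs clean

    // load one observation of the same variable from the using dataset
    quietly use `varlist' in 1 `using' , clear

    describe `varlist' , replace clear

    // no header, so the row lines up under the first listing
    list name type format vallab varlab , noobs noheader clean

end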

Here is an example:

Code:

. webuse autom
(1978 automobile data)

. compare_variable foreign using https://www.stata-press.com/data/r18/auto

       name   type   format   vallab     varlab  
    foreign   byte   %22.0g   origin   Car type  
    foreign   byte    %8.0g   origin   Car origin


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

11 Oct 2023, 13:50

Although not a direct solution to O.P.'s problem, I will call attention to Mark Chatfield's -precombine- command, available at SSC. It does part of what is asked, although it will not single out individual variables but rather will do this kind of comparison for all variables in a list of data sets. It goes deeper than daniel klein's nice solution in #3. For instance, -precombine- will not only determine whether the value labels have the same name, but it will also check whether they contain the same labeling information. That can be critical because attempting to apply the same code to two variables that have the same values labeled differently can lead to chaos.

So, part of what O.P. wants might be obtained by creating new data sets containing only the variables she wants to compare (so she doesn't drown in irrelevant output about other variables) and applying -precombine- to them. But even this is only a partial solution. It does not address the "opposite action" part of her question.
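For a single variable, the check on label contents described above could be sketched along the following lines. This is only a minimal sketch, not -precombine-'s implementation: the program name compare_vallab, the returned scalar r(n_diff), and the helper variable found_in are hypothetical names made up for this illustration, and the code assumes that the variable has a defined value label attached in both datasets.

Code:

```stata
* Hypothetical helper (not from the thread): compare the value-to-text
* mappings of the value label attached to one variable in the dataset
* in memory and in a using dataset.
capture program drop compare_vallab
program compare_vallab , rclass

    version 18

    syntax varname using/

    // the original data are restored when the program ends
    preserve

    // mappings of the value label attached in the dataset in memory
    local lblname : value label `varlist'
    if ("`lblname'" == "") {
        display as error "`varlist' has no value label in memory"
        exit 198
    }
    uselabel `lblname' , clear
    keep value label
    tempfile inmemory
    quietly save `inmemory'

    // mappings of the value label attached in the using dataset
    quietly use `varlist' in 1 using `"`using'"' , clear
    local lblname : value label `varlist'
    if ("`lblname'" == "") {
        display as error "`varlist' has no value label in using dataset"
        exit 198
    }
    uselabel `lblname' , clear
    keep value label

    // a mapping that exists in only one dataset ends up with _merge != 3
    quietly merge 1:1 value label using `inmemory'
    quietly count if _merge != 3
    local n_diff = r(N)

    if (`n_diff') {
        // at this point, the merge "master" holds the using dataset's labels
        generate found_in = cond(_merge == 1, "using", "memory")
        list value label found_in if _merge != 3 , noobs clean
    }
    else {
        display as text "value label contents of `varlist' are identical"
    }

    // number of mappings found in only one of the two datasets
    return scalar n_diff = `n_diff'

end
```

Under these assumptions, a call such as compare_vallab foreign using "filename" would list every value-to-text mapping that appears in only one of the two datasets. It does not compare label names; that is what the program in #3 already shows.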
