  • Checking whether two variables share characteristics across datasets

    Hello Statalisters,

    I am looking for a particular program (let's call it searchvar) that may help resolve some data management issues. Basically what I would like to do is something like:

    searchvar (varname) using "filename",

    and Stata would search for a variable called varname in both the dataset in memory and the using dataset and report whether the two are the same or differ in their metadata (number of categories, values, format, labels, ...).

    I would also like to do the opposite, i.e., find variables that may not share the same name but have exactly the same characteristics. For example, if two variables in two datasets, say abc1 and xyz2, both have the categories 1 = "Apple", 2 = "Orange", then Stata would list abc1 and xyz2.

    I don't know if I am making sense, but if such a program exists, it would save me countless hours of work. If it doesn't exist, do you think such a program could be easily coded? What could be potential challenges to such a program?

    Best regards,

  • #2
    Originally posted by Valentine Laurent
    I am looking for a particular program [...] that [...] would search for a variable called varname in both the dataset in memory and the using dataset and report whether the two are the same or differ in their metadata (number of categories, values, format, labels, ...).

    I would also like to do the opposite, i.e., find variables that may not share the same name but have exactly the same characteristics. For example, if two variables in two datasets, say abc1 and xyz2, both have the categories 1 = "Apple", 2 = "Orange", then Stata would list abc1 and xyz2.

    If it doesn't exist, do you think such a program could be easily coded?
    Such a program probably does not exist yet. Similar programs, e.g., cf or compare, typically focus on the data, i.e., values and their distribution, not on metadata. Some programs, e.g., merge and append, do report mismatching storage types.

    Could such a program be written? Most certainly yes. Easily? That depends on various factors, fluency in Stata (and perhaps Mata) being one of them.


    Originally posted by Valentine Laurent
    What could be potential challenges to such a program?
    First and foremost, you need a clear and exhaustive definition of what exactly you want the program to do. The description in #1 is not even close.



    • #3
      Here is a very basic program that compares two variables with the same name:

      Code:
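      program compare_variable
          
          version 18
          
          // one variable in memory, one dataset on disk
          syntax varname using
          
          // restore the data in memory when the program ends
          preserve
          
          // replace the data with a description of the variable in memory
          describe `varlist' , replace clear
          
          list name type format vallab varlab , noobs clean
          
          // load the same variable (first observation only) from the using dataset
          quietly use `varlist' in 1 `using' , clear
          
          describe `varlist' , replace clear
          
          // second row of the comparison; header suppressed
          list name type format vallab varlab , noobs noheader clean
          
      end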

      Here is an example:

      Code:
      . webuse autom
      (1978 automobile data)
      
      . compare_variable foreign using https://www.stata-press.com/data/r18/auto
      
             name   type   format   vallab     varlab  
          foreign   byte   %22.0g   origin   Car type  
          foreign   byte    %8.0g   origin   Car origin



      • #4
        Although not a direct solution to O.P.'s problem, I will call attention to Mark Chatfield's -precombine- command, available at SSC. It does part of what is asked, although it will not single out individual variables but rather will do this kind of comparison for all variables in a list of data sets. It goes deeper than daniel klein's nice solution in #3. For instance, -precombine- will not only determine whether the value labels have the same name, but it will also check whether they contain the same labeling information. That can be critical because attempting to apply the same code to two variables that have the same values labeled differently can lead to chaos.
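
To illustrate what checking the labeling information itself involves for a single variable, here is a rough, untested sketch. It is not how -precombine- works internally. The program name compare_vallab and the returned r(n_diff) are made up for this illustration. It assumes that the variable carries a defined value label in both datasets and relies on -uselabel- to turn a value label into a dataset with the variables value and label.

```stata
program compare_vallab , rclass

    version 18

    syntax varname using

    preserve

    // value label attached to the variable in memory
    local lbl1 : value label `varlist'
    if ("`lbl1'" == "") {
        display as error "`varlist' has no value label in memory"
        exit 459
    }

    // one row per labeled value: value, label
    quietly uselabel `lbl1' , clear
    keep value label
    rename label label_master
    tempfile master
    quietly save `master'

    // same steps for the using dataset
    quietly use `varlist' in 1 `using' , clear
    local lbl2 : value label `varlist'
    if ("`lbl2'" == "") {
        display as error "`varlist' has no value label in the using dataset"
        exit 459
    }
    quietly uselabel `lbl2' , clear
    keep value label
    rename label label_using

    // keep values whose text differs or that are labeled in only one dataset
    quietly merge 1:1 value using `master' , nogenerate
    quietly keep if label_master != label_using

    return scalar n_diff = _N

    if (_N == 0) {
        display as text "`lbl1' and `lbl2' define the same mapping"
    }
    else {
        list value label_master label_using , noobs clean
    }

end
```

Called as compare_vallab foreign using filename, this lists only the values on which the two labels disagree. A value that is labeled in just one of the datasets shows up with an empty text on the other side.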

        So, part of what O.P. wants might be obtained by creating new data sets containing only the variables she wants to compare (so she doesn't drown in irrelevant output about other variables) and applying -precombine- to them. But even this is only a partial solution. It does not address the "opposite action" part of her question.
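
For the "opposite action", one possible approach is to treat the value-label mapping of a variable in memory as the reference and look for value labels in the using dataset that define exactly the same mapping, whatever their names. The following is an untested sketch under that narrow definition of "same characteristics" (storage type, format, and variable label are ignored). The program name find_same_mapping and the returned r(labels) are hypothetical.

```stata
program find_same_mapping , rclass

    version 18

    syntax varname using

    preserve

    local lbl : value label `varlist'
    if ("`lbl'" == "") {
        display as error "`varlist' has no value label"
        exit 459
    }

    // reference mapping: one row per labeled value
    quietly uselabel `lbl' , clear
    keep value label
    rename label label_ref
    local n_ref = _N
    tempfile ref
    quietly save `ref'

    // all value labels stored in the using dataset: lname, value, label
    quietly uselabel `using' , clear
    quietly merge m:1 value using `ref' , keep(master match)
    generate byte same = (_merge == 3) & (label == label_ref)

    // a label matches if it has as many rows as the reference and all agree
    bysort lname : generate n_rows = _N
    bysort lname : egen n_same = total(same)
    quietly keep if n_rows == `n_ref' & n_same == `n_ref'
    quietly levelsof lname , local(matches) clean

    // list the variables in the using dataset that carry a matching label
    quietly use in 1 `using' , clear
    foreach l of local matches {
        quietly ds , has(vallabel `l')
        display as text "`l': " as result "`r(varlist)'"
    }

    return local labels `matches'

end
```

With abc1 in memory, find_same_mapping abc1 using filename would then display the matching label names, each followed by the variables in the using dataset that have it attached, e.g., xyz2.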
