Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding intersection of more than two macros

    Hi everyone,

    I am new to the forum so please excuse possible inaccuracies. I am construction a panel data set (time period: 2003-2016) out of 4 .dta files per year by merging them via a unique ID. Each of these files contains between 20 and 60 variables, which might be overlapping across data sets. For this purpose, I would like to program a loop that checks the intersections across data sets in every year and puts the variable names of these intersections into an excel file ideally (or at least give me an output in Stata). I thought of an approach similar to the one explained in an earlier post: http://www.statalist.org/forums/foru...ing-conflict-s

    My code would conclude in having:

    Code:
    global common: list `varlists_all'
    where the macro after lists contains the names of locals with an "&" in between. These locals, in turn, contain the names of variables I need to check.
    However, I now saw that macro lists is only able to process two locals or globals containing variable names. My "wrong approach" looks like this:

    Code:
    foreach i of global years {
                global files: dir "C:\Users\Carlo\Dropbox\Master\4. Semester\Masterarbeit\Data and statistical work\raw_data\DHS wave `i'" files "*.dta"
                cd "C:\Users\Carlo\Dropbox\Master\4. Semester\Masterarbeit\Data and statistical work\raw_data\DHS wave `i'"
    
                * Drop duplicates if present and alo drop variables that are not needed and that are conflicting in some datasets (see excel worshet for that)
                local varlists_all
                foreach file of global files {
                        use "`file'" 
                        duplicates drop
                        qui ds
                    
                    *copy all variable names of current file j into a macro
                    if  substr("`file'",1,3)!="wei" {
                            local suffix=substr("`file'",1,3)
                            local vlist_`suffix'_`i'=r(varlist)
                            local varlists_all `varlists_all' vlist_`suffix'_`i' &
                            di "`varlists_all'"    
                            di "end of file loop" 
                            
                    }
                    clear
                
                 }
                    * Check in the global common which variables are common to the data set
                    * Eliminate last "&"- sign in the list
                    local dropper &_todrop
                    local varlists_all `varlists_all'_todrop
                    local varlists_all: list varlists_all-dropper
                    di "`varlists_all'"    
                    
                    * Problem: one can only check two files at once!!
                    di "global common: list `varlists_all'"
                    * HERE is the Problem:
                    *global common: list `varlists_all'
                    
                    * Extract macro into excel sheet
                    *cd "$maindirectory"
                    *putexcel A1="Variable" A2="$common" using "20170601_conflicting_variables, sheet("`i'") modify    
                    *cd "C:\Users\Carlo\Dropbox\Master\4. Semester\Masterarbeit\Data and statistical work\raw_data\DHS wave `i'"            
            
        
        }
    There is probably a much easier approach to achieve this, so if you have any idea, I would be very grateful for help!

    Thank you very much in advance.

    Best regards,

    Carlo

  • #2
    Your code has several levels of macros nested within macros, which makes it somewhat difficult to follow for somebody who isn't really clear where all of this is going. So let me abstract away from that and assume that you have a bunch of local macros, each of which is a list of variables in a data set. Each local macro's name reflects the name of the data set, and you want to end up with a local macro containing the names of variables that are common to all of those data sets.

    So specifically, we have a bunch of suffixes that come from the filenames, and a bunch of years. I will assume that there is a local macro, called suffixes, that contains all the suffixes, and another local macro years, which contains all the years. Then you can do this at the end:

    Code:
    local first_list vlist_`:word 1 of `suffixes''_`:word 1 of `years''
    local common ``first_list''
    foreach s of local suffixes {
        foreach y of local years {
            local common: list common & vlist_`s'_`y'
        }
    }
    Note: this is completely untested, and this kind of code is particularly vulnerable to typos, not to mention that macros within macros can be confusing even without typos. But I think it has the gist of it. You need to initialize common with the contents of one of the variable lists. Then you have to iterate over all of the suffixes and years and intersect common with the next name of the variable list.

    I think this is more or less in line with the structure of your code. I hope you can adapt it without too much difficulty.

    Added: Note, by the way, that I have created common as a local macro, not a global. I will spare you my long rant about why global macros are dangerous and should almost never be used. Suffice it to say that without a compelling reason to use a global macro here, I would always use a local macro.
    Last edited by Clyde Schechter; 01 Jun 2017, 08:22.

    Comment


    • #3
      Thank you very much, that was very valuable advice!!!

      Comment


      • #4
        I found a much simpler way that solves the issue of checking several files for common variables using the command cfvars. It provides you with a log file containing the names of common variables across all file combinations within a folder across all years. Following Clyde's advice, one might replace global macros by local macros!
        Code:
        global years 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016    
        
        log using "yourpathname"\logfile_conflicts", replace
            foreach i of global years {
                global files: dir "yourpathname\DHS wave `i'" files "*.dta"
                cd "yourpathname\DHS wave `i'"
                
                foreach firstfile of global files {
                
                    foreach secondfile of global files {
            
                        if "`firstfile'"!="`secondfile'" {
                            qui cfvars `firstfile' `secondfile'
                            di ""
                            di "Variables present in `firstfile' and `secondfile': `r(both)' "                
                            di ""
                        }
                    }
                
                }
            
            
            }
            log close
        Last edited by Carlo Parafioriti; 02 Jun 2017, 07:02.

        Comment


        • #5
          Here's a solution that requires filelist (from SSC). To test it, here's some code that will create 14 datasets in a subdirectory called "datasets" within Stata's current directory. Each dataset has a varying number of common variables and at least one unique variable (year`i'):

          Code:
          clear
          set seed 32134
          cap mkdir datasets
          forvalues i = 2003/2016 {
              sysuse auto, clear
              gen year`i' = `i'
              foreach v of varlist * {
                  if uniform() < .5 drop `v'
              }
              forvalues j = 1/200 {
                  if uniform() < .5 gen data_of_interest_`j' = runiform()
              }
              save "datasets/DHS wave `i'.dta", replace
          }
          With filelist, you create a dataset with one observation per file. You loop over each observation and collect the variables in each dataset using describe. Next you split the string variable that contains the list of variables per dataset and reshape long. Then it's just a matter of sorting the observations per variable name to find out where they appear in the datasets. If there is only one observation per group, then the variable occurs in only one dataset.

          Code:
          * make a list of Stata datasets within the "datasets" dir
          filelist , dir("datasets") pattern("*.dta")
          
          * loop over the list to collect variable names
          gen vlist = ""
          local n = _N
          forvalues i=1/`n' {
              local f = dirname[`i'] + "/" + filename[`i']
              qui describe using "`f'",  varlist
              replace vlist = "`r(varlist)'" in `i'
          }
          
          * split the variable list and reshape to long
          split vlist, gen(var)
          drop vlist
          reshape long var, i(dirname filename fsize)
          drop if mi(var)
          
          * group by var to identify common dataset sources
          bysort var (dirname filename): gen N = _N
          tab N
          
          list if N == 1, sep(0)
          and the final listing:
          Code:
          . tab N
          
                    N |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    1 |          7        0.47        0.47
                    2 |          2        0.13        0.61
                    3 |          9        0.61        1.21
                    4 |         32        2.16        3.37
                    5 |        175       11.79       15.16
                    6 |        264       17.79       32.95
                    7 |        280       18.87       51.82
                    8 |        320       21.56       73.38
                    9 |        216       14.56       87.94
                   10 |        100        6.74       94.68
                   11 |         55        3.71       98.38
                   12 |         24        1.62      100.00
          ------------+-----------------------------------
                Total |      1,484      100.00
          
          . 
          . list if N == 1, sep(0)
          
                +------------------------------------------------------------+
                | dirname    filename              fsize   _j        var   N |
                |------------------------------------------------------------|
          1478. | datasets   DHS wave 2006.dta   101,801    5   year2006   1 |
          1479. | datasets   DHS wave 2007.dta   100,817    8   year2007   1 |
          1480. | datasets   DHS wave 2008.dta   104,231    9   year2008   1 |
          1481. | datasets   DHS wave 2009.dta   102,803    7   year2009   1 |
          1482. | datasets   DHS wave 2010.dta    97,875    8   year2010   1 |
          1483. | datasets   DHS wave 2011.dta    99,747    7   year2011   1 |
          1484. | datasets   DHS wave 2015.dta   100,569    6   year2015   1 |
                +------------------------------------------------------------+

          Comment

          Working...
          X