Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trying to create a table of descriptive stats with both numeric and categorical variables, need advice

    Hi, I'm new to data management / validation in Stata and am coming from mostly SQL/SAS.

    Trying to do:

    I've been asked to create a table to help validate a set of datasets against a provided dictionary, so I want descriptions of variables in the datasets.
    The desired table headers are:
    folder filename position field datatype isnumeric countmissing countzero totalobs valmin valmax charmin charmax datamin datemax

    Folder and filename are path related, as I have folders for several regions, each with a set of files named the same for each region. These are here to tell me which file the stats are from.
    Field, datatype, isnumeric, and position are outputs from using -describe-. I wanted to list varnames so that I can later compare against the datatypes from the dictionary.
    Countmissing and countzero are what they sound like, though zero would apply only to fields with isnumeric flags of 1.
    Totalobs is total nonmissing observations in that dataset for given folder/filename/field.
    Valmin and valmax are min and max of isnumeric flagged fields, missing for isnumeric==0 fields.
    Charmin and charmax are the min and max lengths of the string variables, missing for nonstring fields.
    Datemin and datemax are the min and max date values for date variables, missing for nondate fields.

    What I've done:

    I've looped over all the folders and files using local macros of folders and filenames saving - describe, replace clear - results as datasets, adding generated folder and filename columns along the way, then appending each of these to one main dataset of overall -describe- results.

    I created a prototype Excel sheet of what I want the final table to contain, but as I've had difficulty pulling all this info together in my do file, I populated the descriptive stats manually with results from interactively using various commands (min, max, strlen, count var if == 0, count if missing()). I did this only for one region/filename, but this needs to be automated and run on the rest of and future versions of the datasets.

    Haven't done:

    Used user created ado files. I'm open to any suggestions here, I've been trying to keep the code to just what comes with Stata 15 but at this point I just need results.

    What I'm looking for:

    General thoughts on how to better approach what I'm after in a Stata way, I'm willing to go any other direction because I'm getting nowhere fast. I'm looking at tabstat, but it doesn't seem to provide descriptions of nonnumeric data. I'm not stuck on using the results from -describe- as I have, if there's a better way to populate that information into a dataset/table I'm all for it.

    Ideally I'd like to be able to loop through the folders and files, populate the described summary table from above, appending results along the way into one main dataset. I think there's got to be some simple way to do this that is escaping me and my google/documentation searches.

    Is there a way to say "output a dataset where field = varname, position = columnnumber(varname), datatype = datatype(varname), isnumeric = if(datatype(varname) like str,0,1), valmin = if(datatype(varname) like str,".",min(varname))" etc?

    Any advice is welcome, including asking for more information or simply advice on how best to get any part of what I'm looking for. System memory generally shouldn't be a concern for this.

    Running Stata 15 on Windows 10, though the code shouldn't be too specific to Windows environment in case we pass it on to others on Linux.

    Thanks,

    Wade Petty

  • #2
    I can get you started. There are a lot of complications that can potentially arise in the greatest generality, and I don't have the time to nail them all down. The following code will work in most circumstances:

    Code:
    clear*
    filelist, pattern(*.dta)
    
    capture program drop one_file
    program define one_file
        local dirname = dirname[1]
        local filename = filename[1]
        use `"`dirname'/`filename'"', clear
        describe, replace
        gen dirname = `"`dirname'"'
        gen filename = `"`filename'"'
        exit
    end
    
    runby one_file, by(dirname filename) verbose
    
    capture program drop supplementary
    program define supplementary
        tempfile copy
        save `copy'
        local dirname = dirname[1]
        local filename = filename[1]
        local varname = name[1]
        use `varname' using `"`dirname'/`filename'"', clear
        gen dirname = `"`dirname'"'
        gen filename = `"`filename'"'
        if inlist(`"`varname'"', "name", "dirname", "filename", "isnumeric", ///
            "countmissing" "countzero", "totalobs", "valmin", "valmax") ///
            | inlist(`"`varname'"', "charmin", "charmax") {
            rename `varname' varname
            gen name = "`varname'"
            local varname varname
        }
        else {
            gen name = `"`varname'"'
        }
        count if missing(`varname')
        gen countmissing = r(N)
        count if !missing(`varname')
        gen totalobs = r(N)
        capture confirm numeric var `varname'
        if c(rc) == 0 {
            gen isnumeric =  1
            count if `varname' == 0
            gen countzero = r(N)
            summ `varname', meanonly
            gen valmin = `r(min)'
            gen valmax = `r(max)'
            gen charmin = .
            gen charmax = .
        }
        else {
            gen isnumeric = 0
            gen countzero = .
            gen valmin = .
            gen valmax = .
            gen length = length(`varname')
            summ length, meanonly
            gen charmin = `r(min)'
            gen charmax = `r(max)'
        }
        keep dirname filename name isnumeric countmissing countzero totalobs ///
            valmin valmax charmin charmax
        keep in 1
        merge 1:1 dirname filename name using `copy', assert(match) nogenerate
        exit
    end
    
    runby supplementary, by(dirname filename name) verbose
    As you can see, there are potential complications here if one of the data sets has a variable whose name is the same as one of the variables you are attempting to create here. I danced around that a bit in the code. Another pitfall here is that if you have any files that don't actually contain any observations, you will generate an error here (although the code will keep running and just skip over that situation). I think the code may also break if you have a numeric variable with all of its values missing. Again, it will just skip over that case and generate no output for it. As, I say, it would be a large undertaking to test this in full generality for all the exceptional conditions that can arise.

    Notice, also, that I do not deal with date variables. Stata doesn't actually have a separate date variable type. In Stata dates are just numeric variables. There is no general way to identify which variables are intended to be dates and which are not. In general, such variables will have a %td or %tm or %tc, etc. display format. And you could look for that. But people are not always consistent in doing this, and I have also seen some data sets where a variable that is not supposed to be a date nevertheless had date display formatting applied. (Error? Something unusual in mind?) So the code above just treats date variables as the numeric variables they are.

    To use this code you must install the -runby- program, by Robert Picard and me, from SSC. The rest of the code is native Stata. -runby- is a way to loop over values of variables more efficiently (in large data sets) and more simply than using nested -foreach- loops.

    Added: You also need to get -filelist- from SSC; this one is also by Robert Picard.
    Last edited by Clyde Schechter; 04 Apr 2018, 11:09.

    Comment


    • #3
      Clyde,

      Thanks so much for your detailed response! I'll work on this today and let you know how things turn out.

      Comment


      • #4
        Clyde, your code worked very well and is much appreciated.

        I did run into a numeric variable that had no observations and the log was showing 'invalid syntax' because of attempting to use `r(min)' when there were no observations for the summ. I accounted for that by taking regular summ and conditioning on `r(N)' as follows:

        Code:
           
                if c(rc) == 0 {  //numeric variables
                gen isnumeric =  1
                count if `varname' == 0
                gen countzero = r(N)
                summ `varname'
                if `r(N)'==0 {  //case where a numeric variable is completely missing
                    gen valmin = .
                    gen valmax = .
                }
                else {
                    gen valmin = `r(min)'
                    gen valmax = `r(max)'
                }
                gen charmin = .
                gen charmax = .
            }
        I'm guessing using the -summ `varname', meanonly- is to save time, but I needed a simple way to test for all missing on numeric variables.

        Comment


        • #5
          The -meanonly- option is, indeed, used to speed things up. But it'svery strangely named. It sounds like it only calculates the mean. But, in fact, it calculates all of the statistics you ordinarily get with -summarize- except the standard deviation. The standard deviation takes extra time because it requires two passes through the data and also taking a square root. But you can still use -meanonly- when you need the min or max or N: they come along despite the name!

          Comment


          • #6
            Figured I would add the final code I've been running to attempt to deal with the dates in case someone comes along later needing something similar. There's also a statement that tries to handle whitespace as some of the observations seem to be a single or several spaces and I'm converting them to missing here. On initial tests of a handful of datasets, the code appears to be producing the table I want. The only issue now is that it's running through maybe 2000 datasets, some of them are fairly large, and the program is taking literally days to run. In looking at this version, are there any glaring efficiency gains you can see that I could gain by doing anything differently? Thanks for any thoughts on this.

            Code:
            //get list of all datasets to go through
            filelist, dir(`sourceDir') pattern(*.dta)
            //drop merged or m datasets created after the fact for analysis
            drop if strpos(filename,"_m")>0 | strpos(filename,"merged")>0 | strpos(filename,"clean")>0
            
            //create program to get -describe- info, and convert dates where named date
            capture program drop one_file
            program define one_file
                local dirname = dirname[1]
                local filename = filename[1]
                use `"`dirname'/`filename'"', clear
                foreach var of varlist _all {
                    capture confirm numeric var `var'
                    if c(rc)>0 {
                        if substr("`var'",-4,4)=="date" {
                            gen `var'datevar = date(`var',"YMD")
                            format `var'datevar %td
                            drop `var'
                            rename `var'datevar `var'
                        }
                    }
                }
                describe, replace clear
                gen dirname = `"`dirname'"'
                gen filename = `"`filename'"'
                exit
            end
            
            //runby will run the program agains each dataset in the list from above
            runby one_file, by(dirname filename) verbose
            
            timer list 1
            
            //create program to get other descriptors and merge with -describe- info
            capture program drop supplementary
            program define supplementary
                tempfile copy
                save `copy'
                local dirname = dirname[1]
                local filename = filename[1]
                local varname = name[1]
                use `"`dirname'/`filename'"', clear
                foreach var of varlist _all {
                    capture confirm numeric var `var'
                    if c(rc)>0 {
                        if substr("`var'",-4,4)=="date" {
                            gen `var'datevar = date(`var',"YMD")
                            format `var'datevar %td
                            drop `var'
                            rename `var'datevar `var'
                        }
                    }
                }
                tempfile withdates
                save `withdates'
                use `varname' using `withdates', clear
                gen dirname = `"`dirname'"'
                gen filename = `"`filename'"'
                if inlist(`"`varname'"', "name", "dirname", "filename", "isnumeric", ///
                    "countmissing" "countzero", "totalobs", "valmin", "valmax") ///
                    | inlist(`"`varname'"', "charmin", "charmax", "datemin", "datemax") {
                    rename `varname' varname
                    gen name = "`varname'"
                    local varname varname
                }
                else {
                    gen name = `"`varname'"'
                }
                display dirname  //so I can see how far along the progress is
                display filename
                display name
                display "`varname'"
                capture confirm numeric var `varname'
                if c(rc)> 0 {
                    replace `varname' = "" if trim(`varname') == ""
                }
                count if missing(`varname')
                gen countmissing = r(N)
                count if !missing(`varname')
                gen totalobs = r(N)
                capture confirm numeric var `varname'
                if c(rc) == 0 {
                    gen isnumeric =  1
                    count if `varname' == 0
                    gen countzero = r(N)
                    summ `varname', meanonly
                    if `r(N)'==0 {
                        gen valmin = .
                        gen valmax = .
                    }
                    else {
                        gen valmin = `r(min)'
                        gen valmax = `r(max)'
                    }
                    gen charmin = .
                    gen charmax = .
                    if substr("`varname'",-4,4)=="date" & `r(N)' > 0 {
                        gen datemin = `r(min)'
                        //format datemin %td
                        gen datemax = `r(max)'
                        //format datemax %td
                    }
                    else {
                        gen datemin = .
                        gen datemax = .
                    }
                }
                else {
                    gen isnumeric = 0
                    gen countzero = .
                    gen valmin = .
                    gen valmax = .
                    gen length = length(`varname')
                    summ length, meanonly
                    gen charmin = `r(min)'
                    gen charmax = `r(max)'
                    gen datemin = .
                    gen datemax = .        
                }
                keep dirname filename name isnumeric countmissing countzero totalobs ///
                    valmin valmax charmin charmax datemin datemax
                keep in 1
                merge 1:1 dirname filename name using `copy', assert(match) nogenerate
                exit
            end
            
            runby supplementary, by(dirname filename name) verbose
            
            timer list 1
            
            order position, before(name)
            order type, after(name)
            order format, after(type)
            drop varlab
            drop vallab
            format datemin %td
            format datemax %td

            Comment

            Working...
            X