Trying to create a table of descriptive stats with both numeric and categorical variables, need advice

wade petty

Join Date: Mar 2018

Posts: 5
#1

Trying to create a table of descriptive stats with both numeric and categorical variables, need advice

04 Apr 2018, 08:50

Hi, I'm new to data management / validation in Stata and am coming from mostly SQL/SAS.

Trying to do:

I've been asked to create a table to help validate a set of datasets against a provided dictionary, so I want descriptions of variables in the datasets.
The desired table headers are:
folder filename position field datatype isnumeric countmissing countzero totalobs valmin valmax charmin charmax datamin datemax

Folder and filename are path related, as I have folders for several regions, each with a set of files named the same for each region. These are here to tell me which file the stats are from.
Field, datatype, isnumeric, and position are outputs from using -describe-. I wanted to list varnames so that I can later compare against the datatypes from the dictionary.
Countmissing and countzero are what they sound like, though zero would apply only to fields with isnumeric flags of 1.
Totalobs is total nonmissing observations in that dataset for given folder/filename/field.
Valmin and valmax are min and max of isnumeric flagged fields, missing for isnumeric==0 fields.
Charmin and charmax are the min and max lengths of the string variables, missing for nonstring fields.
Datemin and datemax are the min and max date values for date variables, missing for nondate fields.

What I've done:

I've looped over all the folders and files using local macros of folders and filenames saving - describe, replace clear - results as datasets, adding generated folder and filename columns along the way, then appending each of these to one main dataset of overall -describe- results.

I created a prototype Excel sheet of what I want the final table to contain, but as I've had difficulty pulling all this info together in my do file, I populated the descriptive stats manually with results from interactively using various commands (min, max, strlen, count var if == 0, count if missing()). I did this only for one region/filename, but this needs to be automated and run on the rest of and future versions of the datasets.

Haven't done:

Used user created ado files. I'm open to any suggestions here, I've been trying to keep the code to just what comes with Stata 15 but at this point I just need results.

What I'm looking for:

General thoughts on how to better approach what I'm after in a Stata way, I'm willing to go any other direction because I'm getting nowhere fast. I'm looking at tabstat, but it doesn't seem to provide descriptions of nonnumeric data. I'm not stuck on using the results from -describe- as I have, if there's a better way to populate that information into a dataset/table I'm all for it.

Ideally I'd like to be able to loop through the folders and files, populate the described summary table from above, appending results along the way into one main dataset. I think there's got to be some simple way to do this that is escaping me and my google/documentation searches.

Is there a way to say "output a dataset where field = varname, position = columnnumber(varname), datatype = datatype(varname), isnumeric = if(datatype(varname) like str,0,1), valmin = if(datatype(varname) like str,".",min(varname))" etc?

Any advice is welcome, including asking for more information or simply advice on how best to get any part of what I'm looking for. System memory generally shouldn't be a concern for this.

Running Stata 15 on Windows 10, though the code shouldn't be too specific to Windows environment in case we pass it on to others on Linux.

Thanks,

Wade Petty
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30085
#2

04 Apr 2018, 10:45

I can get you started. There are a lot of complications that can potentially arise in the greatest generality, and I don't have the time to nail them all down. The following code will work in most circumstances:

Code:

clear* filelist, pattern(*.dta) capture program drop one_file program define one_file local dirname = dirname[1] local filename = filename[1] use `"`dirname'/`filename'"', clear describe, replace gen dirname = `"`dirname'"' gen filename = `"`filename'"' exit end runby one_file, by(dirname filename) verbose capture program drop supplementary program define supplementary tempfile copy save `copy' local dirname = dirname[1] local filename = filename[1] local varname = name[1] use `varname' using `"`dirname'/`filename'"', clear gen dirname = `"`dirname'"' gen filename = `"`filename'"' if inlist(`"`varname'"', "name", "dirname", "filename", "isnumeric", /// "countmissing" "countzero", "totalobs", "valmin", "valmax") /// | inlist(`"`varname'"', "charmin", "charmax") { rename `varname' varname gen name = "`varname'" local varname varname } else { gen name = `"`varname'"' } count if missing(`varname') gen countmissing = r(N) count if !missing(`varname') gen totalobs = r(N) capture confirm numeric var `varname' if c(rc) == 0 { gen isnumeric = 1 count if `varname' == 0 gen countzero = r(N) summ `varname', meanonly gen valmin = `r(min)' gen valmax = `r(max)' gen charmin = . gen charmax = . } else { gen isnumeric = 0 gen countzero = . gen valmin = . gen valmax = . gen length = length(`varname') summ length, meanonly gen charmin = `r(min)' gen charmax = `r(max)' } keep dirname filename name isnumeric countmissing countzero totalobs /// valmin valmax charmin charmax keep in 1 merge 1:1 dirname filename name using `copy', assert(match) nogenerate exit end runby supplementary, by(dirname filename name) verbose

As you can see, there are potential complications here if one of the data sets has a variable whose name is the same as one of the variables you are attempting to create here. I danced around that a bit in the code. Another pitfall here is that if you have any files that don't actually contain any observations, you will generate an error here (although the code will keep running and just skip over that situation). I think the code may also break if you have a numeric variable with all of its values missing. Again, it will just skip over that case and generate no output for it. As, I say, it would be a large undertaking to test this in full generality for all the exceptional conditions that can arise.

Notice, also, that I do not deal with date variables. Stata doesn't actually have a separate date variable type. In Stata dates are just numeric variables. There is no general way to identify which variables are intended to be dates and which are not. In general, such variables will have a %td or %tm or %tc, etc. display format. And you could look for that. But people are not always consistent in doing this, and I have also seen some data sets where a variable that is not supposed to be a date nevertheless had date display formatting applied. (Error? Something unusual in mind?) So the code above just treats date variables as the numeric variables they are.

To use this code you must install the -runby- program, by Robert Picard and me, from SSC. The rest of the code is native Stata. -runby- is a way to loop over values of variables more efficiently (in large data sets) and more simply than using nested -foreach- loops.

Added: You also need to get -filelist- from SSC; this one is also by Robert Picard.

Last edited by Clyde Schechter; 04 Apr 2018, 11:09.
2 likes
Comment
wade petty

Join Date: Mar 2018

Posts: 5
#3

04 Apr 2018, 11:24

Clyde,

Thanks so much for your detailed response! I'll work on this today and let you know how things turn out.
Comment
wade petty

Join Date: Mar 2018

Posts: 5
#4

06 Apr 2018, 13:26

Clyde, your code worked very well and is much appreciated.

I did run into a numeric variable that had no observations and the log was showing 'invalid syntax' because of attempting to use `r(min)' when there were no observations for the summ. I accounted for that by taking regular summ and conditioning on `r(N)' as follows:

Code:

if c(rc) == 0 { //numeric variables gen isnumeric = 1 count if `varname' == 0 gen countzero = r(N) summ `varname' if `r(N)'==0 { //case where a numeric variable is completely missing gen valmin = . gen valmax = . } else { gen valmin = `r(min)' gen valmax = `r(max)' } gen charmin = . gen charmax = . }

I'm guessing using the -summ `varname', meanonly- is to save time, but I needed a simple way to test for all missing on numeric variables.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30085
#5

06 Apr 2018, 14:29

The -meanonly- option is, indeed, used to speed things up. But it'svery strangely named. It sounds like it only calculates the mean. But, in fact, it calculates all of the statistics you ordinarily get with -summarize- except the standard deviation. The standard deviation takes extra time because it requires two passes through the data and also taking a square root. But you can still use -meanonly- when you need the min or max or N: they come along despite the name!
Comment

wade petty

Join Date: Mar 2018
Posts: 5

16 Apr 2018, 12:27

Figured I would add the final code I've been running to attempt to deal with the dates in case someone comes along later needing something similar. There's also a statement that tries to handle whitespace as some of the observations seem to be a single or several spaces and I'm converting them to missing here. On initial tests of a handful of datasets, the code appears to be producing the table I want. The only issue now is that it's running through maybe 2000 datasets, some of them are fairly large, and the program is taking literally days to run. In looking at this version, are there any glaring efficiency gains you can see that I could gain by doing anything differently? Thanks for any thoughts on this.

Code:

//get list of all datasets to go through
filelist, dir(`sourceDir') pattern(*.dta)
//drop merged or m datasets created after the fact for analysis
drop if strpos(filename,"_m")>0 | strpos(filename,"merged")>0 | strpos(filename,"clean")>0

//create program to get -describe- info, and convert dates where named date
capture program drop one_file
program define one_file
    local dirname = dirname[1]
    local filename = filename[1]
    use `"`dirname'/`filename'"', clear
    foreach var of varlist _all {
        capture confirm numeric var `var'
        if c(rc)>0 {
            if substr("`var'",-4,4)=="date" {
                gen `var'datevar = date(`var',"YMD")
                format `var'datevar %td
                drop `var'
                rename `var'datevar `var'
            }
        }
    }
    describe, replace clear
    gen dirname = `"`dirname'"'
    gen filename = `"`filename'"'
    exit
end

//runby will run the program agains each dataset in the list from above
runby one_file, by(dirname filename) verbose

timer list 1

//create program to get other descriptors and merge with -describe- info
capture program drop supplementary
program define supplementary
    tempfile copy
    save `copy'
    local dirname = dirname[1]
    local filename = filename[1]
    local varname = name[1]
    use `"`dirname'/`filename'"', clear
    foreach var of varlist _all {
        capture confirm numeric var `var'
        if c(rc)>0 {
            if substr("`var'",-4,4)=="date" {
                gen `var'datevar = date(`var',"YMD")
                format `var'datevar %td
                drop `var'
                rename `var'datevar `var'
            }
        }
    }
    tempfile withdates
    save `withdates'
    use `varname' using `withdates', clear
    gen dirname = `"`dirname'"'
    gen filename = `"`filename'"'
    if inlist(`"`varname'"', "name", "dirname", "filename", "isnumeric", ///
        "countmissing" "countzero", "totalobs", "valmin", "valmax") ///
        | inlist(`"`varname'"', "charmin", "charmax", "datemin", "datemax") {
        rename `varname' varname
        gen name = "`varname'"
        local varname varname
    }
    else {
        gen name = `"`varname'"'
    }
    display dirname  //so I can see how far along the progress is
    display filename
    display name
    display "`varname'"
    capture confirm numeric var `varname'
    if c(rc)> 0 {
        replace `varname' = "" if trim(`varname') == ""
    }
    count if missing(`varname')
    gen countmissing = r(N)
    count if !missing(`varname')
    gen totalobs = r(N)
    capture confirm numeric var `varname'
    if c(rc) == 0 {
        gen isnumeric =  1
        count if `varname' == 0
        gen countzero = r(N)
        summ `varname', meanonly
        if `r(N)'==0 {
            gen valmin = .
            gen valmax = .
        }
        else {
            gen valmin = `r(min)'
            gen valmax = `r(max)'
        }
        gen charmin = .
        gen charmax = .
        if substr("`varname'",-4,4)=="date" & `r(N)' > 0 {
            gen datemin = `r(min)'
            //format datemin %td
            gen datemax = `r(max)'
            //format datemax %td
        }
        else {
            gen datemin = .
            gen datemax = .
        }
    }
    else {
        gen isnumeric = 0
        gen countzero = .
        gen valmin = .
        gen valmax = .
        gen length = length(`varname')
        summ length, meanonly
        gen charmin = `r(min)'
        gen charmax = `r(max)'
        gen datemin = .
        gen datemax = .        
    }
    keep dirname filename name isnumeric countmissing countzero totalobs ///
        valmin valmax charmin charmax datemin datemax
    keep in 1
    merge 1:1 dirname filename name using `copy', assert(match) nogenerate
    exit
end

runby supplementary, by(dirname filename name) verbose

timer list 1

order position, before(name)
order type, after(name)
order format, after(type)
drop varlab
drop vallab
format datemin %td
format datemax %td

Announcement

Trying to create a table of descriptive stats with both numeric and categorical variables, need advice

Comment

Comment

Comment

Comment

Comment