Hi, I'm new to data management / validation in Stata and am coming from mostly SQL/SAS.
Trying to do:
I've been asked to create a table to help validate a set of datasets against a provided dictionary, so I want descriptions of variables in the datasets.
The desired table headers are:
folder filename position field datatype isnumeric countmissing countzero totalobs valmin valmax charmin charmax datamin datemax
Folder and filename are path related, as I have folders for several regions, each with a set of files named the same for each region. These are here to tell me which file the stats are from.
Field, datatype, isnumeric, and position are outputs from using -describe-. I wanted to list varnames so that I can later compare against the datatypes from the dictionary.
Countmissing and countzero are what they sound like, though zero would apply only to fields with isnumeric flags of 1.
Totalobs is total nonmissing observations in that dataset for given folder/filename/field.
Valmin and valmax are min and max of isnumeric flagged fields, missing for isnumeric==0 fields.
Charmin and charmax are the min and max lengths of the string variables, missing for nonstring fields.
Datemin and datemax are the min and max date values for date variables, missing for nondate fields.
What I've done:
I've looped over all the folders and files using local macros of folders and filenames saving - describe, replace clear - results as datasets, adding generated folder and filename columns along the way, then appending each of these to one main dataset of overall -describe- results.
I created a prototype Excel sheet of what I want the final table to contain, but as I've had difficulty pulling all this info together in my do file, I populated the descriptive stats manually with results from interactively using various commands (min, max, strlen, count var if == 0, count if missing()). I did this only for one region/filename, but this needs to be automated and run on the rest of and future versions of the datasets.
Haven't done:
Used user created ado files. I'm open to any suggestions here, I've been trying to keep the code to just what comes with Stata 15 but at this point I just need results.
What I'm looking for:
General thoughts on how to better approach what I'm after in a Stata way, I'm willing to go any other direction because I'm getting nowhere fast. I'm looking at tabstat, but it doesn't seem to provide descriptions of nonnumeric data. I'm not stuck on using the results from -describe- as I have, if there's a better way to populate that information into a dataset/table I'm all for it.
Ideally I'd like to be able to loop through the folders and files, populate the described summary table from above, appending results along the way into one main dataset. I think there's got to be some simple way to do this that is escaping me and my google/documentation searches.
Is there a way to say "output a dataset where field = varname, position = columnnumber(varname), datatype = datatype(varname), isnumeric = if(datatype(varname) like str,0,1), valmin = if(datatype(varname) like str,".",min(varname))" etc?
Any advice is welcome, including asking for more information or simply advice on how best to get any part of what I'm looking for. System memory generally shouldn't be a concern for this.
Running Stata 15 on Windows 10, though the code shouldn't be too specific to Windows environment in case we pass it on to others on Linux.
Thanks,
Wade Petty
Trying to do:
I've been asked to create a table to help validate a set of datasets against a provided dictionary, so I want descriptions of variables in the datasets.
The desired table headers are:
folder filename position field datatype isnumeric countmissing countzero totalobs valmin valmax charmin charmax datamin datemax
Folder and filename are path related, as I have folders for several regions, each with a set of files named the same for each region. These are here to tell me which file the stats are from.
Field, datatype, isnumeric, and position are outputs from using -describe-. I wanted to list varnames so that I can later compare against the datatypes from the dictionary.
Countmissing and countzero are what they sound like, though zero would apply only to fields with isnumeric flags of 1.
Totalobs is total nonmissing observations in that dataset for given folder/filename/field.
Valmin and valmax are min and max of isnumeric flagged fields, missing for isnumeric==0 fields.
Charmin and charmax are the min and max lengths of the string variables, missing for nonstring fields.
Datemin and datemax are the min and max date values for date variables, missing for nondate fields.
What I've done:
I've looped over all the folders and files using local macros of folders and filenames saving - describe, replace clear - results as datasets, adding generated folder and filename columns along the way, then appending each of these to one main dataset of overall -describe- results.
I created a prototype Excel sheet of what I want the final table to contain, but as I've had difficulty pulling all this info together in my do file, I populated the descriptive stats manually with results from interactively using various commands (min, max, strlen, count var if == 0, count if missing()). I did this only for one region/filename, but this needs to be automated and run on the rest of and future versions of the datasets.
Haven't done:
Used user created ado files. I'm open to any suggestions here, I've been trying to keep the code to just what comes with Stata 15 but at this point I just need results.
What I'm looking for:
General thoughts on how to better approach what I'm after in a Stata way, I'm willing to go any other direction because I'm getting nowhere fast. I'm looking at tabstat, but it doesn't seem to provide descriptions of nonnumeric data. I'm not stuck on using the results from -describe- as I have, if there's a better way to populate that information into a dataset/table I'm all for it.
Ideally I'd like to be able to loop through the folders and files, populate the described summary table from above, appending results along the way into one main dataset. I think there's got to be some simple way to do this that is escaping me and my google/documentation searches.
Is there a way to say "output a dataset where field = varname, position = columnnumber(varname), datatype = datatype(varname), isnumeric = if(datatype(varname) like str,0,1), valmin = if(datatype(varname) like str,".",min(varname))" etc?
Any advice is welcome, including asking for more information or simply advice on how best to get any part of what I'm looking for. System memory generally shouldn't be a concern for this.
Running Stata 15 on Windows 10, though the code shouldn't be too specific to Windows environment in case we pass it on to others on Linux.
Thanks,
Wade Petty
Comment