Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending with a Variable that has Different Data Types

    Hello,

    I am currently working with some UN COMTRADE export data. I obtained a zipped folder with CSV files for each country-year I am interested in. After unzipping these, I used the following adapted code to get the files in DTA format:

    Code:
    // Getting CSV Files in DTA Files: https://www.statalist.org/forums/forum/general-stata-discussion/general/74053-import-multiple-csv-files
    clear
    cd "C:\Users\natal\OneDrive\Documents\Research\3rd Year Project\UN COMTRADE Data\WITS\COMTRADE1"
    local myfilelist : dir . files"*.csv"
    
    foreach file of local myfilelist {
        drop _all
        insheet using "`file'"
        local outfile = subinstr("`file'",".csv","",.)
        save "`outfile'", replace
    }
    This seems to have worked fine. The variables I have in each of these are:

    -----------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    -----------------------------------------------------------------------
    reporteriso3 str3 %9s ReporterISO3
    partneriso3 str3 %9s PartnerISO3
    productcode str6 %9s ProductCode
    year int %8.0g Year
    tradeflow byte %8.0g TradeFlow
    tradevalue float %9.0g TradeValue
    quantity double %10.0g Quantity
    qtytoken byte %8.0g QtyToken
    ---------------------------------------

    The problem is when I try to append these country-year files. This is the code I've landed on at the moment:

    Code:
    tempfile building
    save `building', emptyok
    
    // Appending all DTA Files: https://stats.idre.ucla.edu/stata/faq/how-can-i-combine-a-large-number-of-files/
    ! dir *.dta /a-d /b > filelist.txt
    
    file open myfile using filelist.txt, read
    file read myfile line
    
    foreach i in `line' {
        while r(eof)==0 {
            use `line', clear
            append using `building'
            save `"`building'"', replace
            file read myfile line
        }
    }
    
    use `building', clear


    The variable, 'productcode', is sometimes stored as a 'str6' type but other times stored as something else, like long. When I try to append, I get the following:

    variable productcode is str6 in master but long in using data
    You could specify append's force option to ignore this
    string/numeric mismatch. The using variable would then be
    treated as if it contained "".
    r(106);

    When I try the force option, Stata replaces productcode from the non-master dataset type with blanks.

    I've tried looking around the Internet for solutions, like tostring, generating new variables (with and without setting the type for new variables), and merging those missing data back in (says they are not unique). Quite frustrating...

    I would be extremely grateful for any solutions the Stata community has to offer. Thank you.
    Last edited by Natalie Ho; 30 Mar 2018, 15:34.

  • #2
    This is a very common problem. Even when the source of a family of data sets is careful and fastidious, one often encounters incompatibilities between the different data sets: different names for the same variable, different variables with the same name, different storage types for the same variable, different encoding of the same variable.

    While others use different approaches, my preferred way of dealing with this is, after inspecting the data sets visually, to settle on a single specification that I want my combined data set to have and then to separately clean each individual data set forcing it into compliance with that specification. So if I decide that I want productcode to be a string in the final data set, I would go through each data set, and for those that have it as numeric, applying -tostring-. When doing that, you also have to decide upon the details of the string: will it be left-padded with zeroes if it isn't a full 6 digits? If so, you need to specify the -format(%06.0f)- option on your -tostring- command. If, on the other hand, you wanted productcode to be numeric in the combined data, then you would need to use -destring- in those data sets that have it as a string.

    If there are only a handful of incompatibilities among the data sets, then it may be just as easy to do the data cleaning in the same do-file that manages the -append-ing, but with more than that the code just gets too confusing, with too many different -if- commands controlling too many different branches in the code. That's why I generally clean each file separately at first.

    By the way, there are simpler and more efficient ways to code a mass append than what you have shown. There is nothing wrong with that code, but it is easier to do this with more modern commands.

    Code:
    clear*
    
    capture program drop one_file
    program define one_file
        local dirname = dirname[1]
        local filename = filename[1]
        use `dirname'/`filename', clear
        exit
    end
    
    filelist, pattern(*.dta)
    runby one_file, by(dirname filename) verbose
    To do it this way, you need to install the -filelist- command, by Robert Picard, from SSC and the -runby- command, by Robert Picard and me, also from SSC. Again, I think this works best if you pre-clean the individual files, but if you prefer you can incorporate the cleaning code inside program one_file. If, for example you wanted to always make productcode a string variable you would do it as:

    Code:
    clear*
    
    capture program drop one_file
    program define one_file
        local dirname = dirname[1]
        local filename = filename[1]
        use `dirname'/`filename', clear
        capture confirm string var product code
        if c(rc) === 7 {
            tostring productcode, replace
        }
        exit c(rc)
    end
    
    filelist, pattern(*.dta)
    runby one_file, by(dirname filename) verbose







    Comment

    Working...
    X