Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending with a Variable that has Different Data Types

    Hello,

    I am currently working with some UN COMTRADE export data. I obtained a zipped folder with CSV files for each country-year I am interested in. After unzipping these, I used the following adapted code to get the files in DTA format:

    Code:
    // Getting CSV Files in DTA Files: https://www.statalist.org/forums/forum/general-stata-discussion/general/74053-import-multiple-csv-files
    clear
    cd "C:\Users\natal\OneDrive\Documents\Research\3rd Year Project\UN COMTRADE Data\WITS\COMTRADE1"
    local myfilelist : dir . files"*.csv"
    
    foreach file of local myfilelist {
        drop _all
        insheet using "`file'"
        local outfile = subinstr("`file'",".csv","",.)
        save "`outfile'", replace
    }
    This seems to have worked fine. The variables I have in each of these are:

    -----------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    -----------------------------------------------------------------------
    reporteriso3 str3 %9s ReporterISO3
    partneriso3 str3 %9s PartnerISO3
    productcode str6 %9s ProductCode
    year int %8.0g Year
    tradeflow byte %8.0g TradeFlow
    tradevalue float %9.0g TradeValue
    quantity double %10.0g Quantity
    qtytoken byte %8.0g QtyToken
    ---------------------------------------

    The problem is when I try to append these country-year files. This is the code I've landed on at the moment:

    Code:
    tempfile building
    save `building', emptyok
    
    // Appending all DTA Files: https://stats.idre.ucla.edu/stata/faq/how-can-i-combine-a-large-number-of-files/
    ! dir *.dta /a-d /b > filelist.txt
    
    file open myfile using filelist.txt, read
    file read myfile line
    
    foreach i in `line' {
        while r(eof)==0 {
            use `line', clear
            append using `building'
            save `"`building'"', replace
            file read myfile line
        }
    }
    
    use `building', clear


    The variable, 'productcode', is sometimes stored as a 'str6' type but other times stored as something else, like long. When I try to append, I get the following:

    variable productcode is str6 in master but long in using data
    You could specify append's force option to ignore this
    string/numeric mismatch. The using variable would then be
    treated as if it contained "".
    r(106);

    When I try the force option, Stata replaces productcode from the non-master dataset type with blanks.

    I've tried looking around the Internet for solutions, like tostring, generating new variables (with and without setting the type for new variables), and merging those missing data back in (says they are not unique). Quite frustrating...

    I would be extremely grateful for any solutions the Stata community has to offer. Thank you.
    Last edited by Natalie Ho; 30 Mar 2018, 15:34.

  • #2
    This is a very common problem. Even when the source of a family of data sets is careful and fastidious, one often encounters incompatibilities between the different data sets: different names for the same variable, different variables with the same name, different storage types for the same variable, different encoding of the same variable.

    While others use different approaches, my preferred way of dealing with this is, after inspecting the data sets visually, to settle on a single specification that I want my combined data set to have and then to separately clean each individual data set forcing it into compliance with that specification. So if I decide that I want productcode to be a string in the final data set, I would go through each data set, and for those that have it as numeric, applying -tostring-. When doing that, you also have to decide upon the details of the string: will it be left-padded with zeroes if it isn't a full 6 digits? If so, you need to specify the -format(%06.0f)- option on your -tostring- command. If, on the other hand, you wanted productcode to be numeric in the combined data, then you would need to use -destring- in those data sets that have it as a string.

    If there are only a handful of incompatibilities among the data sets, then it may be just as easy to do the data cleaning in the same do-file that manages the -append-ing, but with more than that the code just gets too confusing, with too many different -if- commands controlling too many different branches in the code. That's why I generally clean each file separately at first.

    By the way, there are simpler and more efficient ways to code a mass append than what you have shown. There is nothing wrong with that code, but it is easier to do this with more modern commands.

    Code:
    clear*
    
    capture program drop one_file
    program define one_file
        local dirname = dirname[1]
        local filename = filename[1]
        use `dirname'/`filename', clear
        exit
    end
    
    filelist, pattern(*.dta)
    runby one_file, by(dirname filename) verbose
    To do it this way, you need to install the -filelist- command, by Robert Picard, from SSC and the -runby- command, by Robert Picard and me, also from SSC. Again, I think this works best if you pre-clean the individual files, but if you prefer you can incorporate the cleaning code inside program one_file. If, for example you wanted to always make productcode a string variable you would do it as:

    Code:
    clear*
    
    capture program drop one_file
    program define one_file
        local dirname = dirname[1]
        local filename = filename[1]
        use `dirname'/`filename', clear
        capture confirm string var product code
        if c(rc) === 7 {
            tostring productcode, replace
        }
        exit c(rc)
    end
    
    filelist, pattern(*.dta)
    runby one_file, by(dirname filename) verbose







    Comment


    • #3
      Hello everyone,

      I am trying to append these four other databases with mhws_rd1.dta. There are many similar variables with different types. What code can I use so that, when appending the other datasets, the variables adopt the type of the first one?

      Thank you


      use "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd1.dta", clear


      append using "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd2.dta", force


      append using "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd3.dta", force

      append using "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd4.dta", force

      append using "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd5.dta", force

      append using "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard\mhws_rd6.dta", force

      Here is the code that I tried, uncessful!

      clear *

      capture program drop one_file
      program define one_file
      * Définir le répertoire et le nom du fichier
      local dirname "C:\Users\Utilisateur\OneDrive - Université Laval\Article 1 thesis\Data Havard"
      local filename = filename[1]

      * Charger le premier fichier
      use `dirname'\`filename', clear

      * Ajouter les autres fichiers
      append using "`dirname'\mhws_rd2.dta", force
      append using "`dirname'\mhws_rd3.dta", force
      append using "`dirname'\mhws_rd4.dta", force
      append using "`dirname'\mhws_rd5.dta", force
      append using "`dirname'\mhws_rd6.dta", force

      exit
      end

      * Lister les fichiers .dta dans le répertoire
      filelist, pattern(*mhws_rd1.dta)

      * Exécuter le programme one_file pour charger et ajouter les fichiers
      runby one_file, by(dirname filename) verbose

      Comment


      • #4
        There seem to be several issues here.
        1. Please go through Statalist FAQ, especially #12. In particular, please enclose your code with CODE blocks (click the # button on the toolbar above the editor box). Without this, your code is hard to read.
        2. Your code would also be easier to read if you indent it properly -- e.g. the contents of the program ... end block should be indented.
        3. Your use of backslashes (that is, the \ character) in file paths will create problems, since the backslash is used as an escape character in Stata. In particular, in \`filename', the filename will not be read correctly as a local macro since Stata will think you are trying to escape the ` character. Please use forward slashes / in file paths instead. These are fine in Stata even on Windows machines.
        4. You have borrowed some code from #2, but mashed it up with your own code. My personal suggestion would be to NOT use filelist and runby, and the program one_file for now. That is slightly advanced usage, and not really necessary for your basic task.
        5. Now to your basic problem, of data type mismatches. The solution for this will depend partly on the exact nature of the problem. As a start, do NOT use the force option of append, and show us the error you get while appending. If you are for instance, facing a problem because some variable is a string in some datasets but numeric in another, then the solution is contained in another part of the code in #2: you should first convert the variable to a string in any dataset where it is numeric, before appending the file. You can do this as follows:
          Code:
          	capture confirm string var varname 
          	if c(rc) == 7 {
          	    tostring varname, replace
          	    }
          where you should substitute varname with the name of the problem variable.

        Comment

        Working...
        X