
  • Appending different data types

    Hello all,

    I am quite new to Stata and am already trying to get a big job done: appending 81 different files. The files are downloaded (.csv and .xlsx formats) and the data needs to be appended. The files are quite big, with 183 variables each and between 5,000 and 250,000 observations.

    However, the variables in the different files sometimes have different data types. Therefore, the only way to append them is with the force option, but this will drop some data (I read Stata tip 73: append with care!).

    The differences between files can be quite big. For example, one file has all string (strX) types, while another has the following:

                      storage   display    value
    variable name       type      format     label      variable label
    ------------------------------------------------------------------------------------
    ar1 int %td.. AR1
    ar2 str8 %9s AR2
    ar3 long %10.0g AR3
    ar4 str4 %9s AR4
    ar5 str25 %25s AR5
    ar6 str25 %25s AR6
    ar7 long %10.0g AR7
    ar8 str6 %9s AR8
    ar15 str4 %9s AR15
    ar16 byte %10.0g AR16
    ar17 byte %10.0g AR17
    ar18 byte %10.0g AR18
    ar19 str4 %9s AR19
    ar20 byte %10.0g AR20
    ar21 str4 %9s AR21
    ar22 byte %10.0g AR22
    ar23 str4 %9s AR23
    ar24 str4 %9s AR24
    ar25 byte %10.0g AR25
    ar26 str9 %9s AR26
    ar27 str4 %9s AR27
    ar28 str8 %9s AR28
    ar29 str4 %9s AR29
    ar30 byte %10.0g AR30
    ar31 str4 %9s AR31
    ar32 str4 %9s AR32
    ar33 str4 %9s AR33
    ar34 str4 %9s AR34
    ar35 str4 %9s AR35
    ar36 byte %10.0g AR36
    ar37 byte %10.0g AR37
    ar38 byte %10.0g AR38
    ar39 byte %10.0g AR39
    ar40 byte %10.0g AR40
    ar41 byte %10.0g AR41
    ar42 byte %10.0g AR42
    ar43 byte %10.0g AR43
    ar44 byte %10.0g AR44
    ar45 byte %10.0g AR45
    ar46 byte %10.0g AR46
    ar47 byte %10.0g AR47
    ar48 byte %10.0g AR48
    ar49 byte %10.0g AR49
    ar55 str7 %9s AR55
    ar56 str7 %9s AR56
    ar57 str7 %9s AR57
    ar58 str4 %9s AR58
    ar59 str4 %9s AR59
    ar60 str4 %9s AR60
    ar61 str4 %9s AR61
    ar62 byte %10.0g AR62
    ar63 byte %10.0g AR63
    ar64 byte %10.0g AR64
    ar65 str4 %9s AR65
    ar66 str9 %9s AR66
    ar67 double %10.0g AR67
    ar68 str4 %9s AR68
    ar69 str4 %9s AR69
    ar70 str4 %9s AR70
    ar71 str7 %9s AR71
    ar72 str4 %9s AR72
    ar73 byte %10.0g AR73
    ar74 byte %10.0g AR74
    ar75 byte %10.0g AR75
    ar76 byte %10.0g AR76
    ar77 byte %10.0g AR77
    ar78 str4 %9s AR78
    ar79 str4 %9s AR79
    ar80 str9 %9s AR80
    ar81 byte %10.0g AR81
    ar82 byte %10.0g AR82
    ar83 byte %10.0g AR83
    ar84 str4 %9s AR84
    ar85 byte %10.0g AR85
    ar86 byte %10.0g AR86
    ar87 str4 %9s AR87
    ar88 byte %10.0g AR88
    ar89 byte %10.0g AR89
    ar90 byte %10.0g AR90
    ar91 str4 %9s AR91
    ar92 byte %10.0g AR92
    ar93 byte %10.0g AR93
    ar94 str9 %9s AR94
    ar95 str4 %9s AR95
    ar96 str4 %9s AR96
    ar97 byte %10.0g AR97
    ar98 byte %10.0g AR98
    ar99 byte %10.0g AR99
    ar100 byte %10.0g AR100
    ar101 str4 %9s AR101
    ar107 str4 %9s AR107
    ar108 str4 %9s AR108
    ar109 double %10.0g AR109
    ar110 str4 %9s AR110
    ar111 str4 %9s AR111
    ar112 byte %10.0g AR112
    ar113 str4 %9s AR113
    ar114 str7 %9s AR114
    ar115 str4 %9s AR115
    ar116 str4 %9s AR116
    ar117 str4 %9s AR117
    ar118 str4 %9s AR118
    ar119 str4 %9s AR119
    ar120 byte %10.0g AR120
    ar121 str4 %9s AR121
    ar122 byte %10.0g AR122
    ar128 str5 %9s AR128
    ar129 str4 %9s AR129
    ar130 str4 %9s AR130
    ar131 str4 %9s AR131
    ar132 str4 %9s AR132
    ar133 byte %10.0g AR133
    ar134 byte %10.0g AR134
    ar135 str6 %9s AR135
    ar136 str10 %10s AR136
    ar137 str4 %9s AR137
    ar138 str7 %9s AR138
    ar139 byte %10.0g AR139
    ar140 byte %10.0g AR140
    ar141 str6 %9s AR141
    ar142 byte %10.0g AR142
    ar143 str10 %10s AR143
    ar144 str4 %9s AR144
    ar145 str7 %9s AR145
    ar146 byte %10.0g AR146
    ar147 byte %10.0g AR147
    ar148 byte %10.0g AR148
    ar149 str10 %10s AR149
    ar150 str4 %9s AR150
    ar151 byte %10.0g AR151
    ar152 byte %10.0g AR152
    ar153 byte %10.0g AR153
    ar154 byte %10.0g AR154
    ar155 byte %10.0g AR155
    ar156 str4 %9s AR156
    ar157 byte %10.0g AR157
    ar158 str4 %9s AR158
    ar159 str4 %9s AR159
    ar160 str4 %9s AR160
    ar166 byte %10.0g AR166
    ar167 byte %10.0g AR167
    ar168 byte %10.0g AR168
    ar169 str7 %9s AR169
    ar170 str4 %9s AR170
    ar171 str7 %9s AR171
    ar172 str7 %9s AR172
    ar173 byte %10.0g AR173
    ar174 str4 %9s AR174
    ar175 str7 %9s AR175
    ar176 byte %10.0g AR176
    ar177 str4 %9s AR177
    ar178 str4 %9s AR178
    ar179 str4 %9s AR179
    ar180 str4 %9s AR180
    ar181 str4 %9s AR181
    ar182 byte %10.0g AR182
    ar183 str4 %9s AR183

    Do you guys have a possible solution for this problem?


    Regards,

    Danny

  • #2
    This is a very common problem when trying to append large numbers of spreadsheet files. Spreadsheets impose no discipline on the types of data that can be entered in the same column. When you import them to Stata, there is a rigid distinction between string and numeric types, causing the situation you find yourself in. It is then incumbent on you to create order from the chaos.

    So the first thing you need to decide is which of these variables should actually be strings, and which should actually be numbers. Only you know what these data sets are about, so only you can make this decision. Then you have to set up local macros identifying those two subsets of the variables. Finally, in your append loop you include code that forces those variables into those types before you append.

    Just to illustrate how this might look, I'll make the simplifying assumption that ar1 through ar90 should always be numeric, and ar91 through ar183 should always be strings.

    Code:
    // CREATE MACROS IDENTIFYING STRING AND NUMERIC
    // VARIABLES--CHANGE THIS TO MATCH YOUR ACTUAL
    // SITUATION
    local numeric_vars
    forvalues i = 1/90 {
         local numeric_vars `numeric_vars' ar`i'
    }
    local string_vars
    forvalues i = 91/183 {
        local string_vars `string_vars' ar`i'
    }
    
    // LOCAL MACRO CONTAINING NAMES OF THE DATA FILES
    local filenames whatever 
    
    clear
    tempfile building
    save `building', emptyok
    foreach f of local filenames {
        import excel using `"`f'"', clear // ADD APPROPRIATE OPTIONS
        destring `numeric_vars', replace
        tostring `string_vars', replace
        append using `building'
        save `"`building'"', replace
    }
    Obviously this is just a template and you need to tailor it to your specific situation. It may be that the variables you want to have as strings and those you want to have as numbers do not fall into convenient blocks like 1 through 90 and 91 through 183. You may, in fact, be reduced to listing them out one-by-one: that is tedious, but only needs to be done once.
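    If the two groups do not fall into neat blocks, one way to avoid typing the lists by hand is to let -ds- classify the variables of a single reference file that you trust to have every variable in its correct type, and capture r(varlist). This is just a sketch; "reference_file" is a placeholder:

    Code:
    use "reference_file", clear   // a file in which every variable already has its correct type
    ds, has(type numeric)
    local numeric_vars `r(varlist)'
    ds, has(type string)
    local string_vars `r(varlist)'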

    Note that the -destring- and -tostring- commands will give you a warning message when the variables they encounter are already numeric or string, respectively. But they won't halt execution: that's acceptable. -destring- will, however, refuse to convert a variable that contains non-numeric content (it stays string). If you encounter this situation, it means that your expectation that the variable should always be numeric is not upheld in the data. You then need to see whether the data file is in error or your expectation was wrong.

    Another problem you may encounter is that -tostring- may refuse to convert a floating-point number to a string due to loss of precision. Generally speaking, the precision lost will be in the far decimal places that you don't really care about anyway, so you can add the -force- option to the -tostring- command. Be aware that the resulting strings may then differ slightly from what you expect in the low-order digits; don't panic over it. In fact, if you are converting these variables to strings, you presumably don't plan to do calculations with them, so precision issues are not terribly important. The only real problem would be with something like a numeric identifier variable (think social security numbers), where getting the last digit wrong is a real problem. I don't know whether any of your variables are like that. Looking at the output you posted, nearly all of your numeric variables are small integers, which won't give rise to this problem; it arises only with floats and doubles.
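    As a hedged sketch, the -format()- option of -tostring- lets you control how many significant digits survive a forced conversion (ar67 here is one of the double variables from the posted -describe- output):

    Code:
    // force conversion of a double to string, keeping up to 12
    // significant digits; anything beyond that is lost
    tostring ar67, replace format(%12.0g) force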



    • #3
      Dear Mr. Schechter,

      Thank you for your help and showing me the macro.

      You are right that I know which variables should be strings and which should be numbers. However, in some files missing data is simply left blank (which is fine for appending), but in others it is recorded as N.A. (not available). This creates problems when I want to destring the data. Therefore, I am now considering tostring-ing all variables in all data files (tostring *, replace), then appending, and afterwards looking at the necessary variables and changing the N.A. entries to missing (.). Finally, I can destring the variables that should be numbers. I think this saves time because I can inspect the total dataset after appending and fix the N.A. text wherever the variable should be numeric, instead of doing this for each of the 81 individual files.
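      The plan described above might be sketched like this (numeric_vars is an assumed local macro listing the variables that should end up numeric):

      Code:
      tostring *, replace                   // everything becomes string
      * ... append the 81 files here ...
      foreach v of local numeric_vars {
          quietly replace `v' = "" if `v' == "N.A."
      }
      destring `numeric_vars', replace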

      Do you think this is a reasonable solution?

      Regards,
      Danny



      • #4
        My recommendation to all appenders is to import the data as string, append, and then destring. The import delimited command has the option

        Code:
        stringcols(numlist|_all)
        The import excel command has the option
        Code:
        allstring
        Once the data is appended into a single big dataset, just do a batch destring. Any variable that contains non-numeric characters will remain string and you can then take action depending on the particulars of that variable.

        Code:
        . webuse destring1
        
        . replace income = "N.A." in 2
        (1 real change made)
        
        . destring *, replace
        id has all characters numeric; replaced as int
        num has all characters numeric; replaced as int
        code has all characters numeric; replaced as int
        total has all characters numeric; replaced as long
        income contains nonnumeric characters; no replace
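        A minimal sketch of this import-as-string-then-destring workflow, with placeholder filenames:

        Code:
        clear
        tempfile combined
        save `combined', emptyok
        foreach f in file1.csv file2.csv {              // placeholder filenames
            import delimited using "`f'", stringcols(_all) clear
            append using `combined'
            save `combined', replace
        }
        destring *, replace    // variables with non-numeric content stay string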



        • #5
          Thank you for this thread; quite useful. I am new to this as well and have a question about it. What happens when you have date variables? Do you have to do anything special?



          • #6
            Well, if the variables in the spreadsheets are all valid dates and the Excel spreadsheet has applied its own date formatting to them, there is no problem. -import excel- will recognize this and bring them in as Stata internal-format numeric date variables, and will even apply a nice display format so that you see them as dates when you -list- or -browse- them.

            But my experience with spreadsheet data is that it is often very anarchic. Date variables in particular tend to be entered in haphazard ways, and Excel and Stata are both easily confused by the bewildering variety of formats used. This will sometimes result in a "date" variable being imported as a string variable into Stata, often with variations in the string representation. So in this string variable, you may find one observation says "1-Jan-2015" and another says "Jan 1, 2015" and another says "20150101". So converting this kind of thing into a single Stata internal format date variable involves several lines of code, as each format requires a separate command.
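            To make this concrete, here is a hedged sketch for exactly the three variants mentioned above (the variable name strdate and the set of formats are assumptions; real data may need more passes):

            Code:
            gen double mydate = daily(strdate, "DMY")                  // "1-Jan-2015"
            replace mydate = daily(strdate, "MDY") if missing(mydate)  // "Jan 1, 2015"
            replace mydate = daily(strdate, "YMD") if missing(mydate)  // "20150101"
            format mydate %td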

            Spreadsheets obtained from professional data management organizations or people are usually better, but even there, errors are not at all rare. Regardless of source, you would be wise to inspect each spreadsheet by eye for this kind of problem ahead of time. And again, importing each spreadsheet into a separate Stata .dta file and cleaning those individually before finally appending the cleaned versions together is usually easier.



            • #7
              Thank you so much for this response.

              Kindly see the code below:



              filelist, p("*.xls*") list
              gen fileid = _n
              save "RADET2016v2", replace
              * loop over each file and save a copy of the imported data
              use "RADET2016v2", clear
              local obs = _N
              forvalues i = 1/`obs' {
                  quiet use "RADET2016v2" in `i'
                  local f = dirname + "/" + filename
                  quiet xls2dta using "`f'", allsheets(2*) cellrange(B2:Z5001) clear replace
              }

              clear

              !dir *_2016.dta /a-d /o /b >filelist16.txt
              file open myfile using filelist16.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2015.dta /a-d /o /b >filelist15.txt
              file open myfile using filelist15.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2014.dta /a-d /o /b >filelist14.txt
              file open myfile using filelist14.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2013.dta /a-d /o /b >filelist13.txt
              file open myfile using filelist13.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              .... {all the way to 2004}

              We have an Excel tool that has data on tabs labelled from 2004 to 2016. This is health facility data; essentially each facility has 13 worksheets in each tool, and we have over 1,000 health facilities. The code above runs to extract each of the sheets and then combine all the data into one Stata file. We have used this code for more than two years now, and to a large extent it gets the job done. On occasion, we get a dataset with a lot of missing observations and have to resort to other methods to pull the data out (including manual extraction).

              A few days ago, I tried to run this code for the tools from health facilities in 2 states and then started experiencing this error:


              . !dir *_2016.dta /a-d /o /b >filelist16.txt

              . file open myfile using filelist16.txt, read

              . file read myfile line

              . while r(eof)==0 { /* while you're not at the end of the file */
              2. quiet append using `line', force
              3. file read myfile line
              4. }
              invalid 'force'
              r(198);

              end of do-file

              When I remove the force, I get this error instead:


              . !dir *_2016.dta /a-d /o /b >filelist16.txt

              . file open myfile using filelist16.txt, read

              . file read myfile line

              . while r(eof)==0 { /* while you're not at the end of the file */
              2. quiet append using `line'
              3. file read myfile line
              4. }
              variable H is byte in master but str5 in using data
              You could specify append's force option to ignore this numeric/string mismatch. The using variable would then be treated as if it contained numeric missing value.


              r(106);

              end of do-file

              r(106);

              In the past two years, this is the first time I have experienced this error, and I am not sure what to do with it. I understand from the thread above what is happening with the second error, but can't quite figure out what is happening with the first. (I hope this makes sense.)

              Your review and advice are most appreciated.

              Thank you.



              • #8
                Well, it's very hard to know exactly what's going on here. My hunch is that `line' contains something funky that is causing Stata to choke on it and misinterpret the options part of your -append- command when you specify -force-. I would insert a -display `"`line'"'- command right before the -append- command to see just what's being fed to -append-. You may find something there that explains it all. (In particular, my best guess is that there is a comma character in `line' causing the problem.)
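                For concreteness, the diagnostic suggested above might look like this inside one of the loops (a sketch; wrapping `line' in compound double quotes also protects against commas in file names):

                Code:
                file open myfile using filelist16.txt, read
                file read myfile line
                while r(eof)==0 {
                    display `"`line'"'              // inspect what -append- will receive
                    quiet append using `"`line'"'
                    file read myfile line
                }
                file close myfile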



                • #9
                  Should you have difficulty finding the culprit(s), you may use -force-, then check the 'new' missing data and edit it.
                  Best regards,

                  Marcos



                  • #10
                    Thank you, Clyde. I found a comma in one of the file names, and this is likely the culprit. I will run the code again and let you know the outcome. The help is most appreciated. Thank you, Marcos, too.



                    • #11
                      Hi, I have a similar question. I have two different .dta files for two states that have exactly the same variable names, but the variable types differ between the two files. For example, if Statecode is byte in dta 1, it is str5 in dta 2; there is no pattern to it.
                      I wish to append the two .dta files, and I am aware of the problems of using the force option with the append command.

                      I tried to do what Clyde suggested; here is my code, but it did not work.

                      use "Z:\Public Good\Bihar.dta" //dta 1 which is my master//

                      ds, has(type numeric)
                      local numeric_var "`n(varlist)'"

                      ds, has(type string)
                      local string_var "`s(varlist)'"


                      // LOCAL MACRO CONTAINING NAMES OF THE DATA FILES
                      local filenames "Z:\Public Good\UP.dta" //dta 2 which has to be appended to dta 1//
                      clear
                      tempfile bihar_up
                      save `bihar_up', emptyok
                      foreach f of local filenames {
                          destring `numeric_var', replace
                          tostring `string_var', replace
                          append using `bihar_up'
                          save `"`bihar_up'"', replace
                      }

                      Could anyone suggest?
                      Last edited by Priyoma Mustafi; 31 Jul 2018, 11:49.

