
  • Appending different data types

    Hello all,

    I am quite new to Stata and am already trying to get a big job done: appending 81 different files. The files are downloaded (.csv and .xlsx formats) and the data needs to be appended. The files are quite big, with 183 variables each and between 5,000 and 250,000 observations.

    However, the variables in the different files sometimes have different data types. Therefore, the only way to append them is with the force option, but this will drop some data (I read Stata tip 73: append with care!).

    The differences between files can be quite big. For example, one file has all string (strX) types, while another has the following:

                      storage   display    value
    variable name       type      format     label      variable label
    ------------------------------------------------------------------------------------
    ar1 int %td.. AR1
    ar2 str8 %9s AR2
    ar3 long %10.0g AR3
    ar4 str4 %9s AR4
    ar5 str25 %25s AR5
    ar6 str25 %25s AR6
    ar7 long %10.0g AR7
    ar8 str6 %9s AR8
    ar15 str4 %9s AR15
    ar16 byte %10.0g AR16
    ar17 byte %10.0g AR17
    ar18 byte %10.0g AR18
    ar19 str4 %9s AR19
    ar20 byte %10.0g AR20
    ar21 str4 %9s AR21
    ar22 byte %10.0g AR22
    ar23 str4 %9s AR23
    ar24 str4 %9s AR24
    ar25 byte %10.0g AR25
    ar26 str9 %9s AR26
    ar27 str4 %9s AR27
    ar28 str8 %9s AR28
    ar29 str4 %9s AR29
    ar30 byte %10.0g AR30
    ar31 str4 %9s AR31
    ar32 str4 %9s AR32
    ar33 str4 %9s AR33
    ar34 str4 %9s AR34
    ar35 str4 %9s AR35
    ar36 byte %10.0g AR36
    ar37 byte %10.0g AR37
    ar38 byte %10.0g AR38
    ar39 byte %10.0g AR39
    ar40 byte %10.0g AR40
    ar41 byte %10.0g AR41
    ar42 byte %10.0g AR42
    ar43 byte %10.0g AR43
    ar44 byte %10.0g AR44
    ar45 byte %10.0g AR45
    ar46 byte %10.0g AR46
    ar47 byte %10.0g AR47
    ar48 byte %10.0g AR48
    ar49 byte %10.0g AR49
    ar55 str7 %9s AR55
    ar56 str7 %9s AR56
    ar57 str7 %9s AR57
    ar58 str4 %9s AR58
    ar59 str4 %9s AR59
    ar60 str4 %9s AR60
    ar61 str4 %9s AR61
    ar62 byte %10.0g AR62
    ar63 byte %10.0g AR63
    ar64 byte %10.0g AR64
    ar65 str4 %9s AR65
    ar66 str9 %9s AR66
    ar67 double %10.0g AR67
    ar68 str4 %9s AR68
    ar69 str4 %9s AR69
    ar70 str4 %9s AR70
    ar71 str7 %9s AR71
    ar72 str4 %9s AR72
    ar73 byte %10.0g AR73
    ar74 byte %10.0g AR74
    ar75 byte %10.0g AR75
    ar76 byte %10.0g AR76
    ar77 byte %10.0g AR77
    ar78 str4 %9s AR78
    ar79 str4 %9s AR79
    ar80 str9 %9s AR80
    ar81 byte %10.0g AR81
    ar82 byte %10.0g AR82
    ar83 byte %10.0g AR83
    ar84 str4 %9s AR84
    ar85 byte %10.0g AR85
    ar86 byte %10.0g AR86
    ar87 str4 %9s AR87
    ar88 byte %10.0g AR88
    ar89 byte %10.0g AR89
    ar90 byte %10.0g AR90
    ar91 str4 %9s AR91
    ar92 byte %10.0g AR92
    ar93 byte %10.0g AR93
    ar94 str9 %9s AR94
    ar95 str4 %9s AR95
    ar96 str4 %9s AR96
    ar97 byte %10.0g AR97
    ar98 byte %10.0g AR98
    ar99 byte %10.0g AR99
    ar100 byte %10.0g AR100
    ar101 str4 %9s AR101
    ar107 str4 %9s AR107
    ar108 str4 %9s AR108
    ar109 double %10.0g AR109
    ar110 str4 %9s AR110
    ar111 str4 %9s AR111
    ar112 byte %10.0g AR112
    ar113 str4 %9s AR113
    ar114 str7 %9s AR114
    ar115 str4 %9s AR115
    ar116 str4 %9s AR116
    ar117 str4 %9s AR117
    ar118 str4 %9s AR118
    ar119 str4 %9s AR119
    ar120 byte %10.0g AR120
    ar121 str4 %9s AR121
    ar122 byte %10.0g AR122
    ar128 str5 %9s AR128
    ar129 str4 %9s AR129
    ar130 str4 %9s AR130
    ar131 str4 %9s AR131
    ar132 str4 %9s AR132
    ar133 byte %10.0g AR133
    ar134 byte %10.0g AR134
    ar135 str6 %9s AR135
    ar136 str10 %10s AR136
    ar137 str4 %9s AR137
    ar138 str7 %9s AR138
    ar139 byte %10.0g AR139
    ar140 byte %10.0g AR140
    ar141 str6 %9s AR141
    ar142 byte %10.0g AR142
    ar143 str10 %10s AR143
    ar144 str4 %9s AR144
    ar145 str7 %9s AR145
    ar146 byte %10.0g AR146
    ar147 byte %10.0g AR147
    ar148 byte %10.0g AR148
    ar149 str10 %10s AR149
    ar150 str4 %9s AR150
    ar151 byte %10.0g AR151
    ar152 byte %10.0g AR152
    ar153 byte %10.0g AR153
    ar154 byte %10.0g AR154
    ar155 byte %10.0g AR155
    ar156 str4 %9s AR156
    ar157 byte %10.0g AR157
    ar158 str4 %9s AR158
    ar159 str4 %9s AR159
    ar160 str4 %9s AR160
    ar166 byte %10.0g AR166
    ar167 byte %10.0g AR167
    ar168 byte %10.0g AR168
    ar169 str7 %9s AR169
    ar170 str4 %9s AR170
    ar171 str7 %9s AR171
    ar172 str7 %9s AR172
    ar173 byte %10.0g AR173
    ar174 str4 %9s AR174
    ar175 str7 %9s AR175
    ar176 byte %10.0g AR176
    ar177 str4 %9s AR177
    ar178 str4 %9s AR178
    ar179 str4 %9s AR179
    ar180 str4 %9s AR180
    ar181 str4 %9s AR181
    ar182 byte %10.0g AR182
    ar183 str4 %9s AR183

    Do you guys have a possible solution for this problem?


    Regards,

    Danny

  • #2
    This is a very common problem when trying to append large numbers of spreadsheet files. Spreadsheets impose no discipline on the types of data that can be entered in the same column. When you import them to Stata, there is a rigid distinction between string and numeric types, causing the situation you find yourself in. It is then incumbent on you to create order from the chaos.

    So the first thing you need to decide is which of these variables should actually be strings, and which should actually be numbers. Only you know what these data sets are about, so only you can make this decision. Then you have to set up local macros identifying those two subsets of the variables. Finally, in your append loop you include code that forces those variables into those types before you append.

    Just to illustrate how this might look, I'll make the simplifying assumption that ar1 through ar90 should always be numeric, and ar91 through ar183 should always be strings.

    Code:
    // CREATE MACROS IDENTIFYING STRING AND NUMERIC
    // VARIABLES--CHANGE THIS TO MATCH YOUR ACTUAL
    // SITUATION
    local numeric_vars
    forvalues i = 1/90 {
         local numeric_vars `numeric_vars' ar`i'
    }
    local string_vars
    forvalues i = 91/183 {
        local string_vars `string_vars' ar`i'
    }
    
    // LOCAL MACRO CONTAINING NAMES OF THE DATA FILES
    local filenames whatever 
    
    clear
    tempfile building
    save `building', emptyok
    foreach f of local filenames {
        import excel using `"`f'"', clear // ADD APPROPRIATE OPTIONS
        destring `numeric_vars', replace
        tostring `string_vars', replace
        append using `building'
        save `"`building'"', replace
    }
    Obviously this is just a template and you need to tailor it to your specific situation. It may be that the variables you want to have as strings and those you want to have as numbers do not fall into convenient blocks like 1 through 90 and 91 through 183. You may, in fact, be reduced to listing them out one-by-one: that is tedious, but only needs to be done once.
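    If the two groups do not fall into neat blocks, one way to avoid typing the lists by hand is to let -ds- classify the variables of a single reference file that you trust to have every variable in its correct type, and capture r(varlist). This is just a sketch; "reference_file" is a placeholder:

    Code:
    use "reference_file", clear   // a file in which every variable already has its correct type
    ds, has(type numeric)
    local numeric_vars `r(varlist)'
    ds, has(type string)
    local string_vars `r(varlist)'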

    Note that the -destring- and -tostring- commands will give you a warning message when the variables they encounter are already numeric or string, respectively. But they won't halt execution: that's acceptable. -destring- will, however, refuse to convert a variable that contains non-numeric content (it stays string). If you encounter this situation, it means that your expectation that the variable should always be numeric is not upheld in the data. You then need to see whether the data file is in error or your expectation was wrong.

    Another problem you may encounter is that -tostring- may refuse to convert a floating-point number to a string due to loss of precision. Generally speaking, the precision lost will be in the far decimal places that you don't really care about anyway, so you can add the -force- option to the -tostring- command. Be aware that the resulting strings may then differ slightly from what you expect in the low-order digits; don't panic over it. In fact, if you are converting these variables to strings, you presumably don't plan to do calculations with them, so precision issues are not terribly important. The only real problem would be with something like a numeric identifier variable (think social security numbers), where getting the last digit wrong is a real problem. I don't know whether any of your variables are like that. Looking at the output you posted, nearly all of your numeric variables are small integers, which won't give rise to this problem; it arises only with floats and doubles.
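    As a hedged sketch, the -format()- option of -tostring- lets you control how many significant digits survive a forced conversion (ar67 here is one of the double variables from the posted -describe- output):

    Code:
    // force conversion of a double to string, keeping up to 12
    // significant digits; anything beyond that is lost
    tostring ar67, replace format(%12.0g) force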



    • #3
      Dear Mr. Schechter,

      Thank you for your help and showing me the macro.

      You are right that I know which variables should be strings and which should be numbers. However, in some files missing data is simply left blank (which is fine for appending), but in others it is recorded as N.A. (not available). This creates problems when I want to destring the data. Therefore, I am now considering tostring-ing all variables in all data files (tostring *, replace), then appending, and afterwards looking at the necessary variables and changing the N.A. entries to missing (.). Finally, I can destring the variables that should be numbers. I think this saves time because I can inspect the total dataset after appending and fix the N.A. text wherever the variable should be numeric, instead of doing this for each of the 81 individual files.
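      The plan described above might be sketched like this (numeric_vars is an assumed local macro listing the variables that should end up numeric):

      Code:
      tostring *, replace                   // everything becomes string
      * ... append the 81 files here ...
      foreach v of local numeric_vars {
          quietly replace `v' = "" if `v' == "N.A."
      }
      destring `numeric_vars', replace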

      Do you think this is a reasonable solution?

      Regards,
      Danny



      • #4
        My recommendation to all appenders is to import the data as string, append, and then destring. The import delimited command has the option

        Code:
        stringcols(numlist|_all)
        The import excel command has the option
        Code:
        allstring
        Once the data is appended into a single big dataset, just do a batch destring. Any variable that contains non-numeric characters will remain string and you can then take action depending on the particulars of that variable.

        Code:
        . webuse destring1
        
        . replace income = "N.A." in 2
        (1 real change made)
        
        . destring *, replace
        id has all characters numeric; replaced as int
        num has all characters numeric; replaced as int
        code has all characters numeric; replaced as int
        total has all characters numeric; replaced as long
        income contains nonnumeric characters; no replace
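        A minimal sketch of this import-as-string-then-destring workflow, with placeholder filenames:

        Code:
        clear
        tempfile combined
        save `combined', emptyok
        foreach f in file1.csv file2.csv {              // placeholder filenames
            import delimited using "`f'", stringcols(_all) clear
            append using `combined'
            save `combined', replace
        }
        destring *, replace    // variables with non-numeric content stay string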



        • #5
          Thank you for this thread; quite useful. I am new to this as well and have a question about it. What happens when you have date variables? Do you have to do anything special?



          • #6
            Well, if the variables in the spreadsheets are all valid dates and the Excel spreadsheet has applied its own date formatting to them, there is no problem. -import excel- will recognize this and bring them in as Stata internal-format numeric date variables, and will even apply a nice display format so that you see them as dates when you -list- or -browse- them.

            But my experience with spreadsheet data is that it is often very anarchic. Date variables in particular tend to be entered in haphazard ways, and Excel and Stata are both easily confused by the bewildering variety of formats used. This will sometimes result in a "date" variable being imported as a string variable into Stata, often with variations in the string representation. So in this string variable, you may find one observation says "1-Jan-2015" and another says "Jan 1, 2015" and another says "20150101". So converting this kind of thing into a single Stata internal format date variable involves several lines of code, as each format requires a separate command.
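            To make this concrete, here is a hedged sketch for exactly the three variants mentioned above (the variable name strdate and the set of formats are assumptions; real data may need more passes):

            Code:
            gen double mydate = daily(strdate, "DMY")                  // "1-Jan-2015"
            replace mydate = daily(strdate, "MDY") if missing(mydate)  // "Jan 1, 2015"
            replace mydate = daily(strdate, "YMD") if missing(mydate)  // "20150101"
            format mydate %td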

            Spreadsheets obtained from professional data management organizations or people are usually better, but even there, errors are not at all rare. Regardless of source, you would be wise to inspect each spreadsheet by eye for this kind of problem ahead of time. And again, importing each spreadsheet into a separate Stata .dta file and cleaning those individually before finally appending the cleaned versions together is usually easier.



            • #7
              Thank you so much for this response.

              Kindly see the code below:



              filelist, p("*.xls*") list
              gen fileid = _n
              save "RADET2016v2", replace
              * loop over each file and save a copy of the imported data
              use "RADET2016v2", clear
              local obs = _N
              forvalues i = 1/`obs' {
                  quiet use "RADET2016v2" in `i'
                  local f = dirname + "/" + filename
                  quiet xls2dta using "`f'", allsheets(2*) cellrange(B2:Z5001) clear replace
              }

              clear

              !dir *_2016.dta /a-d /o /b >filelist16.txt
              file open myfile using filelist16.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2015.dta /a-d /o /b >filelist15.txt
              file open myfile using filelist15.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2014.dta /a-d /o /b >filelist14.txt
              file open myfile using filelist14.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              !dir *_2013.dta /a-d /o /b >filelist13.txt
              file open myfile using filelist13.txt, read
              file read myfile line
              while r(eof)==0 { /* while you're not at the end of the file */
                  quiet append using `line'
                  file read myfile line
              }
              file close myfile

              .... {all the way to 2004}

              We have an Excel tool that has data on tabs labelled from 2004 to 2016. This is health facility data; essentially each facility has 13 worksheets in each tool, and we have over 1,000 health facilities. The code above runs to extract each of the sheets and then combine all the data into one Stata file. We have used this code for more than two years now, and to a large extent it gets the job done. On occasion, we get a dataset with a lot of missing observations and have to resort to other methods to pull the data out (including manual extraction).

              A few days ago, I tried to run this code for the tools from health facilities in 2 states and then started experiencing this error:


              . !dir *_2016.dta /a-d /o /b >filelist16.txt

              . file open myfile using filelist16.txt, read

              . file read myfile line

              . while r(eof)==0 { /* while you're not at the end of the file */
              2. quiet append using `line', force
              3. file read myfile line
              4. }
              invalid 'force'
              r(198);

              end of do-file

              When I remove the force, I get this error instead:


              . !dir *_2016.dta /a-d /o /b >filelist16.txt

              . file open myfile using filelist16.txt, read

              . file read myfile line

              . while r(eof)==0 { /* while you're not at the end of the file */
              2. quiet append using `line'
              3. file read myfile line
              4. }
              variable H is byte in master but str5 in using data
              You could specify append's force option to ignore this numeric/string mismatch. The using variable would then be treated as if it contained numeric missing value.


              r(106);

              end of do-file

              r(106);

              In the past two years, this is the first time I have experienced this error, and I am not sure what to do with it. I understand from the thread above what is happening with the second error, but can't quite figure out what is happening with the first. (I hope this makes sense.)

              Your review and advice are most appreciated.

              Thank you.



              • #8
                Well, it's very hard to know exactly what's going on here. My hunch is that `line' contains something funky that is causing Stata to choke on it and misinterpret the options part of your -append- command when you specify -force-. I would insert a -display `"`line'"'- command right before the -append- command to see just what's being fed to -append-. You may find something there that explains it all. (In particular, my best guess is that there is a comma character in `line' causing the problem.)
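                For concreteness, the diagnostic suggested above might look like this inside one of the loops (a sketch; wrapping `line' in compound double quotes also protects against commas in file names):

                Code:
                file open myfile using filelist16.txt, read
                file read myfile line
                while r(eof)==0 {
                    display `"`line'"'              // inspect what -append- will receive
                    quiet append using `"`line'"'
                    file read myfile line
                }
                file close myfile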



                • #9
                  Should you have difficulty finding the culprit(s), you may use -force-, then check the 'new' missing data and edit it.
                  Best regards,

                  Marcos



                  • #10
                    Thank you, Clyde. I found a comma in one of the file names, and this is likely the culprit. I will run the code again and let you know the outcome. The help is most appreciated. Thank you, Marcos, too.



                    • #11
                      Hi, I have a similar question. I have two different .dta files for two states that have exactly the same variable names, but the variable types differ between the two files. For example, if Statecode is byte in dta 1, it is str5 in dta 2; there is no pattern to it.
                      I wish to append the two .dta files, and I am aware of the problems of using the force option with the append command.

                      I tried to do what Clyde suggested; here is my code, but it did not work.

                      use "Z:\Public Good\Bihar.dta" //dta 1 which is my master//

                      ds, has(type numeric)
                      local numeric_var "`n(varlist)'"

                      ds, has(type string)
                      local string_var "`s(varlist)'"


                      // LOCAL MACRO CONTAINING NAMES OF THE DATA FILES
                      local filenames "Z:\Public Good\UP.dta" //dta 2 which has to be appended to dta 1//
                      clear
                      tempfile bihar_up
                      save `bihar_up', emptyok
                      foreach f of local filenames {
                          destring `numeric_var', replace
                          tostring `string_var', replace
                          append using `bihar_up'
                          save `"`bihar_up'"', replace
                      }

                      Could anyone suggest?
                      Last edited by Priyoma Mustafi; 31 Jul 2018, 11:49.

