  • importing correct data type

    Hi there - sorry if this has been asked before.
    When importing a huge dataset (e.g. a csv file), does Stata "guess" the correct data type for each column by reading only the first x rows, or does it scan the entire dataset before deciding the type for each column?
    Thanks

  • #2
    The exact rules do not seem to be documented.

    I would guess that since Stata has to read each line as text in order to separate the variables, it first reads all the data, creates each variable as a string, and then decides whether to convert each variable to numeric based on the values seen in the entire dataset.
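    One quick way to check this, assuming that a single non-numeric value anywhere in a column is enough to force a string type, is to build a file whose last row breaks the pattern and see what import delimited does with it (the file name here is just a placeholder):
    Code:
    clear
    set obs 2000
    generate id = _n
    generate value = string(_n)           // looks numeric in every row...
    replace value = "oops" in 2000        // ...except the very last one
    export delimited using "typetest.csv", replace

    import delimited using "typetest.csv", varnames(1) clear
    describe value    // a string storage type here would suggest the whole file was scanned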

    But do note that you can use the numericcols() and stringcols() options to tell Stata the choice you want made for particular columns.
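    A minimal sketch of forcing the choice by column position (the file name and column numbers are made up):
    Code:
    import delimited using "mydata.csv", varnames(1) stringcols(1 4) numericcols(2 3) clear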

    • #3
      Thank you very much. The stringcols() option is handy, but for my dataset it would require a lot of manual effort. My concern is that I risk losing data if I leave it to the default.

      • #4
        The example below demonstrates importing a csv dataset as strings and then checking each variable (other than those I identified as string variables) to see what non-numeric values it has.
        Code:
        . type "~/Downloads/example.csv"
        id,dog,cat,fish
        a,21,31,41
        b,22,na,42
        c,23,33,43
        
        . import delimited "~/Downloads/example.csv", varnames(1) stringcols(_all) clear
        (encoding automatically selected: ISO-8859-2)
        (4 vars, 3 obs)
        
        . list, clean
        
               id   dog   cat   fish  
          1.    a    21    31     41  
          2.    b    22    na     42  
          3.    c    23    33     43  
        
        . local strings id
        
        . ds `strings', not
        dog   cat   fish
        
        . local numeric `r(varlist)'
        
        . foreach var of local numeric {
          2.     display _newline "Variable: `var'"
          3.     tab `var' if missing(real(`var'))
          4. }
        
        Variable: dog
        no observations
        
        Variable: cat
        
                cat |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                 na |          1      100.00      100.00
        ------------+-----------------------------------
              Total |          1      100.00
        
        Variable: fish
        no observations
        
        .
        I would not import any dataset too large to scan visually without first doing an exploration like this; importing without such a check would be unprofessional.

        Step 2 would be
        Code:
        replace cat = ".n" if cat=="na"
        so that when it is converted to real, the result will be the missing value .n, which is what I use to indicate "na" values in my data. If "na" were common to multiple variables, I would just loop over the entire list of variables and do that to each, as sketched below.
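        A minimal sketch of that loop, reusing the `numeric' local defined in the log above:
        Code:
        foreach var of local numeric {
            replace `var' = ".n" if `var' == "na"
        }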

        Step 3 would be
        Code:
        destring `numeric', replace
        I don't see a lot of manual effort here.

        • #5
          Thank you again for your attention to this, very helpful!
