  • importing correct data type

    Hi there - sorry if this has been asked before.
    When importing a huge dataset (e.g. a csv file), does Stata "guess" the correct data type for each column by reading only the first x rows, or does it scan the entire dataset before deciding the type for each column?
    Thanks

  • #2
    The exact rules do not seem to be documented.

    I would guess that since Stata has to read each line as text in order to separate the variables, it first reads all the data, creates each variable as a string, and then decides whether to convert each variable to numeric based on the values seen in the entire dataset.
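    One quick way to check this, assuming that a single non-numeric value anywhere in a column is enough to force a string type, is to build a file whose last row breaks the pattern and see what import delimited does with it (the file name here is just a placeholder):
    Code:
    clear
    set obs 2000
    generate id = _n
    generate value = string(_n)           // looks numeric in every row...
    replace value = "oops" in 2000        // ...except the very last one
    export delimited using "typetest.csv", replace

    import delimited using "typetest.csv", varnames(1) clear
    describe value    // a string storage type here would suggest the whole file was scanned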

    But do note that you can use the numericcols() and stringcols() options to tell Stata the choice you want made for particular columns.
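    A minimal sketch of forcing the choice by column position (the file name and column numbers are made up):
    Code:
    import delimited using "mydata.csv", varnames(1) stringcols(1 4) numericcols(2 3) clear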

    • #3
      Thank you very much. The stringcols() option is handy, but for my dataset it would require a lot of manual effort. My concern is that I risk losing data if I leave it to the default.

      • #4
        The example below demonstrates importing a csv dataset as strings and then checking each variable (other than those I identified as string variables) to see what non-numeric values it has.
        Code:
        . type "~/Downloads/example.csv"
        id,dog,cat,fish
        a,21,31,41
        b,22,na,42
        c,23,33,43
        
        . import delimited "~/Downloads/example.csv", varnames(1) stringcols(_all) clear
        (encoding automatically selected: ISO-8859-2)
        (4 vars, 3 obs)
        
        . list, clean
        
               id   dog   cat   fish  
          1.    a    21    31     41  
          2.    b    22    na     42  
          3.    c    23    33     43  
        
        . local strings id
        
        . ds `strings', not
        dog   cat   fish
        
        . local numeric `r(varlist)'
        
        . foreach var of local numeric {
          2.     display _newline "Variable: `var'"
          3.     tab `var' if missing(real(`var'))
          4. }
        
        Variable: dog
        no observations
        
        Variable: cat
        
                cat |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                 na |          1      100.00      100.00
        ------------+-----------------------------------
              Total |          1      100.00
        
        Variable: fish
        no observations
        
        .
        I would not import any dataset too large to scan visually without first doing an exploration like this; importing without such a check would be unprofessional.

        Step 2 would be
        Code:
        replace cat = ".n" if cat=="na"
        so that when it is converted to real, the result will be the missing value .n, which is what I use to indicate "na" values in my data. If "na" were common to multiple variables, I would just loop over the entire list of variables and do that to each, as sketched below.
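        A minimal sketch of that loop, reusing the `numeric' local defined in the log above:
        Code:
        foreach var of local numeric {
            replace `var' = ".n" if `var' == "na"
        }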

        Step 3 would be
        Code:
        destring `numeric', replace
        I don't see a lot of manual effort here.

        • #5
          Thank you again for your attention to this, very helpful!
