Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reducing the size of your database.

    Dear all,
    I am having troubles in "cleaning" my database, whose initial size is 22GBs (71 millions rows and 9 variables - 7 of which are strL variables).
    I first run this code to detect which observation of my string variables are made of a "crazy high number" of characters.

    Code:
    * Tabulate number of characters of each obs for each string variable:
    foreach var of varlist _all {
        local variable_type: type `var'
        if substr("`variable_type'",1,3)=="str" {
            
            display in smcl as text "`var'"
            gen long l_var = length(`var') if !missing(`var')
            tab1 l_var, m
            drop l_var
        }
    }
    From the output of the above code, I learn that only very few observations per strL variable are made of a "crazy high number" of characters. For instance, for one of my string variables, 99.9999999% of its observations have less than 30 characters, while the remaining ones have 2000 characters).
    The presence of those few "crazy" observations force Stata to store that variable in strL format, even if the most of the observations could have been stored with a less heavy format, say str30, thus saving disk space and boosting efficiency.

    Therefore, I run the following code to detect and drop those observations which are "crazy", that is, whose number of characters exceed a certain arbitrary threshold.
    Code:
    gen long l_var = length(var1) if !missing(var1)
    drop if l_var>85 & l_var !=.
    gen str85 X = var1
    drop var1 l_var
    rename X var1
    Then I save my "cleaned" database. Outcome:
    1. The database now weighs more than the initial one, namely 27GBs.
    2. The database rather than having missing values has empty cells.
    Anybody knows what is going on?

    Any advice and help is much appreciated.
    Cordially,
    Edoardo

  • #2
    Instead of your second block of code, try this:

    Code:
    gen long l_var = length(var1) if !missing(var1)
    drop if l_var > 85 & !missing(l_var)
    compress l_var
    The compress command will, I think do the trick for you.

    Comment


    • #3
      Hi Clyde,
      thanks for the heads-up. It works! I was not aware of the "compress" command.

      Best,
      Edoardo

      Comment

      Working...
      X