Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Coding variables to make the database lighter

    Hi, I'm working with a big database (15 million of observations, 2 gb) and I would like to make it as lighter as possible. The only ways that I come up to do that is code string variables (i.e. gender: "male" vs. "female" as 0 and 1). However it's unclear to me which role values' labels plays. Does a label 0 = "Male", makes it heaver? I'm asking this because I have for example a variable "country" with more than 200 possible values. Furthermore, is there any other trick or thing to consider in order to make a database lighter?

    Cheers

  • #2
    In any real dataset, using value labels to encode variables as in your example of gender is going to save you disk space. The dataset needs a small bit of space to store the actual labels, but there are big cost savings by replacing strings with labelled numeric values. Stata is pretty good about compressing numeric values to be the smallest container they need, as well as shrinking strings when they are created to be longer than the longest string. You can read about -compress- which will do this.

    Comment


    • #3
      Great thanks, I'll read about compress. just to know if I got it, then it's better to have 0 and 1 labeled rather than "male" and "female", right?
      Last edited by Jean Jacques; 23 Mar 2021, 10:47.

      Comment


      • #4
        I have found from experience that just trying and seeing what happens in Stata takes less time that soul searching and reading the manuals. Here:

        Code:
        . webuse nlswork, clear
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . keep race
        
        . tab race
        
               race |      Freq.     Percent        Cum.
        ------------+-----------------------------------
              white |     20,180       70.72       70.72
              black |      8,051       28.22       98.94
              other |        303        1.06      100.00
        ------------+-----------------------------------
              Total |     28,534      100.00
        
        . des
        
        Contains data from http://www.stata-press.com/data/r15/nlswork.dta
          obs:        28,534                          National Longitudinal Survey.  Young Women 14-26 years of
                                                        age in 1968
         vars:             1                          27 Nov 2016 08:14
         size:        28,534                          
        ----------------------------------------------------------------------------------------------------------
                      storage   display    value
        variable name   type    format     label      variable label
        ----------------------------------------------------------------------------------------------------------
        race            byte    %8.0g      racelbl    race
        ----------------------------------------------------------------------------------------------------------
        Sorted by: 
             Note: Dataset has changed since last saved.
        
        . gen newrace = "White"
        
        . replace newrace = "Black" if race==2
        (8,051 real changes made)
        
        . replace newrace = "Other" if race==3
        (303 real changes made)
        
        . keep newrace
        
        . compress
          (0 bytes saved)
        
        . des
        
        Contains data from http://www.stata-press.com/data/r15/nlswork.dta
          obs:        28,534                          National Longitudinal Survey.  Young Women 14-26 years of
                                                        age in 1968
         vars:             1                          27 Nov 2016 08:14
         size:       142,670                          
        ----------------------------------------------------------------------------------------------------------
                      storage   display    value
        variable name   type    format     label      variable label
        ----------------------------------------------------------------------------------------------------------
        newrace         str5    %9s                   
        ----------------------------------------------------------------------------------------------------------
        Sorted by: 
             Note: Dataset has changed since last saved.
        
        .

        Comment


        • #5
          Yes. The numbering you choose can change depending on how you will use them for analysis (for example, you could code gender as 1/2 instead of 0/1), but the effect is the same in terms of dataset size and storage.

          Originally posted by Jean Jacques View Post
          Great thanks, I'll read about compress. just to know if I got it, then it's better to have 0 and 1 labeled rather than "male" and "female", right?

          Comment

          Working...
          X