  • Compress doesn't work

    Hi,
    I tried to use the -compress- command to save space, but it didn't work. Here is the description of my variables; apparently a lot of space could be saved. Is there any way to compress all the variables?
    I didn't set the formats. The dataset was converted directly from a .csv file.
    Any code or link about this problem would be much appreciated.

    Code:
     . compress
      (0 bytes saved)
    
    . de
                              
    -------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------------------------------------------------------------
    id              str184  %184s                 id
    type            str91   %91s                  type
    city            str46   %46s                  city

  • #2
    It is not clear why you think that "there are many spaces [that] can be saved", but here is my guess: you have "extra" spaces in your string variables. If that is the case, you don't want -compress-, as it will not do what you think it does; rather, you want -trim- and possibly some of its relatives. See
    Code:
    help string functions
    and search for "trim". You will find several functions, but I'm not sure which will be best for your data, as you don't show any examples (please read the FAQ on using -dataex- to show examples).

    • #3
      Hi Wenting.
      The problem is not that compress "doesn't work"; it is about how your data are stored.
      Your variables are stored with 184, 91 and 46 characters. Possibly there are blank spaces in there, but "compress" does not eliminate those extra spaces.
      You need to review your data and eliminate the extra spaces in your variables before you try using "compress".
      search string trim
      That may direct you to functions that can help eliminate those "extra" spaces, assuming that is the reason why those variables are so "long".
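To illustrate the point about stored width, here is a minimal sketch on made-up toy values (not the poster's data; the variable `name` and its contents are hypothetical). It assumes Stata 14 or later for strtrim() and stritrim(). -compress- shrinks a str# variable only to the length of its longest stored value, and blanks count toward that length.

```stata
* Toy data: hypothetical values, for illustration only
clear
set obs 3
generate str40 name = "waco"
replace name = "st augustine" + "   " in 2   // three trailing blanks
replace name = "  clinton" in 3              // two leading blanks

compress
describe name     // str15 expected: 12 characters plus 3 trailing blanks

* Remove leading, trailing and repeated internal blanks, then compress again
replace name = strtrim(stritrim(name))
compress
describe name     // str12 expected: the longest trimmed value
```

If the width barely changes after trimming, as later in this thread, the length is coming from real content rather than from blanks.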

      • #4
        Code:
        foreach s in id type city { 
             replace `s' = trim(itrim(`s')) 
        } 
        
        compress

        • #5
          Thank you all for your advice. Here is my example data. I have tried "stritrim", "strltrim" and "strrtrim" to eliminate extra spaces, but variable "id" only changed from str184 to str181. In theory, "id" should be shorter than 20 characters.

          The dataset contains 15,760,870 observations, so I can't go through every observation. Is there any way to locate the longest value of "id" and of the other variables? Maybe I need to check them manually.
          Or is there a better way to deal with it? Many thanks!
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str184 id str91 type str46 city
          "33336822635" "C" ""                  
          "3737279531"  "I" "webster"           
          "30798963917" "C" ""                  
          "33217813693" "C" "waco"              
          "30799007485" "C" "clinton"           
          "30798877930" "C" "st augustine"      
          "30798958797" "C" "adams"             
          "54101317357" "I" "minnetonka"        
          "33336822629" "C" ""                  
          "54027277569" "I" "brighton"          
          "30726443881" "C" "davenport"         
          "32848919014" "C" ""                  
          "33336822624" "C" ""                  
          "53710257403" "I" "salem"             
          "53490200887" "I" "elk grove"         
          "33060259151" "C" ""                  
          "30828505910" "C" "jacksonville beach"
          "33336822533" "C" ""                  
          "82773579332" "I" "san antonio"       
          "3846112566"  "C" "warren"            
          "27562499375" "C" "chicago"           
          "56066164828" "I" "falls church"      
          "30685224451" "C" "missouri valley"   
          "3015008009"  "I" "fargo"             
          "33336822546" "C" ""                  
          "3276175705"  "I" "bothell"           
          "31236442721" "C" ""                  
          "30814398649" "C" "houston"           
          "3846112564"  "C" "greenville"        
          "80287187312" "I" "casper"            
          "3842859976"  "C" "escanaba"          
          "53080023719" "I" ""                  
          "53822464151" "I" "albuquerque"       
          "52495763249" "I" "brooklyn"          
          "30685228980" "C" ""                  
          "32842857303" "C" "tustin"            
          "52552800717" "I" "macedon"           
          "53612215908" "I" "port orange"       
          "4096317607"  "I" "marietta"          
          "53305166800" "I" "eagle mountain"    
          "33336822576" "C" ""                  
          "2613838129"  "I" ""                  
          "30811105498" "C" "milwaukee"         
          "53151095332" "I" "watertown"         
          "32842890024" "C" ""                  
          "30814398624" "C" "syracuse"          
          "2269641906"  "I" "sweetwater"        
          "30798956552" "C" ""                  
          "4187397571"  "I" "laguna niguel"     
          "3186122473"  "I" "pace"              
          end
          Code:
          . replace id=stritrim(id)
          (26 real changes made)
          
          . compress
            variable id was str184 now str181
            (47,282,610 bytes saved)
          
          . replace id=strltrim(id)
          (3 real changes made)
          
          .  replace id=strrtrim(id)
          (0 real changes made)
          
          . compress
            (0 bytes saved)

          • #6
            Many thanks Nick! Here is the result. There must be some outliers in my data. It is a big dataset and I can't go through every observation. Is there any code to locate the longest value of variable "id", or a link about dealing with this kind of problem? Thanks a lot!
            Code:
            . foreach s in id  type city { 
              2. 
            .      replace `s' = trim(itrim(`s')) 
              3. 
            . } 
            (26 real changes made)
            (0 real changes made)
            (0 real changes made)
            (7,789 real changes made)
             
            . 
            . 
            . compress
              variable bonicacid was str184 now str181
              (47,282,610 bytes saved)
            
            .

            • #7
              So, id shouldn't be longer than 20.

              Code:
              count if strlen(id) > 20
              tells you how many observations exceed that. That's the first step. If there are very many of them, some misunderstanding might be involved.

              Code:
              list id if  strlen(id) > 20 
              
              edit id  if strlen(id)  > 20
              lets you look at their values. More positively, the problem could be something very specific, such as metadata read in from the first or last records in a spreadsheet.

              As a long-term user of Stata, I still tend to type length(), not strlen(): the former is now undocumented, but still works.
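For the earlier question about locating the longest value directly, a sketch along the same lines may help. The toy data below are made up (the third value imitates a record where other columns ran into `id`), and `idlen` and `maxlen` are names introduced only for this illustration.

```stata
* Toy data: hypothetical values, not the poster's dataset
clear
input str60 id
"33336822635"
"3737279531"
"30798963917,C,waco,extra text from other columns"
end

* strlen() counts bytes; for plain ASCII ids that equals the character count
generate int idlen = strlen(id)

* Store the maximum length, then list the observation(s) that attain it
summarize idlen, meanonly
local maxlen = r(max)
list id idlen if idlen == `maxlen'

* The distribution of lengths shows whether long values are rare outliers
tabulate idlen
```

On a dataset with millions of observations, the tabulate step is usually the quickest way to see whether only a handful of records are too long.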

              • #8
                Thanks Nick! Your code works perfectly. (In 50 observations, content from other columns had been read into one variable.)
                Many thanks for your help! Have a nice day🙂
