Compress doesn't work

Wenting Li

Join Date: Jun 2019
Posts: 11

29 Apr 2020, 06:59

Hi,
I tried to use the -compress- command to save space, but it didn't work. Here is a description of my variables; apparently a lot of space could be saved. Is there any way to compress all the variables?
I didn't set the formats; the dataset was converted directly from a .csv file.
Any code or link about this problem would be much appreciated.

Code:

 . compress
  (0 bytes saved)

. de
                          
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
id              str184  %184s                 id
type            str91   %91s                  type
city            str46   %46s                  city

Tags: string, syntax

Rich Goldstein

Join Date: Mar 2014

Posts: 4454
#2

29 Apr 2020, 07:17

it is not clear why you think that "there are many spaces [that] can be saved" but here is my guess: you have "extra" spaces in your string variables; if that is the case, you don't want -compress- as that will not do what you think it does; rather you want -trim- and possibly some of its relatives; see

Code:

help string functions

and search for "trim" - you will find several functions but I'm not sure what will be best for your data as you don't show any examples (please read the FAQ on using -dataex- to show examples)
FernandoRios

Join Date: Apr 2014

Posts: 2457
#3

29 Apr 2020, 07:20

Hi Wenting.
The problem is not that compress "doesn't work". It is about how your data are stored.
Your variables are stored with 184, 91 and 46 characters. Possibly there are blank spaces in there, but "compress" does not eliminate those extra spaces.
You need to review your data and eliminate the extra spaces from your variables first, before you try using "compress".
search string trim
That may direct you to commands that help eliminate those "extra" spaces, assuming that is the reason why those variables are so "long".
Nick Cox

Join Date: Mar 2014

Posts: 35589
#4

29 Apr 2020, 08:33

Code:

foreach s in id type city {
    replace `s' = trim(itrim(`s'))
}
compress

Wenting Li

Join Date: Jun 2019
Posts: 11

29 Apr 2020, 08:49

Thank you all for your advice. Here is my example data. I have tried to use "stritrim", "strltrim" and "strrtrim" to eliminate extra spaces, but variable "id" only changed from str184 to str181. In theory, "id" should be less than 20 characters long.

The dataset contains 15,760,870 observations, so I can't go through every one. Is there any way to locate the longest value of "id" and of the other variables? Maybe I need to check them manually.
Or any other better way to deal with it? Many thanks!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str184 id str91 type str46 city
"33336822635" "C" ""                  
"3737279531"  "I" "webster"           
"30798963917" "C" ""                  
"33217813693" "C" "waco"              
"30799007485" "C" "clinton"           
"30798877930" "C" "st augustine"      
"30798958797" "C" "adams"             
"54101317357" "I" "minnetonka"        
"33336822629" "C" ""                  
"54027277569" "I" "brighton"          
"30726443881" "C" "davenport"         
"32848919014" "C" ""                  
"33336822624" "C" ""                  
"53710257403" "I" "salem"             
"53490200887" "I" "elk grove"         
"33060259151" "C" ""                  
"30828505910" "C" "jacksonville beach"
"33336822533" "C" ""                  
"82773579332" "I" "san antonio"       
"3846112566"  "C" "warren"            
"27562499375" "C" "chicago"           
"56066164828" "I" "falls church"      
"30685224451" "C" "missouri valley"   
"3015008009"  "I" "fargo"             
"33336822546" "C" ""                  
"3276175705"  "I" "bothell"           
"31236442721" "C" ""                  
"30814398649" "C" "houston"           
"3846112564"  "C" "greenville"        
"80287187312" "I" "casper"            
"3842859976"  "C" "escanaba"          
"53080023719" "I" ""                  
"53822464151" "I" "albuquerque"       
"52495763249" "I" "brooklyn"          
"30685228980" "C" ""                  
"32842857303" "C" "tustin"            
"52552800717" "I" "macedon"           
"53612215908" "I" "port orange"       
"4096317607"  "I" "marietta"          
"53305166800" "I" "eagle mountain"    
"33336822576" "C" ""                  
"2613838129"  "I" ""                  
"30811105498" "C" "milwaukee"         
"53151095332" "I" "watertown"         
"32842890024" "C" ""                  
"30814398624" "C" "syracuse"          
"2269641906"  "I" "sweetwater"        
"30798956552" "C" ""                  
"4187397571"  "I" "laguna niguel"     
"3186122473"  "I" "pace"              
end

Code:

. replace id=stritrim(id)
(26 real changes made)

. compress
  variable id was str184 now str181
  (47,282,610 bytes saved)

. replace id=strltrim(id)
(3 real changes made)

.  replace id=strrtrim(id)
(0 real changes made)

. compress
  (0 bytes saved)


Wenting Li

Join Date: Jun 2019

Posts: 11
#6

29 Apr 2020, 09:03

Many thanks Nick! Here is the result. There must be some outliers in my data. It is a big dataset and I can't go through every observation. Is there any code to locate the longest value of variable "id", or a link about dealing with this kind of question? Thanks a lot!

Code:

. foreach s in id type city {
  2. . replace `s' = trim(itrim(`s'))
  3. . }
(26 real changes made)
(0 real changes made)
(0 real changes made)
(7,789 real changes made)

.
.
. compress
  variable bonicacid was str184 now str181
  (47,282,610 bytes saved)

.
Nick Cox

Join Date: Mar 2014

Posts: 35589
#7

29 Apr 2020, 09:23

So, id shouldn't be longer than 20.

Code:

count if strlen(id) > 20

tells you how many observations exceed that. That's the first step. If there are very many of them, some misunderstanding might be involved.

Code:

list id if strlen(id) > 20
edit id if strlen(id) > 20

lets you look at their values. More positively, the problem could be something very specific, such as metadata read in from the first or last records in a spreadsheet.

As a long-term user of Stata, I still tend to type length() not strlen(): the former is now undocumented, but still works.
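
To answer the question about locating the longest values directly, here is a minimal sketch along the same lines. It assumes the variable is called id as in the example data; len_id and maxlen are hypothetical names introduced only for illustration.

Code:

* len_id is a hypothetical helper variable holding the length of each id
generate int len_id = strlen(id)

* find the maximum length without printing the full summary
summarize len_id, meanonly
local maxlen = r(max)
display "longest id has `maxlen' characters"

* show the observation(s) with the longest id
list id len_id if len_id == `maxlen'

* see how the suspiciously long values are distributed
tabulate len_id if len_id > 20

Run this from a do-file or in one go, since the local macro maxlen only lives within a single execution. The same steps should work for type and city after swapping the variable name and choosing a plausible length threshold.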
Wenting Li

Join Date: Jun 2019

Posts: 11
#8

29 Apr 2020, 22:11

Thanks Nick! Your code works perfectly (There are 50 observations reading other columns' content into one variable)
Many thanks for your help! Have a nice day🙂