Listing only a small number of observations with a certain criteria

bsc.j.j.w

Join Date: May 2014

Posts: 50
#1


16 Aug 2014, 11:32

Dear all,

Sometimes I need to quickly list a small number of examples satisfying certain criteria. Normally I would do it like this:

Code:

list name year if age < 20

The problem is that I get a huge list and constantly have to press q to cut it off. What is the standard way to list just the first 20 people who satisfy these conditions? Please note the subtle difference between the first 20 observations and the first 20 observations that satisfy the condition.

Thanks!
ben earnhart

Join Date: May 2014

Posts: 1027
#2

16 Aug 2014, 11:45

Code:

list name year in 1/30 if age <20

is a really simple approach that will get you nearly what you want. It lists the observations among the first 30 that match your condition, so depending on the data you might get only five cases, or you might get twelve. If this does the job, great.

A longer way that will get you exactly what you want:

Code:

preserve
keep if age<20
list name year in 1/20
restore
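
One caveat with the second approach: -in 1/20- stops with an "observation numbers out of range" error if fewer than 20 observations remain after the -keep-. A sketch of a guarded version, shown on the auto dataset shipped with Stata so it runs as is (the dataset, the variables, and the condition rep78 == 1 are stand-ins for illustration, not part of the original question):

```stata
sysuse auto, clear

preserve
* keep only the observations that satisfy the condition (few cars match here)
keep if rep78 == 1

* never ask for more observations than are left in memory
local n = min(20, _N)

* skip the listing entirely when nothing matched
if `n' > 0 list make rep78 in 1/`n'
restore
```

The data are still dropped and restored, so this only makes the approach robust; it does not avoid the round trip.
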
bsc.j.j.w

Join Date: May 2014

Posts: 50
#3

16 Aug 2014, 11:49

Ah yes, destroying the data and then restoring it. I was actually hoping for a method that did not do that or does that as a one liner. The first method you proposed cannot be done in my case because I'm not sure what the interval is where I should look. In addition if my interval is too big, I could find up to thousands of observations matching my criteria.

Thank you for your help Ben!
ben earnhart

Join Date: May 2014

Posts: 1027
#4

16 Aug 2014, 12:00

Well, maybe somebody will chime in with a better approach. But shouldn't be too bad to do the first approach trial-and-error, starting small and bumping the interval up until you get what you want. Of course, if the data gets sorted, it's not replicable, so not ideal.
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#5

16 Aug 2014, 12:56

This depends a great deal on exactly what you're trying to do.

For this example if you just want to see some cases where age <20 you have some choices.
One is to sort on age and then use -list if age<20 in 1/30- (or whatever number of observations you want to see). The problem then is that you'll get the smallest ages first, and if you think whatever you're looking for in the data might vary by age, you probably don't want that. Instead, you could sort the observations meeting your criteria randomly and then list the first however many to get a broader view of the data.

Code:

gen randsort=runiform() if age<20
sort randsort
list in 1/30

This is totally replicable if you first sort on a unique combination of variables and then set the seed before generating the random sort variable.
Of course it's still totally ad hoc, so the question of replicable vs. not may not be the most salient issue. If you're doing this as part of a debugging process the main issue is making sure you have a clear record of what decisions you made when writing your code and why. Which exact cases you looked at to make those decisions may or may not be relevant depending on the exact nature of the task.
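
A sketch of the replicable version described above, using the auto dataset that ships with Stata so it can be run as is. The seed value, the cutoff mpg < 20, and the choice of make as the unique sort key are assumptions for illustration:

```stata
sysuse auto, clear

* make uniquely identifies observations in auto.dta; isid stops with an error if it does not
isid make
sort make

* fix the seed so the random draws are the same on every run
set seed 2014
gen double randsort = runiform() if mpg < 20

* missing values sort last, so the matching observations come first
sort randsort

* the extra condition guards against listing non-matching rows when fewer than 10 match
list make mpg in 1/10 if !missing(randsort)
```

Note that this leaves the data sorted by randsort, so re-sort afterwards if later code depends on the original order.
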

Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

16 Aug 2014, 13:20

This is something that I bump into regularly, and I prefer a solution that does not change the data and does not require sorting, because sorting can be slow when you have lots of observations. Here's a little program that does just that:

Code:

*! version 1.0.0, 16aug2014, Robert Picard, [email protected]      
program listsome

    version 11
    
    syntax [varlist] [if] [in] , ///
    [ ///
    MAXimum(integer 20) ///
    RANDom ///
    * ]
    
    marksample touse, novarlist
    qui count if `touse'
    local ntouse = r(N)
    
    if `ntouse' > `maximum' {
        
        if "`random'" != "" {
        
            tempvar random
            gen `random' = runiform() if `touse'
            
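            * start at the sampling fraction expected to yield about the requested number of matches
            local target = `maximum' / `ntouse'
            local more 1
            * raise the threshold one step at a time until enough observations fall below it
            while `more' {
                qui count if `touse' & `random' < `target'
                if r(N) < `maximum' {
                    local target = `target' + 1 / `ntouse'
                }
                else local more 0
            }
            
            * keep only marked observations whose draw is at or below the threshold;
            * unmarked observations have a missing draw, which never satisfies the comparison
            qui replace `touse' = `touse' * `random' <= `target'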
            
        }

        list `varlist' if `touse' & sum(`touse') <= `maximum', `options'

    }
    else {
    
        list `varlist' if `touse', `options'
        
    }
    
end

I find the -random- option particularly useful when using regular expressions on strings with millions of observations, when I want a sample of what changes across a universe of observations that I cannot fully inspect visually. Here are some examples of how to use -listsome-:

Code:

sysuse auto, clear

listsome price mpg rep78 if foreign, max(3)

* repeat but select observations at random
listsome price mpg rep78 if foreign, max(3) random

* repeat and add -list- options
listsome price mpg rep78 if foreign, max(6) random noobs sep(3)

* list variable differences
gen make2 = subinstr(make,".","",.)
listsome make* if make != make2, random max(3)
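
For a quick interactive check without installing anything, the non-random case can be sketched as a one-liner that reuses the running-sum condition from the -list- call inside the program. This assumes the condition is a 0/1 expression with no missing values; like the program, it neither changes nor sorts the data:

```stata
sysuse auto, clear

* sum() keeps a running count of matches; once it passes 3, no further rows are listed
list make price mpg if foreign & sum(foreign) <= 3
```
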

I'll put together a help file and upload it to SSC unless someone points out that another program already does this.

Last edited by Robert Picard; 16 Aug 2014, 13:47.


Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#7

16 Aug 2014, 19:36

There is another approach that will work if you just want to see some of the data but don't actually need to display the list in your output log. You can do -browse name year if age < 20-. The browser window will open with just those variables and just those observations. You'll see one screen's worth, and if you want to see more you can scroll down.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#8

18 Aug 2014, 12:54

Yes indeed, browse is usually the way to go when working interactively with data. But if you take an action based on what you just browsed, it makes sense to include in the log file a small sample of the values you observed before implementing the action. This adds transparency to the record.

For example, when doing data cleaning, you want to leave a record of everything you do that changes the original data. A desirable workflow would include:

1. listing a number of observations that illustrate the problem at hand;
2. making a copy of the original variable to preserve the original;
3. implementing the solution, usually using a replace statement;
4. listing a number of observations that show the results of the change.
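
The four steps above can be sketched with built-in commands on the auto dataset. The cleaning task (stripping periods from make) follows the earlier example in #6; the variable name make_orig and the 1/20 range are assumptions for illustration:

```stata
sysuse auto, clear

* 1. list a few observations that illustrate the problem (periods in make)
list make in 1/20 if strpos(make, ".") > 0

* 2. copy the original variable so the raw values are preserved
clonevar make_orig = make

* 3. implement the fix with replace
replace make = subinstr(make, ".", "", .)

* 4. list a few changed observations to document the result
list make_orig make in 1/20 if make != make_orig
```

With a log file open, steps 1 and 4 leave a before-and-after record next to the -replace- that made the change.
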

A slightly modified (from the one in #6) version of listsome is now available for download from SSC. See the announcement here.
