  • Listing only a small number of observations with a certain criteria

    Dear all,

    sometimes I need to quickly list a small number of examples satisfying certain criteria. Normally I would do it like this:

    Code:
    list name year if age < 20
    The problem is that I get a huge list and constantly need to press q to cut it off. What is the official method if I just need to list the first 20 people who satisfy these conditions? Please note the subtle difference between the first 20 observations and the first 20 observations that satisfy this condition.

    Thanks!

  • #2
    Code:
    list name year in 1/30 if age <20
    is a really simple approach that will get you nearly what you want. It will list those in the first 30 observations that match your condition. So depending, you might only get five cases, you might get twelve. If this does the job, great.

    A longer way that will get you exactly what you want:
    Code:
    preserve
    keep if age<20
    list name year in 1/20
    restore


    • #3
      Ah yes, destroying the data and then restoring it. I was actually hoping for a method that does not do that, or that does it as a one-liner. The first method you proposed will not work in my case because I'm not sure which interval I should look in. In addition, if my interval is too big, I could find up to thousands of observations matching my criteria.

      Thank you for your help Ben!
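
      For a one-liner that leaves the data untouched, one possibility is to put a running sum in the if qualifier, which is the same device the listsome program in #6 relies on. This is only a sketch: it uses the auto dataset shipped with Stata, with foreign standing in for age < 20 and 5 standing in for 20, both chosen for illustration.

      ```stata
      sysuse auto, clear

      * list the first 5 observations, in the current sort order, that satisfy the condition
      * sum() is a running sum, so it counts matching observations seen so far
      list make price if foreign == 1 & sum(foreign == 1) <= 5

      * with the variables from this thread, the same idea would read:
      * list name year if age < 20 & sum(age < 20) <= 20
      ```

      Nothing is dropped or sorted, so no preserve/restore is needed; the cases shown still depend on the current sort order.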


      • #4
        Well, maybe somebody will chime in with a better approach. But it shouldn't be too bad to do the first approach by trial and error, starting small and bumping the interval up until you get what you want. Of course, if the data gets sorted, it's not replicable, so not ideal.


        • #5
          This depends a great deal on exactly what you're trying to do.

          For this example if you just want to see some cases where age <20 you have some choices.
          One is to sort on age and then use -list if age<20 in 1/30- (or whatever number of observations you want to see). The problem then is that you'll get the smallest ages first, and if you think whatever you're looking for in the data might vary by age, you probably don't want that. You could instead sort the observations meeting your criteria randomly and then list the first however many to get a broader slice of the data.

          Code:
          gen randsort=runiform() if age<20
          sort randsort
          list in 1/30
          This is totally replicable if you first sort on a unique combination of variables and then set the seed before generating the random sort variable.
          Of course it's still totally ad hoc, so the question of replicable vs. not may not be the most salient issue. If you're doing this as part of a debugging process the main issue is making sure you have a clear record of what decisions you made when writing your code and why. Which exact cases you looked at to make those decisions may or may not be relevant depending on the exact nature of the task.
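
          A minimal sketch of the replicable version described above, using the auto dataset shipped with Stata. The seed value, the use of make as the unique sort key, and the foreign condition standing in for age<20 are all assumptions made for illustration.

          ```stata
          sysuse auto, clear

          isid make                 // confirm that make uniquely identifies observations
          sort make                 // fix the starting order so the draw is reproducible
          set seed 20140816         // arbitrary seed, chosen for illustration

          * random key only for matching observations; the rest get missing
          gen double randsort = runiform() if foreign == 1
          sort randsort             // missing values sort last, so matching cases come first
          list make price mpg in 1/5
          ```

          Rerunning these lines from the top should list the same cars each time. Note that this changes the sort order of the data in memory and adds a variable.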


          • #6
            This is something that I bump into regularly, and I prefer a solution that does not change the data and does not require sorting, because sorting can be slow when you have lots of observations. Here's a little program that does just that:

            Code:
            *! version 1.0.0, 16aug2014, Robert Picard, [email protected]      
            program listsome
            
                version 11
                
                syntax [varlist] [if] [in] , ///
                [ ///
                MAXimum(integer 20) ///
                RANDom ///
                * ]
                
                marksample touse, novarlist
                qui count if `touse'
                local ntouse = r(N)
                
                if `ntouse' > `maximum' {
                    
                    if "`random'" != "" {
                    
                        tempvar random
                        gen `random' = runiform() if `touse'
                        
                        local target = `maximum' / `ntouse'
                        local more 1
                        while `more' {
                            qui count if `touse' & `random' < `target'
                            if r(N) < `maximum' {
                                local target = `target' + 1 / `ntouse'
                            }
                            else local more 0                
                        }
                        
                        qui replace `touse' = `touse' * `random' <= `target'
                        
                    }
            
                    list `varlist' if `touse' & sum(`touse') <= `maximum', `options'
            
                }
                else {
                
                    list `varlist' if `touse', `options'
                    
                }
                
            end
            I find the -random- option particularly useful when using regular expressions on strings with millions of observations, where I want a sample of what changes across a universe of observations that I cannot fully inspect visually. Here are some examples of how to use -listsome-:

            Code:
            sysuse auto, clear
            
            listsome price mpg rep78 if foreign, max(3)
            
            * repeat but select observations at random
            listsome price mpg rep78 if foreign, max(3) random
            
            * repeat and add -list- options
            listsome price mpg rep78 if foreign, max(6) random noobs sep(3)
            
            * list variable differences
            gen make2 = subinstr(make,".","",.)
            listsome make* if make != make2, random max(3)
            I'll put together a help file and upload it to SSC unless someone points out that another program already does this.
            Last edited by Robert Picard; 16 Aug 2014, 13:47.


            • #7
              There is another approach that will work if you just want to see some of the data but don't actually need to display the list in your output log. You can do -browse name year if age < 20-. The browser window will open with just those variables and just those observations. You'll see one screen's worth, and if you want to see more you can scroll down.


              • #8
                Yes indeed, browse is usually the way to go when working interactively with data. But if you take an action based on what you just browsed, it makes sense to include in the log file a small sample of the values you observed before implementing the action. This adds transparency to the record.

                For example, when doing data cleaning, you want to leave a record of everything you do that changes the original data. A desirable workflow would include
                • listing a number of observations that illustrates the problem at hand;
                • making a copy of the original variable to preserve the original;
                • implementing the solution, usually using a replace statement;
                • listing a number of observations that show the results of the change.
                A slightly modified (from the one in #6) version of listsome is now available for download from SSC. See the announcement here.
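
                The four workflow steps listed above could be logged roughly as follows. This is a sketch using only built-in commands and the auto dataset; the cleaning task (removing periods from make), the name make_orig, and the cap of 5 listed observations are assumptions for illustration.

                ```stata
                sysuse auto, clear

                * 1. list a few observations that illustrate the problem
                list make if strpos(make, ".") > 0 & sum(strpos(make, ".") > 0) <= 5

                * 2. keep a copy of the original variable
                clonevar make_orig = make

                * 3. implement the solution with a replace statement
                replace make = subinstr(make, ".", "", .)

                * 4. list a few observations that show the result of the change
                list make_orig make if make != make_orig & sum(make != make_orig) <= 5
                ```

                Because everything goes through -list-, both the before and after samples end up in the log file.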
