  • Using seed with sample command

    I am trying to draw a simple random sample without replacement. My dataset has six variables: Region, Department, Village, School, School ID, and Number of School Members. I want to select one observation (school) per village. I can obtain a sample, but setting the seed does not give me the same sample every time. Can someone help me fix the problem?
    Here is the command I am using:

    sort Villages
    set seed 7654
    by Villages: sample 1, count

  • #2
    Welcome to the Stata Forum / Statalist.

    Please take some time to read the FAQ, particularly on how to share data/command/output.

    That being said, here is a toy example in which I keep getting the same sample:

    Code:
    . sysuse auto
    (1978 Automobile Data)
    
    . preserve
    
    . set seed 1234
    
    . by foreign, sort: sample 1, count
    (72 observations deleted)
    
    . list mpg price foreign
    
         +------------------------+
         | mpg   price    foreign |
         |------------------------|
      1. |  18   4,516   Domestic |
      2. |  17   9,690    Foreign |
         +------------------------+
    
    . restore
    
    . set seed 1234
    
    . by foreign, sort: sample 1, count
    (72 observations deleted)
    
    . list mpg price foreign
    
         +------------------------+
         | mpg   price    foreign |
         |------------------------|
      1. |  18   4,516   Domestic |
      2. |  17   9,690    Foreign |
         +------------------------+
    Best regards,

    Marcos



    • #3
      How about

      Code:
      sort Villages
      set seed 7654
      sample 1, count by(Villages)
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Actually, I think your original code works too. Are you sure that it doesn't? Paste your actual code and output so we can see what you are talking about.

        If you, say, ran it once, then reopened the data set, you would need to set the seed again to reproduce the results.



        • #5
          Hmm! I can replicate this problem (using auto.dta and sampling by rep78). After experimenting a bit, the problem seems to be the use of the -by- prefix. If you drop that and use the -by()- option instead, you will get reproducible results. You also need to get rid of the -sort Villages- command.

          Here's what's going on:

          Whenever Stata -sort-s data on a sort key that does not uniquely identify the observations in the data, the observations are in sort order on the sort key but are randomized within those sort key-defined groups. This randomization is not reproducible. So your -sort Villages- command effectively shuffles the data into a partially random order, and each time you run -sample- you get different results because the results of -sample- are based, in part, on the order of the data. You cannot resolve this problem by just removing -sort Villages-: then Stata will reject the -by- prefix due to the lack of sorting. If you incorporate the sort into -by Villages, sort:- or -bysort Villages-, you just have the same problem: you are randomizing the order of the data before -sample- itself gets its hands on it, and -sample- doesn't seem to be able to overcome that.
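
          A minimal sketch of the tie-breaking behavior described above, using auto.dta and meant to be run from a do-file. The variables orig, pos1, and pos2 are hypothetical helpers introduced only for this illustration. It sorts on rep78 twice, starting from the same order each time, and counts how many observations end up in a different position:

          ```stata
          * Sketch: sorting twice on a non-unique key need not give the same order.
          sysuse auto, clear
          gen long orig = _n        // remember the original order
          sort rep78                // ties within rep78 are broken arbitrarily
          gen long pos1 = _n        // position after the first sort
          sort orig                 // restore the original order
          sort rep78                // repeat the same sort
          gen long pos2 = _n        // position after the second sort
          count if pos1 != pos2     // observations that landed in a different place
          ```

          Because rep78 does not uniquely identify observations, the count will typically be nonzero, and any command whose result depends on the order of the data inherits that variation.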

          Using
          Code:
          set seed 7654
          sample 1, count by(Villages)
          will give you reproducible sampling.
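
          One way to check this yourself, sketched here on auto.dta (sampling by rep78 rather than Villages) and assuming it is run from a do-file so that the temporary file name survives between commands: draw the sample twice with the same seed and let -cf- compare the two results.

          ```stata
          * Sketch: draw the same by() sample twice and compare the results.
          sysuse auto, clear
          set seed 7654
          sample 1, count by(rep78)
          keep make
          sort make                 // make is unique in auto.dta, so this order is determinate
          tempfile first
          save `first'

          sysuse auto, clear
          set seed 7654
          sample 1, count by(rep78)
          keep make
          sort make
          cf make using `first'     // stops with an error if the two samples differ
          ```

          If -cf- completes without an error, the two draws selected the same cars.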

          If you run -viewsource sample.ado-, and scroll to near the end of the file, you will see some comments by Bill Gould about the problem of reproducibility of sampling with a -by()- option. Apparently before version 7, it was not at all reproducible, but the problem hadn't surfaced in practice. At version 7, they went to great lengths to assure that -sample, by()- would produce reproducible results. But apparently, that fix didn't carry over to the -by- prefix. I suspect that they didn't eliminate the possibility of using the -by- prefix because they were concerned about backward compatibility. But, in retrospect, perhaps they should have.

          Added: Crossed with #2, 3, 4.
          Last edited by Clyde Schechter; 01 Dec 2017, 14:45.



          • #6
            Re #2 and #4, Ms. Green is correct. Here is an example of an irreproducible result with -by: sample-.

            Code:
            . forvalues i = 1/5 {
              2.         sysuse auto, clear
              3.         set seed 7654
              4.         by rep78, sort: sample 1, count
              5.         list, noobs clean
              6. }
            (1978 Automobile Data)
            (68 observations deleted)
            
                make              price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign  
                Pont. Firebird    4,934    18       1        1.5       7    3,470      198     42        231       3.08   Domestic  
                Pont. Sunbird     4,172    24       2        2.0       7    2,690      179     41        151       2.73   Domestic  
                Pont. Le Mans     4,723    19       3        3.5      17    3,200      199     40        231       2.93   Domestic  
                Datsun 810        8,129    21       4        2.5       8    2,750      184     38        146       3.55    Foreign  
                Toyota Corona     5,719    18       5        2.0      11    2,670      175     36        134       3.05    Foreign  
                Peugeot 604      12,990    14       .        3.5      14    3,420      192     38        163       3.58    Foreign  
            (1978 Automobile Data)
            (68 observations deleted)
            
                make                price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign  
                Olds Starfire       4,195    24       1        2.0      10    2,730      180     40        151       2.73   Domestic  
                Chev. Monte Carlo   5,104    22       2        2.0      16    3,220      200     41        200       2.73   Domestic  
                Olds Cutlass        4,733    19       3        4.5      16    3,300      198     42        231       2.93   Domestic  
                Mazda GLC           3,995    30       4        3.5      11    1,980      154     33         86       3.73    Foreign  
                Dodge Colt          3,984    30       5        2.0       8    2,120      163     35         98       3.54   Domestic  
                Buick Opel          4,453    26       .        3.0      10    2,230      170     34        304       2.87   Domestic  
            (1978 Automobile Data)
            (68 observations deleted)
            
                make                 price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign  
                Pont. Firebird       4,934    18       1        1.5       7    3,470      198     42        231       3.08   Domestic  
                Chev. Monte Carlo    5,104    22       2        2.0      16    3,220      200     41        200       2.73   Domestic  
                Chev. Nova           3,955    19       3        3.5      13    3,430      197     43        250       2.56   Domestic  
                VW Rabbit            4,697    25       4        3.0      15    1,930      155     35         89       3.78    Foreign  
                Toyota Celica        5,899    18       5        2.5      14    2,410      174     36        134       3.06    Foreign  
                Peugeot 604         12,990    14       .        3.5      14    3,420      192     38        163       3.58    Foreign  
            (1978 Automobile Data)
            (68 observations deleted)
            
                make             price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign  
                Olds Starfire    4,195    24       1        2.0      10    2,730      180     40        151       2.73   Domestic  
                Cad. Eldorado   14,500    14       2        3.5      16    3,900      204     43        350       2.19   Domestic  
                Cad. Deville    11,385    14       3        4.0      20    4,330      221     44        425       2.28   Domestic  
                Honda Civic      4,499    28       4        2.5       5    1,760      149     34         91       3.30    Foreign  
                Honda Accord     5,799    25       5        3.0      10    2,240      172     36        107       3.05    Foreign  
                Plym. Sapporo    6,486    26       .        1.5       8    2,520      182     38        119       3.54   Domestic  
            (1978 Automobile Data)
            (68 observations deleted)
            
                make             price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign  
                Olds Starfire    4,195    24       1        2.0      10    2,730      180     40        151       2.73   Domestic  
                Dodge Diplomat   4,010    18       2        4.0      17    3,600      206     46        318       2.47   Domestic  
                AMC Pacer        4,749    17       3        3.0      11    3,350      173     40        258       2.53   Domestic  
                VW Dasher        7,140    23       4        2.5      12    2,160      172     36         97       3.74    Foreign  
                Toyota Corolla   3,748    31       5        3.0       9    2,200      165     35         97       3.21    Foreign  
                Buick Opel       4,453    26       .        3.0      10    2,230      170     34        304       2.87   Domestic
            This problem does not arise with the -by()- option, however.



            • #7
              Re #2, 4: Rachel has it right. The results with -by:- are not reproducible, though those with -by()- are. Run this to see:
              Code:
              forvalues i = 1/5 {
                  sysuse auto, clear
                  set seed 7654
                  by rep78, sort: sample 1, count
                  list, noobs clean
              }
              Note: I ran this and tried to post the output, to save you the trouble, but for some reason that triggers the Forum's spam filter.



              • #8
                Looking back at the original post, the sort was done BEFORE setting the seed. So I wouldn't expect the results to be the same. As for all these variations, I am getting confused now! But using the infamous stable option makes it work:

                Code:
                forvalues i = 1/5 {
                    sysuse auto, clear
                    set seed 7654
                    sort rep78, stable
                    by rep78: sample 1, count
                    list, noobs clean
                }
                Given that you set seed though, I am not sure why stable is necessary.



                • #9
                  Clyde is spot on with respect to the indeterminate sort order within by-groups when the sort key(s) do not fully identify observations. When sample does its thing, it does so based on the current sort order. So for example, the following will result in different samples at each iteration:
                  Code:
                  forvalues i = 1/5 {
                      set seed 76543
                      sysuse auto, clear
                      sort rep78
                      sample 1, count
                      list make price, noobs clean
                  }
                  With the by() option, the data have to be sorted, but sample does so while preserving the original sort order within each by-group (i.e. the sort is made stable). The solution when using the by: prefix is to fully sort the data before calling sample.
                  Code:
                  forvalues i = 1/5 {
                      sysuse auto, clear
                      set seed 76543
                      bysort rep78 (make): sample 1, count
                      list make price, noobs clean
                  }
                  A "better" way to do this is to use the isid command to make sure that the sort keys uniquely identify observations (and leave the data sorted):
                  Code:
                  forvalues i = 1/5 {
                      sysuse auto, clear
                      set seed 76543
                      isid rep78 make, sort missok
                      by rep78: sample 1, count
                      list make price, noobs clean
                  }
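
                  Applied to the layout described in the original post, the same idea might look like the sketch below. The toy data and the variable name SchoolID (standing in for the poster's "School ID" variable) are assumptions for illustration only; the sketch also assumes that the school identifier is never missing and that Villages and SchoolID together uniquely identify observations.

                  ```stata
                  * Toy data standing in for the poster's file; names and values are hypothetical.
                  clear
                  input str10 Villages long SchoolID
                  "A" 1
                  "A" 2
                  "B" 3
                  "B" 4
                  "B" 5
                  end

                  set seed 7654
                  isid Villages SchoolID, sort   // errors out unless the pair uniquely identifies observations
                  by Villages: sample 1, count   // one school per village, drawn from a fully determined order
                  list, noobs clean
                  ```

                  Because -isid- leaves the data in a fully determined order, rerunning this block from the top should select the same schools each time.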

