
  • Set Seed Random Sample

    I would like to obtain a nationally representative sample from a (non-representative) survey data set while using every individual in it. To do so, I draw a 100% random sample with the corresponding weights using gsample 100 [aw=weight], percent. I do not expand the data by the weights first and then draw a random sample with other commands (e.g. sample), because the data set I start with is already quite big. The problem is that I do not get the same sample each time I draw it. I tried setting the seed at various positions: at the beginning of the do-file and right before the gsample command. I also tried sorting the data by a unique identifier right before drawing the sample, with the seed set right before that. Nothing has made the sample reproducible. Any help would be appreciated. I cannot post the code that precedes the gsample command since it is quite long, but some of its commands sort the data.

  • #2
    One possibility is that your use of -sort- results in groups of observations that are tied on the sorting variable, in which case Stata randomizes their order. (See -help sort-.) Check out the -stable- option on -sort-, which might solve your problem. Another possibility is to create a random variable, and include that in the sort along with your other sort keys, which would ensure a distinct and reproducible order.

    You should only set the seed once, which I like to do at the top of my do-file, but all that should matter is that it occurs before anything with a random component is done, which includes sort (or any program that calls sort), and gsample. If this doesn't work for you, I'd suggest that you would need to create a simplified example of your code and post it, as your description, at least to me, is relatively obscure.
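
    The two suggestions above can be sketched as follows. This is a minimal illustration on made-up data, not the poster's actual code: the variable names id, group, and u and the seed value are hypothetical. It assumes id uniquely identifies observations, so that sorting on it gives a fixed order before the random variable is drawn.

```stata
clear
set seed 20240101              // hypothetical seed value; set once, before any random step
set obs 6
generate id    = _n            // assumed unique identifier
generate group = mod(_n, 2)    // ties on purpose: three observations per group

* Option 1: -stable- keeps tied observations in their current relative order
sort group, stable

* Option 2: break ties with a random variable drawn after the seed is set
sort id                        // fix the order first, so each id gets the same draw on every run
generate double u = runiform()
sort group u                   // u separates the ties within group
list id group u
```

    The second option is reproducible only if the data are in a fixed order when u is generated, which is why the sketch sorts on the unique id first.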



    • #3
      Thanks for the reply Mike!

      Why does setting the seed at the very beginning of the do-file not work for the commands that sort the data? I have several commands where I bysort the data. I thought that setting the seed before all of these sorting commands would give the same sort order, and hence the same results, every time. It would be quite tedious to use the stable option for each of them, especially since bysort does not explicitly allow the stable option (if I am not mistaken). I tried sorting the data with the stable option right before setting the seed, which comes right before the gsample command, but that didn't help either.



      • #4
        I don't know why setting the seed doesn't resolve the indeterminacy of -sort-. You'd think it would. I'd guess that this issue has come up on Statalist before, so some searching of the archives might lead to an explanation.

        -bysort- can be replaced with
        Code:
        sort x, stable
        by x: do something
        With a moderately sophisticated text editor, it's likely possible to search and replace all of your -bysort- commands with something like the preceding.
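
        As a concrete version of the replacement above, here is a small sketch on the auto data set that ships with Stata. The new variable names n and n2 are hypothetical. The last line shows an alternative that is not mentioned in the post: -bysort- accepts extra sort keys in parentheses, so a unique variable (here make, which is assumed to identify each car) can break the ties without a separate -sort- line.

```stata
sysuse auto, clear

* Instead of:  bysort foreign: generate n = _n
sort foreign, stable               // tied observations keep their current order
by foreign: generate n = _n        // running count within each group

* Alternative: let a unique key in parentheses settle the order within groups
bysort foreign (make): generate n2 = _n
```

        With the parenthesized form, the data are sorted by foreign and then make, but the by-groups are still defined by foreign alone.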

        However, the fact that your whole program depends on sort order in this way seems a little funny to me, and I wonder if there is some other kind of approach to your larger goals that would avoid this. Without seeing some kind of example so as to better understand what you are doing, I don't have any good ideas for you, but I suspect someone else might.



        • #5
          Let's wait and see if someone has any insight into why setting the seed at the very beginning of the do-file doesn't work. I searched Google for this issue and found a post where Nick Cox suggests using a single set seed at the very beginning, so I followed that. If no one provides additional answers, I will rerun all the sort commands with the stable option and see what happens. I would still prefer the more elegant approach of setting the seed once.



          • #6
            Seems to be intentional behavior, see this Statalist posting: https://www.stata.com/statalist/arch.../msg00817.html



            • #7
              Thanks Anders. sortseed worked. Are there any disadvantages to using set sortseed?
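
              For readers following along, a minimal sketch of the fix mentioned here, assuming a Stata version that supports set sortseed. The seed values and the variable order1 are hypothetical, and the auto data set stands in for the survey data. The final assert is meant as a check that, from the same starting order and the same sort seed, tied observations come out in the same order.

```stata
set seed 12345          // hypothetical value; governs random draws such as runiform()
set sortseed 12345      // hypothetical value; governs how -sort- orders tied observations

sysuse auto, clear
sort make               // make is assumed unique: a fixed starting order
set sortseed 12345
sort foreign            // many ties within foreign
generate long order1 = _n

* Repeat from the same starting order with the same sort seed
sort make
set sortseed 12345
sort foreign
assert order1 == _n     // the tied observations should land in the same positions
```

              In a do-file, the two set commands would normally appear once at the top, before any sorting or sampling step.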

