
  • How do I draw two samples from a dataset?

    I am drawing a random sample (1/3) from a data set to predict an outcome variable using a probit regression.
    Second, I am trying to use the estimated coefficients to predict the same outcome variable for the remaining observations (2/3).
    Here, I have had trouble restricting the prediction to the remaining observations: drawing another random sample would also include observations from the first sample.
    After predicting the outcome variable for the second sample, I need to split that sample at the median and assign a dummy variable indicating whether each observation is above or below the median.

    sample 33.333

    dprobit call....

    estimates store dprobit

    predict callpredicted

    summarize callpredicted, detail

    egen p50=pctile(callpredicted)

    restore


    sum callpredicted if callpredicted<p50

    sum callpredicted if callpredicted>p50


    Thank you.
    Last edited by Mary Brown; 01 Mar 2019, 08:24. Reason: stata, sample, random sample, draw sample, sample groups, median, predicted values, observations, estimations

  • #2
    This can be done more simply. There is no need to chop up the data into sample and complementary sample for what you are doing: you can keep all the data in memory and just designate the sample with an indicator variable. Then you run the -dprobit- on the data from the sample only. Then run -predict-, which by default will also apply to the out-of-sample observations. Then you can use the -xtile- command to split the out-of-sample data at the out-of-sample median:

    Code:
    //    IDENTIFY A RANDOM 1/3 SAMPLE OF THE DATA
    set seed 1234 // OR YOUR FAVORITE SEED
    gen double shuffle = runiform()
    sort shuffle
    gen byte in_sample = (3*_n <= _N)
    
    //    FIT MODEL IN THE 1/3 SAMPLE
    dprobit call ... if in_sample
    estimates store dprobit
    
    //    USE RESULTS OF DPROBIT TO PREDICT OUTCOMES
    predict callpredicted    // NOTE: INCLUDES OUT-OF-SAMPLE OBSERVATIONS BY DEFAULT
    
    //    SPLIT THE OUT-OF-SAMPLE RESULTS AT THE (OUT-OF-SAMPLE) MEDIAN
    xtile group = callpredicted if !in_sample, nq(2)
    replace group = group - 1 // MAKE GROUP A 0/1 VARIABLE
    by group, sort: summ callpredicted

    Note: You didn't say anything about your data set. If the number of observations is in the millions, then generate two double-precision random number variables, shuffle1 and shuffle2, and sort the data on both of them before generating the in_sample variable, to ensure a unique sort order and deterministic results.
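
    For very large data sets, the two-variable shuffle described in the note might look like the sketch below. It is meant as a replacement for the first block of the code above, not an addition to it; the seed value is arbitrary, and the names shuffle1 and shuffle2 are the ones suggested in the note.

    ```stata
    //    IDENTIFY A RANDOM 1/3 SAMPLE, BREAKING TIES WITH A SECOND RANDOM KEY
    set seed 1234 // ARBITRARY; USE ANY SEED YOU LIKE
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    sort shuffle1 shuffle2 // SECOND KEY ORDERS ANY TIES ON THE FIRST
    gen byte in_sample = (3*_n <= _N)
    ```

    Everything after this point (fitting the model with the if in_sample qualifier, -predict-, and -xtile-) would stay the same.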
    Last edited by Clyde Schechter; 01 Mar 2019, 10:03.



    • #3
      Thanks for your reply. This is indeed very useful. I have tried to run your code, but I can't quite figure out the "Fit model in the 1/3 sample" part, as dprobit call... if in sample does not seem to work. Moreover, I need to add a condition when summarizing callpredicted: in addition to splitting below and above the median, I also need to restrict to observations where race is black or white. When I try to add if race=="w" or race=="b", it does not seem to work.

      Many thanks.
