
  • How to predict out of sample values after svy logistic

    Hi there,

    I am working with survey data and the svy commands, and wondering: how can I predict out-of-sample values after svy logit or svy logistic?

    In my dataset:
    • The records are from five different survey rounds; the variable source indicates which round each record is from.
    • The sampling weight is round-specific and is stored in the variable "FQweight".
    The research question/goal:
    • The samples are two-stage cluster designs with urban-rural and major regions as strata. Given that Rounds 1-4 are sampled from a different area than Round 5, the research question is: how well do the R1-R4 data predict the R5 data? In other words, for outcomes that are expected to change over the time between R1 and R5, I would like to see whether the R1-R4 data accurately predict the R5 data and, if they do, assess how well.
    Analysis plan:
    • To address the research question, the plan is to subset the sample to Rounds 1-4 only
    • Then run svy: logistic where
      • the Y-var is binary and is expected to change over time, and where
      • the only X-var is a variable called cmc, which stores the time of interview.
      • e.g.: svy: logistic mobile cmc
    • Then, use the post-estimation predict command to predict R5 values. However, this is where my first question is (see below).
    • Then, compare the predicted R5 value with the actual R5 value (this is where my second question is).
    My questions are:
    • Given the original/non-subsetted dataset contains R5 interview times but I conducted the svy: logistic only for the R1-R4 sample, is it possible to input the R5 interview time data and predict a y-value for Round 5 based on the above svy logistic results for only R1-4 data? How?
    • After the above, how would I assess how accurate the R5 prediction is compared to the actual R5 data I have in the original dataset?
    Overall, would you suggest a different approach to answering the research question?

    I'm looking through the Stata 15 manual for svy commands and I see the post-estimation predict command, but I don't see a detailed example of how to predict out-of-sample values (link: https://www.stata.com/manuals/svy.pdf).

    Thank you so much for your time and input - I appreciate it!! Please let me know if I can be clearer.
    Last edited by Maisha Huq; 08 Sep 2018, 21:00.

  • #2
    Hi there, I edited my original question to make it clearer that I see the svy postestimation predict command but am not clear on how to use it to predict out-of-sample values. Thank you!



    • #3
      It might be helpful to show your actual commands and explain how the data are structured. Is the data in wide format or long? How did you restrict the analysis to the first 4 waves?

      At a minimum though I think you could create a gen statement. Suppose the original analysis yields something like

      y = .3X1

      You could then do something like

      gen y5hat = .3*x1 if wave == 5

      In general, though, predict usually generates out-of-sample predictions unless you explicitly tell it not to. So the same predict command that generates predicted values for waves 1-4 will do it for wave 5, unless maybe the data are in wide format rather than long.

      In short I am pretty sure you can do it. We would need more info (or at least I would) to tell you how best to do it.
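
      To make that concrete for a logistic model, here is a minimal sketch, not a definitive recipe. It assumes the pooled long-format data are in memory and already svyset, and it borrows the names mobile, cmc, and source from the first post. The if qualifier is used here only for brevity. The names p5hat and p5hat_manual are made up for illustration.

      ```stata
      * Sketch only: assumes the data are long, pooled across rounds, and svyset.
      * Fit on rounds 1-4; the round 5 rows stay in memory but are not used.
      svy: logistic mobile cmc if inrange(source, 1, 4)

      * predict applies the stored coefficients to any row with nonmissing cmc;
      * the if qualifier keeps only the round 5 (out-of-sample) predictions.
      predict p5hat if source == 5, pr

      * Hand-computed equivalent of the gen approach: after logistic, _b[] holds
      * log-odds coefficients, so the linear predictor goes through invlogit().
      generate p5hat_manual = invlogit(_b[_cons] + _b[cmc]*cmc) if source == 5
      ```

      Note that with a logit model the gen version needs invlogit(); multiplying by the coefficient alone, as in the linear example above, would give log odds rather than a probability.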
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Hi Richard

        Thanks so much for getting back! The dataset is in long format and includes household-level records. Originally, all records from rounds 1-5 are pooled in one sample. For the svy logit and predict steps below, I subset the original dataset so that one version includes only records from rounds 1-4 and another includes all records from rounds 1-5.

        I run the svy logit steps on the former dataset, i.e., only rounds 1-4, so that I can see how well the rounds 1-4 data predict the round 5 data. Would you say the code below is the way to do this? I'm not fully understanding the results (also pasted below):

        Code:
        . use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1.dta", clear
        
        . keep if inlist(source, 1, 2, 3, 4) & FQmetainstanceID==""
        (74,568 observations deleted)
        
        . save "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta", replace
        file /Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta saved
        
        . 
        . use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1.dta", clear
        
        . keep if inlist(source, 5) & FQmetainstanceID==""
        (119,512 observations deleted)
        
        . save "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta", replace
        file /Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta saved
        
        . 
        . use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta", clear
        
        . svyset EA_ID [pweight=HHweight]
        
              pweight: HHweight
                  VCE: linearized
          Single unit: missing
             Strata 1: <one>
                 SU 1: EA_ID
                FPC 1: <zero>
        
        . svy: logistic mobile hh_cmc
        (running logistic on estimation sample)
        
        Survey: Logistic regression
        
        Number of strata   =         1                  Number of obs     =     57,614
        Number of PSUs     =       110                  Population size   = 58,810.472
                                                        Design df         =        109
                                                        F(   1,    109)   =      12.68
                                                        Prob > F          =     0.0005
        
        ------------------------------------------------------------------------------
                     |             Linearized
              mobile | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              hh_cmc |   1.011959   .0033785     3.56   0.001     1.005285    1.018677
               _cons |   9.94e-08   4.59e-07    -3.49   0.001     1.06e-11    .0009341
        ------------------------------------------------------------------------------
        Note: _cons estimates baseline odds.
        
        . use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta", clear
        
        . predict probhat
        (option pr assumed; Pr(mobile))
        
        . summarize probhat mobile
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             probhat |     16,333    .6491858    .0009743   .6487713   .6514753
              mobile |     15,977    .6206422    .4852424          0          1
        
        .



        • #5
          I'm somewhat nervous about how you deleted cases along the way. It is usually recommended that you keep all cases and use the subpop option of svy to restrict the sample. Deleting cases potentially screws up the standard errors. See

          https://stats.idre.ucla.edu/stata/fa...data-in-stata/

          If it was me, I probably would have used all 5 waves together, and then used the subpop option to analyze only the first 4 waves. Then I would have used the predict command, limiting it to the wave 5 cases if I didn't want predictions for the other waves.
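
          A rough sketch of that workflow, using the file and variable names shown in post #4. The indicator fit14 is a made-up name. Whether the non-household records in the pooled file carry usable values of HHweight is an assumption to check before relying on this.

          ```stata
          * Sketch only: names are taken from post #4; fit14 is hypothetical.
          use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1.dta", clear

          * Flag the rows that should enter the model: household records, rounds 1-4
          generate byte fit14 = inrange(source, 1, 4) & FQmetainstanceID == ""

          svyset EA_ID [pweight=HHweight]

          * subpop() restricts estimation to flagged rows without dropping cases
          svy, subpop(fit14): logistic mobile hh_cmc

          * Predicted probabilities for the round 5 household records only
          predict probhat if source == 5 & FQmetainstanceID == "", pr
          ```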

          I'm not sure that this would have much, or any, effect on the results you are most interested in. You can try it and see.
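
          On the second question in the thread (how close the predictions are to the observed round 5 values), one possible sketch follows. It assumes a variable probhat already holds predicted probabilities for the round 5 rows, as in post #4, and that the data in memory contain the variables source, EA_ID, and HHweight. The name sqerr is made up. The squared-error summary is one common choice (a Brier score), not something proposed in the thread.

          ```stata
          * Sketch only: assumes probhat exists for the round 5 rows.
          * Re-declare the design in case the file in memory is not svyset.
          svyset EA_ID [pweight=HHweight]

          * Design-based means of predicted and observed values, round 5 only
          svy, subpop(if source == 5): mean probhat mobile

          * Difference between the mean prediction and the observed proportion
          lincom _b[probhat] - _b[mobile]

          * Row-level squared error and its weighted mean (Brier score)
          generate double sqerr = (mobile - probhat)^2 if source == 5
          svy, subpop(if source == 5): mean sqerr
          ```

          Unlike the unweighted summarize in post #4, these means use the sampling weights, and the comparison is restricted to rows where both probhat and mobile are nonmissing.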
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam

