How to predict out of sample values after svy logistic

Maisha Huq

Join Date: Jul 2025

Posts: 15
#1

08 Sep 2018, 20:35

Hi there,

I am working with survey data and the svy commands, and wondering: how can I predict out-of-sample values after svy logit or svy logistic?

In my dataset:
The records are from five different survey rounds; the variable source indicates which round each record is from.

The sampling weight is round-specific and is stored in the variable "FQweight".

The research question/goal:
The samples are two-stage cluster designs with urban-rural and major regions as strata. Given that Rounds 1-4 are sampled from a different area than Round 5, the research question is: how well does the R1-R4 data predict the R5 data? In other words, for outcomes that are expected to change over the time between R1 and R5, I would like to see whether the R1-4 data accurately predict the R5 data and, if so, to assess how well.

Analysis plan:
To address the research question, the plan is to subset the sample to Rounds1-4 only

Then run svy: logistic where
the Y-var is binary and is expected to change over time, and where

the only X-var is a variable called cmc, which stores the time of interview.

e.g.: svy: logistic mobile cmc

Then, use the post-estimation predict command to predict R5 values. However, this is where my first question is (see below).

Then, compare the predicted R5 value with the actual R5 value (this is where my second question is).

My questions are:
Given the original/non-subsetted dataset contains R5 interview times but I conducted the svy: logistic only for the R1-R4 sample, is it possible to input the R5 interview time data and predict a y-value for Round 5 based on the above svy logistic results for only R1-4 data? How?

After the above, how would I assess how accurate the R5 prediction is compared to the actual R5 data I have in the original dataset?

Overall, would you suggest a different approach to answering the research question?

I'm looking through the Stata 15 manual for svy commands and I see the post-estimation predict command, but I don't see a detailed example of how to predict out-of-sample values (link: https://www.stata.com/manuals/svy.pdf).

Thank you so much for your time and input - I appreciate it!! Please let me know if I can be clearer.

Last edited by Maisha Huq; 08 Sep 2018, 21:00.
Maisha Huq

Join Date: Jul 2025

Posts: 15
#2

08 Sep 2018, 21:06

Hi there, I edited my original question to make it clearer that I see the svy postestimation predict command but am not sure how to use it to predict out-of-sample values. Thank you!
Richard Williams

Join Date: Apr 2014

Posts: 5008
#3

09 Sep 2018, 15:16

It might be helpful to show your actual commands and explain how the data are structured. Is the data in wide format or long? How did you restrict the analysis to the first 4 waves?

At a minimum, though, I think you could write a gen statement. Suppose the original analysis yields something like

y = .3*x1

You could then do something like

gen y5hat = .3*x1 if wave == 5

In general, though, predict generates out-of-sample predictions unless you explicitly tell it not to. So the same predict command that generates predicted values for waves 1-4 will also do it for wave 5, unless maybe the data are in wide format rather than long.

In short, I am pretty sure you can do it. We would need more info (or at least I would) to tell you how best to do it.
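
A minimal sketch of both routes on simulated data, in case it helps. The variable names y, x1 and wave and the simulated values are made up for illustration; they are not from the poster's dataset. The sketch fits the model on waves 1-4 only, lets predict fill in all five waves, and checks that a hand-computed wave 5 prediction agrees. Note that after logit or logistic the fitted equation is on the log-odds scale, so the by-hand version needs invlogit() to return a probability.

Code:

* Simulated long-format data: 5 waves of 100 cases each (illustrative only)
clear
set seed 12345
set obs 500
generate wave = ceil(_n/100)
generate x1 = rnormal()
generate y = runiform() < invlogit(0.3*x1)

* Fit on waves 1-4 only; wave 5 stays in memory but is not in e(sample)
logistic y x1 if wave <= 4

* predict is not restricted to e(sample), so wave 5 gets predictions too
predict phat

* The same wave 5 prediction by hand, from the stored coefficients
generate phat5 = invlogit(_b[_cons] + _b[x1]*x1) if wave == 5

* The two should agree up to float rounding
assert reldif(phat, phat5) < 1e-6 if wave == 5

* Number of out-of-sample cases that received a prediction
count if !e(sample) & !missing(phat)

To keep predictions in-sample only, one would add "if e(sample)" to the predict command.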

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam

Maisha Huq

Join Date: Jul 2025

Posts: 15
#4

11 Sep 2018, 11:08

Hi Richard

Thanks so much for getting back! The dataset is in long format and contains household-level records. Originally, all records from rounds 1-5 are pooled in one file. For the svy logit and predict steps below, I split the original dataset into two files: one with only the rounds 1-4 records and another with only the round 5 records (as in the code below).

I run the svy logit step on the former dataset, i.e. only rounds 1-4, so that I can see how well the rounds 1-4 data predict the round 5 data. Is the code below the right way to do this? I don't fully understand the results (also pasted below):

Code:

. use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1.dta", clear

. keep if inlist(source, 1, 2, 3, 4) & FQmetainstanceID==""
(74,568 observations deleted)

. save "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta", replace
file /Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta saved

. 
. use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1.dta", clear

. keep if inlist(source, 5) & FQmetainstanceID==""
(119,512 observations deleted)

. save "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta", replace
file /Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta saved

. 
. use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-1234hh.dta", clear

. svyset EA_ID [pweight=HHweight]

      pweight: HHweight
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: EA_ID
        FPC 1: <zero>

. svy: logistic mobile hh_cmc
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata   =         1                  Number of obs     =     57,614
Number of PSUs     =       110                  Population size   = 58,810.472
                                                Design df         =        109
                                                F(   1,    109)   =      12.68
                                                Prob > F          =     0.0005

------------------------------------------------------------------------------
             |             Linearized
      mobile | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hh_cmc |   1.011959   .0033785     3.56   0.001     1.005285    1.018677
       _cons |   9.94e-08   4.59e-07    -3.49   0.001     1.06e-11    .0009341
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

. use "/Users/maishahuq/Desktop/EA Refresh Analysis/Combined-UG-v1-5hh.dta", clear

. predict probhat
(option pr assumed; Pr(mobile))

. summarize probhat mobile

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     probhat |     16,333    .6491858    .0009743   .6487713   .6514753
      mobile |     15,977    .6206422    .4852424          0          1

.


Richard Williams

Join Date: Apr 2014

Posts: 5008
#5

11 Sep 2018, 11:44

I'm somewhat nervous about how you deleted cases along the way. It is usually recommended that you keep all cases and use the subpop option of svy to restrict the sample. Deleting cases potentially screws up the standard errors. See

https://stats.idre.ucla.edu/stata/fa...data-in-stata/

If it were me, I probably would have kept all 5 years together and then used the subpop option to analyze only the first 4 waves. Then I would have used the predict command, limiting it to the wave 5 cases I wanted if I didn't want the other years.

I'm not sure that this would have much, or any, effect on the results you are most interested in. You can try it and see.
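
A rough sketch of that approach, using the variable names from post #4 (source, FQmetainstanceID, EA_ID, HHweight, mobile, hh_cmc). It assumes the pooled rounds 1-5 file is already in memory and that HHweight is filled in for the household records in every round; the indicator names fit14 and r5 are made up here. Strata are left out of svyset only because the posted svyset has none.

Code:

* Flag the subpopulations instead of deleting cases
generate byte fit14 = inrange(source, 1, 4) & FQmetainstanceID == ""
generate byte r5    = source == 5 & FQmetainstanceID == ""

* Declare the design on the full pooled file
svyset EA_ID [pweight=HHweight]

* Estimate on rounds 1-4 only, with all cases still in memory
svy, subpop(fit14): logistic mobile hh_cmc

* Predicted probabilities for the round 5 household records only
predict probhat if r5, pr

* Weighted means of predicted and observed values in round 5
svy, subpop(r5): mean probhat mobile

The last command gives only a first, aggregate comparison of predicted and observed values in round 5; it is not a full assessment of predictive accuracy.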

