Regression analysis based on truncated sample data

Michaella Allen

Join Date: Aug 2015

Posts: 5
#1

21 Oct 2017, 08:36

Hi,

I'm currently working with nationally representative, but confidential, household survey data which looks at the level of ICT access and usage for randomly selected individuals within a given population. It has 1771 unique observations and can be disaggregated across various sub-criteria.

For the purposes of my study, however, I would like to restrict this dataset to the urban poor (according to a national income poverty line) in order to estimate their probability of being digitally poor. Given that I am effectively truncating my data and analysing a non-randomly selected sample, I was wondering if there is any way in which I can perform a regression analysis without producing biased estimates?

Although I am aware of the truncreg command, I'm not sure it's appropriate to use in this case since my dependent variable is not the variable I am truncating. The dependent variable for my study is a categorical variable for digital poverty, and I am truncating the sample to only include those individuals with a monthly per capita income of less than or equal to 758.

I would ideally like to run a generalised ordered logistic (gologit2) regression, but I don't want to provide misleading results. Therefore, if there is any way in which I can control for this sample selection bias, I would be extremely grateful for any guidance on how to achieve it.

Many thanks in advance for any advice provided!
Tags: categorical, logistic, regression, truncation

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

21 Oct 2017, 11:18

Michaella:
I do not think any truncation is necessary.
Provided that you take the survey structure of your data into account, you can simply calculate inferential statistics by imposing an -if- condition:

Code:

. use http://www.stata-press.com/data/r14/nhanes2.dta
. svy: regress highbp i.race if age<42
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        31                 Number of obs     =       4,204
Number of PSUs     =        62                 Population size   =  60,739,447
                                               Design df         =          31
                                               F(   2,     30)   =        2.80
                                               Prob > F          =      0.0768
                                               R-squared         =      0.0022

------------------------------------------------------------------------------
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      Black  |   .0635273   .0264146     2.41   0.022     .0096544    .1174001
      Other  |   .0227803   .0641107     0.36   0.725    -.1079743    .1535349
             |
       _cons |   .2142075   .0129146    16.59   0.000      .187868    .2405469
------------------------------------------------------------------------------
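
A related option, sketched here on the same nhanes2 example data, is the subpop() option of -svy-. It restricts estimation to the subpopulation while keeping the full survey design in the variance calculation. The point estimates match the -if- version; the standard errors may differ. The choice of age<42 simply mirrors the example above.

```stata
* Load the example data (already svyset); webuse is assumed to reach the same file
webuse nhanes2, clear

* Subpopulation estimation: full design retained, estimates restricted to age<42
svy, subpop(if age<42): regress highbp i.race
```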

Kind regards,
Carlo
(Stata 19.0)

Michaella Allen

Join Date: Aug 2015

Posts: 5
#3

21 Oct 2017, 12:19

Thanks for your advice, Carlo.
But just to be clear, wouldn't the regression command below yield inconsistent estimates, given that I'm ignoring a not-insignificant portion of the sample?

Code:

xi: gologit2 digpovreg i.EA_true Rmonthlyinc i.female age hhsize i.homelang if Rmonthlyinc<=758 & geoloc==1 & !missing(digpovreg)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

22 Oct 2017, 05:39

Michaella:
I would say that it depends on whether the statistical plan of the study allows (or not) subgroup analysis.
Otherwise, you can analyze the full sample and add, among predictors, a two-level categorical variable (0=above the poverty line; 1=below the poverty line).
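
A minimal sketch of that full-sample approach, reusing the variable names from post #3. The indicator name poor is hypothetical, geoloc==1 is assumed to identify urban respondents, and Rmonthlyinc is left out only to keep the sketch short; whether to keep it alongside the indicator is a modelling choice.

```stata
* Hypothetical indicator: 1 = at or below the poverty line, 0 = above it
generate byte poor = (Rmonthlyinc <= 758) if !missing(Rmonthlyinc)

* Urban respondents only (assumed: geoloc==1), poor and non-poor together
xi: gologit2 digpovreg i.poor i.EA_true i.female age hhsize i.homelang ///
    if geoloc==1 & !missing(digpovreg)
```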

Kind regards,
Carlo
(Stata 19.0)