  • Regression analysis based on truncated sample data

    Hi,

    I'm currently working with nationally representative, but confidential, household survey data which looks at the level of ICT access and usage for randomly selected individuals within a given population. It has 1771 unique observations and can be disaggregated across various sub-criteria.

    For the purposes of my study, however, I would like to restrict this dataset to the urban poor (according to a national income poverty line) in order to estimate their probability of being digitally poor. Given that I am effectively truncating my data and analysing a non-randomly selected sample, I was wondering if there is any way in which I can perform a regression analysis without producing biased estimates?

    Although I am aware of the truncreg command, I'm not sure it's appropriate to use in this case since my dependent variable is not the variable I am truncating. The dependent variable for my study is a categorical variable for digital poverty, and I am truncating the sample to only include those individuals with a monthly per capita income of less than or equal to 758.

    I would ideally like to run a generalised ordered logistic (gologit2) regression, but I don't want to provide misleading results. Therefore, if there is any way in which I can control for this sample selection bias, I would be extremely grateful for any guidance on how to achieve it.

    Many thanks in advance for any advice provided!

  • #2
    Michaella:
    I do not think any truncation is necessary.
    Provided that you take the survey structure of your data into account, you can simply calculate inferential statistics by imposing an -if- condition:
    Code:
    . use http://www.stata-press.com/data/r14/nhanes2.dta
    . svy: regress highbp i.race if age<42
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =        31                 Number of obs     =       4,204
    Number of PSUs     =        62                 Population size   =  60,739,447
                                                   Design df         =          31
                                                   F(   2,     30)   =        2.80
                                                   Prob > F          =      0.0768
                                                   R-squared         =      0.0022
    
    ------------------------------------------------------------------------------
                 |             Linearized
          highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            race |
          Black  |   .0635273   .0264146     2.41   0.022     .0096544    .1174001
          Other  |   .0227803   .0641107     0.36   0.725    -.1079743    .1535349
                 |
           _cons |   .2142075   .0129146    16.59   0.000      .187868    .2405469
    ------------------------------------------------------------------------------
    Kind regards,
    Carlo
    (Stata 19.0)
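
A hedged side note on the -if- approach above: for survey data, Stata's documentation generally recommends the subpop() option of the svy prefix over an -if- restriction, because subpop() keeps the full design in the variance calculation. The minimal sketch below reuses the same nhanes2 example and assumes the dataset is already svyset (as the output above suggests) and that webuse can reach the Stata Press site. The point estimates should match the -if- version; the standard errors can differ.

```stata
* Sketch only: same subgroup model, estimated two ways on nhanes2
webuse nhanes2, clear

* (a) -if- restriction, as in the post above
svy: regress highbp i.race if age < 42
matrix b_if = e(b)

* (b) subpop(): observations with age >= 42 stay in the design
*     for variance estimation but do not enter the point estimates
svy, subpop(if age < 42): regress highbp i.race
matrix b_sub = e(b)

* Compare the two coefficient vectors side by side
matrix list b_if
matrix list b_sub
```

The same pattern would apply to an income threshold: the condition inside subpop() simply replaces the -if- qualifier.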



    • #3
      Thanks for your advice, Carlo.
      But just to be clear, wouldn't the regression command below compromise the consistency of the estimators, given that it ignores a not-insignificant portion of the sample?

      Code:
       xi: gologit2 digpovreg i.EA_true Rmonthlyinc i.female age hhsize i.homelang if Rmonthlyinc<=758 & geoloc==1 & !missing(digpovreg)



      • #4
        Michaella:
        I would say that it depends on whether the statistical plan of the study allows for subgroup analyses.
        Otherwise, you can analyze the full sample and add, among predictors, a two-level categorical variable (0=above the poverty line; 1=below the poverty line).
        Kind regards,
        Carlo
        (Stata 19.0)
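
The full-sample alternative suggested above can be sketched as follows. This is an illustration only, run on the public nhanes2 data from the earlier example rather than the confidential survey. The indicator name `young` and its value label are hypothetical stand-ins for a below/above poverty line dummy.

```stata
* Sketch only: keep the full sample and add a two-level indicator
webuse nhanes2, clear

* Hypothetical indicator: 1 = under 42, 0 = 42 or older
generate byte young = age < 42 if !missing(age)
label define younglbl 0 "42 or older" 1 "Under 42"
label values young younglbl

* Full-sample model with the indicator among the predictors
svy: regress highbp i.race i.young

* With the thread's own variables (not run here; the data are confidential),
* the analogous indicator would be:
*   generate byte poor = Rmonthlyinc <= 758 if !missing(Rmonthlyinc)
* and i.poor would enter the predictor list in place of the
* income condition in the -if- qualifier.
```

Note that an indicator only shifts the intercept for the subgroup; if the slopes are expected to differ below the poverty line, interaction terms with the indicator would be needed as well.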
