Consequences of removing 0s in count data within a Heckman situation

Maxence Morlet

Join Date: Mar 2021

Posts: 652
#1


09 Aug 2022, 09:35

Hi all,

More of a conceptual, econometric question than a data-driven question.

I have a panel dataset and want to run regressions with the total number of new job positions for the next recruitment period (the next month) as the dependent variable. The dependent variable is a count variable, specific to the survey respondent, region and month.

The dependent variable itself is a rowtotal of vacant and filled positions.

The issue is that in our survey, the question producing the two abovementioned variables was optional and was only filled out by survey respondents who wished to recruit additional workers. We therefore have a lot of missing values in this variable, which were recoded as zeros.

This configuration looks like a classic Heckman situation, with the observability of a strictly positive value for the dependent variable being endogenous and a function of selection.

My question is:

- What would be the consequence of removing the zeros and running either nonlinear or linear models with multiple fixed-effect vectors on the trimmed sample?

I know that ignoring the problem of selection and running OLS on the entire sample yields bias (e.g. Johnston and DiNardo, 1997). However, what happens if we only consider the strictly positive subset of the data? I have been unable to find literature on this topic...
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#2

09 Aug 2022, 13:11

Either omission or inclusion of the zero observations will result in biased OLS coefficient estimates. You can derive this result formally (here, I just deal with the case of omission that you ask about). In the general model, you have $y_{i}= \beta^{\prime}x_{i} + u_{i}$ if $\beta^{\prime}x_{i} + u_{i} >0$, where we suppose $u_{i}\sim N(0, \sigma^{2})$. Otherwise you have $y_i= 0$ if $\beta^{\prime}x_{i} + u_{i} \leq 0$. For omission of zero observations:

$$E[u_i|\beta^{\prime}x_{i} + u_{i} >0]= E[u_i|u_i> -\beta^\prime x_i]=\sigma\lambda_i >0$$

where $\lambda_i = \phi(\beta^{\prime}x_i/\sigma)/\Phi(\beta^{\prime}x_i/\sigma)$ is the inverse Mills ratio. Correspondingly,

$$E[u_i x_i|\beta^{\prime}x_{i} + u_{i} >0]= E[u_i x_i|u_i> -\beta^\prime x_i]=\sigma\lambda_i x_i\neq0$$

and thus the first two OLS assumptions (a zero-mean error and regressors uncorrelated with the error) are violated on the retained sample. OLS on the strictly positive observations thus yields biased and inconsistent estimates. Note that the above result was derived by noting that for a standard normal variable $z\sim N(0,1)$, you have:

$$E[z|z>a]= \frac{\int_{a}^{\infty} z\phi(z)dz}{1-\Phi(a)}=-\frac{\int_{a}^{\infty} d\phi(z)}{1-\Phi(a)}= \frac{\phi(a)}{1-\Phi(a)} = \lambda(a).$$
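A quick numerical check of the truncated-mean formula above, as a minimal sketch using only the Python standard library. The function names `inverse_mills` and `truncated_mean`, the sample size and the seed are illustrative choices, not part of the derivation.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills(a: float) -> float:
    """lambda(a) = phi(a) / (1 - Phi(a)), i.e. E[z | z > a] for z ~ N(0, 1)."""
    return STD_NORMAL.pdf(a) / (1.0 - STD_NORMAL.cdf(a))


def truncated_mean(a: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[z | z > a]: draw n standard normals, keep z > a."""
    rng = random.Random(seed)
    kept = [z for z in (rng.gauss(0.0, 1.0) for _ in range(n)) if z > a]
    return fmean(kept)


# Compare the simulated truncated mean with the closed form at a few cut-offs.
for a in (-1.0, 0.0, 1.0):
    print(f"a={a:+.1f}  simulated={truncated_mean(a):.4f}  formula={inverse_mills(a):.4f}")
```

The two columns should agree up to simulation noise, and both are strictly positive, which is the source of the nonzero conditional mean of $u_i$ once the zeros are dropped.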
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#3

11 Aug 2022, 09:37

Thanks Andrew! So in this case, a simple OLS coefficient would be downwards-biased, right? So it would simply be a lower bound for the true effect?

Would you know of any factors / circumstances which could exacerbate or diminish the magnitude of this bias?
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#4

11 Aug 2022, 12:37

Originally posted by Maxence Morlet:

"Thanks Andrew! So in this case, a simple OLS coefficient would be downwards-biased, right?"

Yes, OLS coefficients are biased downwards.

"Would you know of any factors / circumstances which could exacerbate or diminish the magnitude of this bias?"

The obvious factor is the magnitude of the censoring (whether many observations are censored or just a few). I do not know of any studies that systematically investigate such factors, as there exist ways to address the problem:

(i) either using maximum likelihood assuming some distribution for the residuals or better,
(ii) using a general selection model that distinguishes between the observation and selection processes.
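To illustrate how the share of dropped observations matters, here is a minimal simulation sketch in standard-library Python. Everything in it is an assumption made for illustration: a single standard normal regressor, normal errors, a true slope of 1, and an intercept that is shifted to control how many latent outcomes fall at or below zero. It is not a model of the survey data discussed in this thread.

```python
import random
from statistics import fmean


def ols_slope(xs, ys):
    """Slope from a one-regressor OLS fit that includes an intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def truncated_slope(intercept, beta=1.0, sigma=1.0, n=100_000, seed=1):
    """Simulate y* = intercept + beta*x + u, drop rows with y* <= 0, refit OLS.

    Returns (share of rows dropped, OLS slope on the retained rows).
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y_star = intercept + beta * x + rng.gauss(0.0, sigma)
        if y_star > 0:  # only strictly positive outcomes are retained
            xs.append(x)
            ys.append(y_star)
    return 1.0 - len(xs) / n, ols_slope(xs, ys)


# A lower intercept pushes more latent outcomes below zero, so more rows are dropped.
for c in (2.0, 0.0, -2.0):
    dropped, slope = truncated_slope(c)
    print(f"intercept={c:+.1f}  dropped={dropped:.2f}  slope={slope:.3f}  (true beta=1.0)")
```

Under these assumptions the estimated slope sits below the true value in every case and moves further from it as the dropped share grows.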

Last edited by Andrew Musau; 11 Aug 2022, 12:42.
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#5

11 Aug 2022, 12:56

Thanks for your reply.

Well, this would be a Heckman situation. The thing is, the dataset we use has no variables that would plausibly explain selection, i.e. the firms' choice to recruit. Hence my whole concern...
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#6

11 Aug 2022, 13:05

How many time periods do you have?
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#7

11 Aug 2022, 14:34

It's an unbalanced dataset unfortunately, but circa 6 months on average per survey respondent.

We do, however, have around 825 observations for each respondent, because each respondent i reports new positions by month t, occupation o and region r.

Also, you mentioned the magnitude of the censoring; we have 7.2M observations overall (approximately), and the number of new positions is strictly greater than zero for only circa 45K observations. So it is pretty large...
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#8

12 Aug 2022, 01:48

So 6 months implies $T=6$? Discussing his simulations, Greene (2004, p. 139) notes:

With T equal to only 5, the [unconditional FE Tobit estimator] appears to be only slightly affected by the incidental parameters problem. Even at T = 3, the 4% upward bias in the marginal effects in the tobit model is likely to be well within the range of the sampling variability of the estimated parameter and the roughly 12% downward bias in the estimated standard errors will usually not reverse a conclusion about significance.

so there is scope to estimate a tobit model that includes dummies for the fixed effects. A second possibility is to use pantob, which implements the estimators developed in Honoré (1992). Either of these will be an improvement over omitting the zero observations.
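For reference, the log-likelihood behind the dummy-variable tobit can be written in the notation of #2 with a respondent effect added. The symbol $\alpha_i$ and the $it$ indexing are my own notation for this sketch, not taken from Greene's paper:

$$\ln L = \sum_{y_{it}>0}\left[-\ln\sigma + \ln\phi\left(\frac{y_{it}-\alpha_i-\beta^{\prime}x_{it}}{\sigma}\right)\right] + \sum_{y_{it}=0}\ln\Phi\left(-\frac{\alpha_i+\beta^{\prime}x_{it}}{\sigma}\right)$$

The second sum is the contribution of the zero observations, which is exactly the information that is thrown away when they are omitted. Note that this treats the outcome as a censored normal variable, which is itself an approximation for count data.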

References:

Greene, W. (2004). "Fixed Effects and Bias Due to the Incidental Parameters Problem in the Tobit Model." Econometric Reviews, 23, 125-147.

Honoré, B. E. (1992). "Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects." Econometrica, 60, 533-565.
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#9

12 Aug 2022, 01:52

pantob is available from https://www.princeton.edu/~honore/st...ob_version_0.6.
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#10

12 Aug 2022, 02:03

Thank you very much Andrew! I'll give those methods a try.