Difference in Difference estimates with binary variables

Jeremy CELSE

Join Date: May 2019

Posts: 6
#1

20 May 2019, 02:51

Dear all

I am struggling to understand and analyse my data with the -diff- command, and I hope you will be able to help me.

The goal of my study is to see whether a marketing campaign can reduce fare evasion in public transportation. Let me explain what my data look like: I have two cities in which I compare fare evasion rates before and after an intervention aimed at decreasing fare evasion. The intervention occurs in only one city (treatment) and not in the other (control), so we assessed fare evasion rates before and after the intervention in both cities. I collected the data in two waves to increase the power of the study.

The fare evasion variable is a binary with 0 = the passenger has a valid ticket & 1= the passenger was travelling with no ticket or no valid ticket.
The treatment variable is also a binary one with 0 = control and 1 = treatment
The time variable is binary 0=before the intervention & 1=after the intervention.

My problem is the following: I observe a significant decrease in fare evasion rates when I run chi-square tests, but when I run the following command the difference in differences is not significant:

diff fareevasion if wave==1, t(treatment) p(time)

As I am running the analysis using only binary variables I guess there is a severe problem of collinearity. So my question is: How can I conduct the difference in difference analysis when I only have binary variables?

Thanks a lot for your answers and have a nice day

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(fareevasion treatment time wave) float _diff
1 1 0 2 0
0 0 0 2 0
0 0 0 1 0
1 0 0 2 0
0 0 0 1 0
0 1 0 1 0
1 0 0 1 0
0 1 0 1 0
0 0 0 1 0
0 1 0 2 0
0 1 0 1 0
0 1 0 2 0
0 1 0 1 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 1 0 1 0
0 0 0 2 0
0 1 0 1 0
0 0 0 2 0
0 0 0 1 0
0 1 0 2 0
0 1 0 1 0
1 0 0 2 0
0 1 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
0 1 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
1 0 0 1 0
0 1 0 1 0
0 0 0 2 0
0 1 0 2 0
0 0 0 2 0
0 0 0 2 0
0 1 0 2 0
0 0 0 2 0
0 1 0 2 0
0 1 0 1 0
0 0 0 2 0
0 0 0 1 0
0 1 0 1 0
0 0 0 1 0
0 0 0 2 0
0 0 0 2 0
1 0 0 2 0
0 0 0 2 0
0 0 0 1 0
1 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 1 0 1 0
0 1 0 1 0
0 0 0 1 0
0 1 0 1 0
0 0 0 1 0
0 1 0 2 0
0 1 0 1 0
0 0 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
0 0 0 1 0
1 1 0 2 0
0 1 0 2 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
0 0 0 2 0
0 1 0 1 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
0 1 0 2 0
0 0 0 1 0
0 1 0 2 0
0 0 0 1 0
0 0 0 2 0
0 0 0 2 0
0 0 0 1 0
0 1 0 1 0
1 0 0 1 0
0 1 0 1 0
0 0 0 1 0
1 1 0 1 0
0 0 0 1 0
0 1 0 2 0
0 1 0 1 0
0 1 0 1 0
0 1 0 1 0
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#2

20 May 2019, 10:04

First of all, the difference between statistically significant and not statistically significant is not, itself, statistically significant. And, in fact, you shouldn't be using statistical significance any more: see the position paper by the American Statistical Association. https://www.tandfonline.com/doi/full...5.2019.1583913.

That aside, when you restrict the analysis to a subset of the data, you have a smaller sample size, so that even the same actual differences will be associated with larger (i.e. less "significant") p-values. So if you are going to compare analyses on different subsamples, it is only meaningful to look at the magnitudes of the coefficients: the p-values cannot be compared with each other.
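One way to act on this is to put the wave-specific estimates side by side and compare their magnitudes, and, if a formal comparison is wanted, to estimate the difference between waves directly instead of comparing p-values across subsamples. The following is only a sketch: it assumes the variable names from the -dataex- example above (fareevasion, treatment, time, wave) and a linear probability model, neither of which has been checked against the full data set.

```stata
* Sketch only: variable names follow the -dataex- example above.
* Linear probability DiD within each wave; compare the magnitudes of
* the 1.treatment#1.time coefficients, not their p-values.
regress fareevasion i.treatment##i.time if wave == 1, vce(robust)
regress fareevasion i.treatment##i.time if wave == 2, vce(robust)

* Pooled model: the three-way interaction (1.treatment#1.time#2.wave)
* estimates how much the DiD in wave 2 differs from the DiD in wave 1.
regress fareevasion i.treatment##i.time##i.wave, vce(robust)
```

The pooled model needs observations in every combination of treatment, time, and wave to estimate all of its terms.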

If you want additional advice, when you post back, show a different data example that includes observations from all combinations of treatment, time, and wave. Also, show the actual commands you ran along with the output that Stata gave you from them.
Mike Lacy

Join Date: Apr 2014

Posts: 2426
#3

20 May 2019, 10:29

I'd of course agree with Clyde's comments about statistical significance. I'd also say, though, that you have a design problem that can't be resolved statistically: I don't see how you can adjust for pre-existing differences between the two cities, the particular events that happen to be going on within the cities during the study period, the differences in the age/gender/ethnic/economic composition of their populations, the differences in culture between the two cities, etc. Also, as you don't happen to say anything about how the fare evasion rates were estimated (sample of passengers??), it's hard to say much about what an appropriate analysis is. Finally, I don't see how collinearity is relevant here, since you have two variables, "city" and "time point," which are not correlated. While the design problem is likely not resolvable, *perhaps* there might be some useful advice that could be offered if we knew a bit more about the data collection.

Jeremy CELSE

Join Date: May 2019
Posts: 6
#4

20 May 2019, 13:08

Hi

First of all I would like to thank you for the time you took answering me. Thanks for the paper, indeed it is particularly interesting!

So let me give you more information, as requested.

I have two cities (in fact two train stations): one serves as the baseline (station 1), the other as the experimental station (station 2). I designed a messaging campaign aimed at reducing fare evasion. The messaging campaign was conducted only in the experimental train station (station 2). Train officials carried out ticket inspections before boarding at both train stations, before and after the messaging campaign. I collected the data in two waves.

Thus I have the following information for the two train stations (cumulated data from the two waves) that was given by the train inspectors:

                                                            Station 1   Station 2
Fare evaders / passengers inspected, before the campaign    136/1459    53/1005
Fare evaders / passengers inspected, after the campaign     92/650      103/1612

Since it is a field experiment, there is a lack of control: we do not control the number of passengers inspected, nor do we observe the characteristics of the passengers (which might be used as covariates). To my mind, the only way to examine the impact of my messaging campaign is to conduct a difference-in-differences analysis.

So I coded the data as follows:

I created a variable named fareevasion that captures whether an inspected passenger was fare evading or not. If the passenger was fare evading, the variable equals 1, and 0 otherwise. In total, 384 observations equal 1 and 4342 equal 0.
I created a variable named treatment that identifies the train station. Treatment equals 1 when the data were collected at train station 2 (experimental) and 0 otherwise. In total, 2464 observations equal 0 (station 1) and 2262 equal 1 (station 2).
I created a variable named time that identifies whether the data were collected before (=0) or after (=1) the messaging campaign. In total, 2109 observations equal 0 (collected before the messaging campaign) and 2617 equal 1 (collected after the messaging campaign).
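For readers who want to reproduce the setup, the passenger-level data can be rebuilt from the aggregate counts in the table. This is a sketch that takes the table at face value (stations in columns, periods in rows) and reuses the variable names described above; the intermediate variables evaders and inspected are introduced only for illustration. Note that the column totals of the table (2109 and 2617) coincide with the counts reported for time, not for treatment, so rows and columns may be transposed relative to the coding. The difference-in-differences contrast itself is the same under either reading, roughly -0.037 on the proportion scale.

```stata
* Sketch: rebuild one row per inspected passenger from the posted counts.
* Assumes the table is read as station (columns) by period (rows).
clear
input byte(treatment time) int(evaders inspected)
0 0 136 1459
0 1  92  650
1 0  53 1005
1 1 103 1612
end

* Duplicate each cell so that there is one row per inspected passenger
expand inspected

* Within each cell, flag the first `evaders' rows as fare evaders
bysort treatment time: generate byte fareevasion = _n <= evaders
drop evaders inspected

* Check cell sizes and evasion proportions against the table
tabulate treatment time, summarize(fareevasion) nostandard

* Raw difference in differences of the four proportions
display (103/1612 - 53/1005) - (92/650 - 136/1459)
```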

And I used the -diff- command in Stata as follows:

diff fareevasion, t(treatment) p(time) robust
and I get the following results:

[Image: Capture.JPG, screenshot of the -diff- output]

So basically the results show that there is no impact of the messaging campaign.

My questions:

Did I code the data correctly?
Did I use the correct command and syntax?
Since it is a binary dependent variable, it might be better to use a logit or probit model, but if there is no effect in a linear model I am pessimistic about finding one in a logit model.
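As a cross-check on the command and syntax, the linear estimate reported by -diff- (a user-written command from SSC) should, as far as I understand it, match the interaction coefficient from an ordinary regression with official Stata commands. A minimal sketch, assuming the variable names described above:

```stata
* Cross-check of the linear DiD using official Stata commands.
* The 1.treatment#1.time coefficient is the DiD on the proportion scale.
regress fareevasion i.treatment##i.time, vce(robust)

* Fitted fare evasion proportion in each treatment-by-time cell
margins treatment#time
```

If the interaction coefficient here differs from the -diff- output, that would point to a coding or option mismatch and not to a substantive result.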

Any suggestions? I am running out of ideas (and I am not an expert in econometrics).

Thanks in advance & good night (I am in Europe ;^)

J

Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#5

20 May 2019, 22:55

Mike Lacy makes some good points about problems with the design of this study. But let's put those aside, since it sounds like you cannot really do anything about those problems without arranging logistics for a much more complicated study.

From your description of the variables, it sounds like they are properly coded, and your -diff- command is properly coded.

As for whether to do a logistic model, I think it is reasonable to do that here since the outcome has a fairly low probability, so that the linear probability model and normal theory approximations underlying it are likely to be a poor fit to the data. The -diff- command does not do that, so you will have to code the logistic regression yourself and then interpret the results.

Code:

logit fareevasion i.treatment##i.time

The coefficient of 1.treatment#1.time will be your difference in differences estimator of the intervention effect in the log-odds metric. If you would prefer to have that in probabilities you can follow it with

Code:

margins treatment, dydx(time) pwcompare
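Put together, the whole sequence might look like the sketch below, which assumes the variable names used in this thread. Because the model contains only two binary predictors and their interaction, it is saturated: the predicted probabilities reproduce the observed cell proportions, so the probability-scale point estimate should agree with the one from the linear model, and only the standard errors differ.

```stata
* Sketch, assuming the variable names used in this thread.
logit fareevasion i.treatment##i.time

* Redisplay the same fit as odds ratios; the interaction row is then a
* ratio of odds ratios (after vs. before, treatment relative to control).
logit, or

* Predicted probability of fare evasion in each treatment-by-time cell
margins treatment#time

* Before-to-after change in probability within each station, and the
* pairwise difference between those changes (the DiD in probabilities)
margins treatment, dydx(time) pwcompare
```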
Jeremy CELSE

Join Date: May 2019

Posts: 6
#6

21 May 2019, 02:07

Hello

Thanks for the feedback.
Yes I know about the design and it seems complicated to solve these issues.

If I understood the code correctly, the following command "i.treatment##i.time" will be equivalent to regressing fareevasion on treatment, time & the interaction of treatment with time?

Thanks for the code regarding the marginal effects. Since it is a logit model, should I focus on the odds ratio?

Interestingly, the impact is significant for the first wave but not for the second, and thus if I pool the data from both waves nothing significant emerges...

Thanks in advance

Wish you a pleasant day

J
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#7

21 May 2019, 08:20

"If I understood the code correctly, the following command "i.treatment##i.time" will be equivalent to regressing fareevasion on treatment, time & the interaction of treatment with time?"

Yes. By the way "i.treatment##i.time" is an expression, not a command.

"Since it is a logit model, should I focus on the odds ratio?"

I would consider this a matter of taste. The log odds is the natural metric of the logistic regression model. The difficulty is that many people are uncomfortable with log-odds and odds and odds ratios and don't understand them. So it depends on who the audience for this work is. If they are conversant with odds ratios, then, yes, I would present the results in those terms. But if they are not, or if it is a mixed audience, I would also present the results in terms of probabilities, which people tend to grasp more easily.