Long vs. short format database in an OLS regression

Jean Jacques

Join Date: Sep 2020

Posts: 97
#1

02 Oct 2022, 04:57

Hi everybody,

I have a very simple (I think) question. I have the following dataset (Table 1) recording the time at which someone was interrupted in a conversation and whether the tone of the interruption was angry or not:

interruption  gender  age  angry  time  id
3             1       37   1      10    1
3             1       37   0      15    1
3             1       37   1      20    1
2             0       25   0      12    2
2             0       25   1      18    2

This dataset can be written in a short form (Table 2) like this:

interruption  gender  age  id
3             1       37   1
2             0       25   2
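For readers who want to reproduce the two tables, here is a minimal sketch of how the short version could be derived from the long one in Stata. It assumes, as in the example, that interruption, gender and age are constant within id and that interruption equals the number of rows per id.

```stata
clear
input interruption gender age angry time id
3 1 37 1 10 1
3 1 37 0 15 1
3 1 37 1 20 1
2 0 25 0 12 2
2 0 25 1 18 2
end

* Optional check (an assumption about the data): the interruption
* count matches the number of rows recorded for each person
bysort id: assert interruption == _N

* Keep only the person-level variables, then drop the repeated rows
keep interruption gender age id
duplicates drop

* One row per id remains, matching the short table
list, noobs
```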

If I want to know the effect of gender on the number of interruptions someone receives in a conversation, the regression would be:

Code:

reg interruption gender

This should give the same result whether I run it on the data in Table 1 or on the data in Table 2. However, I'm not getting the same results. Why?

While it seems natural to me to use Table 1 to estimate the effect of gender on the tone of the interruption (variable angry) by running

Code:

reg angry gender age

I don't feel it is the right arrangement of the data for the first question.

Thanks a lot,
JJ

Last edited by Jean Jacques; 02 Oct 2022, 04:59.
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#2

02 Oct 2022, 05:33

In 99% of the work you do, you'll need your data to be in long format (the first one). I happen to use wide formats a lot, but that's really situation-specific, so you want your data in long shape.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17714
#3

02 Oct 2022, 05:38

Jean Jacques:
as an aside to Jared's really wise advice, if you have repeated measurements of the same set of variables on the same sample of individuals, you may want to consider -xtreg- instead of -regress-.

Kind regards,
Carlo
(Stata 19.0)
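As a rough illustration of the -xtreg- suggestion, the sketch below assumes the long-format data are in memory and are much larger than the five-row example. It models angry, the variable that actually varies within id. The choice of a random-effects linear model is an assumption made here for illustration, not something stated in the thread.

```stata
* Assumes the long-format data (one row per interruption) are in memory
* Declare id as the panel identifier
xtset id

* gender and age are constant within id, so a fixed-effects model (fe)
* would drop them; a random-effects model keeps them
xtreg angry gender age, re
```

Since angry is binary, this is a linear probability model. A panel logit such as -xtlogit- would be an alternative worth considering.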
Jean Jacques

Join Date: Sep 2020

Posts: 97
#4

02 Oct 2022, 06:58

Thanks guys! Wouldn't I have multicollinearity if I do that? The reason why I'm doing

Code:

reg interruption gender

on the short dataset and not on the long one is precisely that (assuming I had run xtset id beforehand).
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#5

02 Oct 2022, 07:24

Again, in rare circumstances a wide dataset will be useful, but this isn't one of them.

Also, the wide dataset isn't related to multicollinearity. Stata drops predictors that are multicollinear. Trust me, I've used Stata for 6 or 7 years and I do panel data econometrics: you want to go with the long setup here. Most Stata estimation commands will only work with long data. Reshaping is more of a data management tool than something you'll ever need for estimation.
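To see how Stata handles perfectly collinear predictors, here is a minimal sketch using the auto dataset that ships with Stata. The variable weight2 is made up for illustration and is an exact multiple of weight.

```stata
sysuse auto, clear

* weight2 is a hypothetical variable, an exact linear function of weight
generate weight2 = 2 * weight

* Stata estimates the model with one of the two variables and
* reports the other as omitted because of collinearity
regress price weight weight2
```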
Jean Jacques

Join Date: Sep 2020

Posts: 97
#6

02 Oct 2022, 07:43

Hey, thanks. I wasn't arguing about the convenience of using the long format, just trying to understand the logic, given that, as I said, running the regression I proposed would lead to multicollinearity.

I mean: how can I estimate the impact of gender on the number of interruptions someone receives using the long format, without multicollinearity, given the long-format dataset I shared before? All I can come up with is

Code:

reg interruption gender age

Thanks!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

02 Oct 2022, 08:07

Let me address your original question.

"Which should give the same result either I run it using the data of Table 1 or the data of Table 2."

That is not correct. Using the first dataset, you have a sample size of N = 5 while in the second dataset you have a sample size of N = 2. Other quantities - the means and variances and correlations - will similarly be calculated differently.

For the outcome and the independent variable you are using - interruption and gender - the model you are fitting is at the individual level: the variables are by definition the same for every observation of each individual. So you effectively have three copies of the first individual and two copies of the second individual, and the copies are not independent - the error term is identical.

For the model you are fitting, there should be a single observation for each individual. The results from the first dataset are incorrect.
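One way to fit the individual-level model without building a separate short dataset is sketched below. It assumes the long-format data are in memory, and the variable name onerow is hypothetical.

```stata
* Assumes the long-format data (several rows per id) are in memory
* Flag exactly one observation per individual
egen byte onerow = tag(id)

* Individual-level model fitted on one row per person
regress interruption gender if onerow

* The estimation sample size now equals the number of individuals
display e(N)
```

With only the two individuals from the example there are no residual degrees of freedom, so the output is only meaningful on a dataset with many more people.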
Jean Jacques

Join Date: Sep 2020

Posts: 97
#8

02 Oct 2022, 08:13

Indeed that's my concern and that's why I'm using the second (short) version of the dataset.