  • Long vs. short format database in an OLS regression

    Hi everybody,

    I have a very simple (I think) question. I have the following database on the time at which someone was interrupted in a conversation and whether the tone of the interruption was angry or not:
    interruption  gender  age  angry  time  id
               3       1   37      1    10   1
               3       1   37      0    15   1
               3       1   37      1    20   1
               2       0   25      0    12   2
               2       0   25      1    18   2
    This database (Table 1) can be written in a short way (Table 2) like this:
    interruption  gender  age  id
               3       1   37   1
               2       0   25   2
    If I want to know the effect of gender on the number of interruptions someone receives in a conversation, a regression would be:

    Code:
    reg interruption gender
    Which should give the same result whether I run it using the data of Table 1 or the data of Table 2. However, I'm not getting the same results. Why?

    While it seems natural to me to use Table 1 to estimate the effect of gender on the tone of the interruption (variable angry) by running

    Code:
    reg angry gender age
    I don't feel it is the right arrangement of the database for the first question.

    Thanks a lot,
    JJ
    Last edited by Jean Jacques; 02 Oct 2022, 04:59.

  • #2
    In 99% of the work you do, you'll need your data to be in long format (the first one). I happen to use wide formats a lot, but that's really situation-specific, so you wanna keep your data in long shape.


    • #3
      Jean Jacques:
      as an aside to Jared's really wise advice, if you have repeated measurements of the same set of variables on the same sample of individuals, you may want to consider -xtreg- instead of -regress-.
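      A minimal sketch of that suggestion, assuming the long dataset from #1 is in memory. The random-effects option is only an illustrative choice, not a recommendation; note that gender and age do not vary within id, so a fixed-effects fit would omit them.
      Code:
      * declare id as the panel identifier (xtreg does not need a time variable)
      xtset id
      * random-effects linear model for the tone of each interruption
      xtreg angry gender age, re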
      Kind regards,
      Carlo
      (Stata 19.0)


      • #4
        Thanks guys! Wouldn't I have multicollinearity if I do that? The reason why I'm doing
        Code:
        reg interruption gender
        in the short database and not in the long one is just that (assuming I ran xtset id beforehand).


        • #5
          Again in rare circumstances, a wide dataset will be useful, but this isn't one of them.

          Also, the wide dataset isn't related to multicollinearity (MC). Stata drops predictors that are multicollinear. Trust me, I've used Stata for 6/7 years and I do panel data econometrics, you wanna go with the long setup here. Most Stata estimation commands will only work with long data. Reshaping is more of a data management tool than something you'll ever need for estimation.


          • #6
            Hey thanks. I wasn't arguing about the convenience of using the long format, but just trying to understand the logic, given that, as I said, doing the regression that I proposed would lead to multicollinearity.

            I mean: how can I estimate the impact of gender on the number of interruptions someone receives using the long format, without multicollinearity, given the database (in long format) that I shared before? I just came up with

            Code:
             reg interruption gender age
            Thanks!


            • #7
              Let me address your original question.

              "Which should give the same result whether I run it using the data of Table 1 or the data of Table 2."
              That is not correct. Using the first dataset, you have a sample size of N = 5 while in the second dataset you have a sample size of N = 2. Other quantities - the means and variances and correlations - will similarly be calculated differently.
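              To see where the two sample sizes come from, a quick check along these lines may help. This is only a sketch, assuming the long dataset from #1 is in memory; the variable name first is made up for illustration.
              Code:
              * number of rows in the long dataset
              count
              * flag exactly one row per distinct id
              egen byte first = tag(id)
              * number of distinct individuals
              count if first
              On the five rows posted in #1, the first count returns 5 and the second returns 2.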

              For the outcome and the independent variable you are using (interruption and gender), the model you are fitting is at the individual level: the variables are by definition the same for every observation of each individual. So you effectively have three copies of the first individual and two copies of the second individual, and the copies are not independent, since the error term is identical.

              For the model you are fitting, there should be a single observation for each individual. The results from the first dataset are incorrect.
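              One way to get there without leaving the long dataset is to flag a single row per individual and fit the model on those rows only. A sketch, assuming the long data from #1 is in memory and using a made-up variable name, onerow:
              Code:
              * flag exactly one row per individual
              egen byte onerow = tag(id)
              * fit the individual-level model on the flagged rows only
              regress interruption gender if onerow
              This uses the same observations as the short dataset, so it should reproduce those estimates. With only the two individuals shown in #1 there are no residual degrees of freedom, so standard errors only become available with more individuals.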


              • #8
                Indeed that's my concern and that's why I'm using the second (short) version of the dataset.
