
  • Plotting and estimating difference across time

    Say my hypothesis is: has the income gap between whites and blacks decreased over time?
    I start with a simple analysis of the raw data: I can tabulate the mean wage for each group in each year and compute the difference. I can also test whether this difference is significant using a t-test, restricting each test to the year being analyzed.

    The issues are:
    1. Plotting this requires some data "destruction" with table, replace or some sort of collapsing, which I would have loved to avoid. I want a plot where the x axis is time and the y axis is the difference in wages between groups. Is there any way to achieve this?

    2. T-testing each year separately cannot answer the question of whether the difference narrows or widens over time. Say in 1990 the difference is 1.8 and this is statistically significant. In 1991 the difference is 1.799 and it is also statistically significant. But is the difference between 1.8 and 1.799 statistically significant? Separate t-tests obviously cannot provide an answer.

    Code example:

    Code:
    clear all
    webuse nlswork, clear
    drop if race == 3
    
    *estimating the difference in means*
    bysort year: ttest ln_wage, by(race)
    
    *graphing the difference*
    collapse ln_wage, by(race year)
    reshape wide ln_wage, i(year) j(race)
    rename ln_wage1 ln_wage_white
    rename ln_wage2 ln_wage_black
    gen diff = ln_wage_white - ln_wage_black
    
    twoway line diff year, yline(0) ylabel(-0.2(0.1)0.2)

  • #2
    As a broad research question "has the income gap between whites and blacks decreased over time" is fine. But it is not a specific statistical hypothesis that can be tested. It is subject to many different interpretations and conditions, which would require different analyses.

    Moreover, the data set you used is fairly complex. Different people are surveyed at different times. Consequently the jobs and industries of the black and white populations are different at different time points, as are the age and education distributions and the proportions of union membership. And these covariate distributions may differ between whites and blacks at each time point as well. So a simple contrast of wage by race can be quite misleading. The use of a t-test is, in addition, very problematic here because multiple responses by the same person are not independent, which is a violation of a strong requirement for the use of the Student t-test.

    You can generate yearly comparisons of black and white wages, adjusted for other differences, with the following code:

    Code:
    xtreg ln_wage ib2.race##i.year c.age i.msp i.nev_mar c.grade i.not_smsa i.c_city i.south i.union c.wks_ue c.ttl_exp c.tenure c.hours, re
    
    margins year, dydx(race)
    marginsplot, noci yline(0) ylabel(-.2(0.1).2)
    Notes:
    1. You might prefer -xtgee- over -xtreg, re-; the results are quite similar, though.
    2. The use of ib2.race instead of i.race is to force the difference to be white - black instead of black - white.
    3. The options I have specified after -marginsplot- were for the purpose of styling the graph to look like the one you generated with the code in #1. That said, I don't see much value in having a y-axis that extends way outside the range of the data, and were I doing this for myself, I wouldn't keep any of those options (except maybe -noci-).
    4. The model here assumes linear relationships between ln_wage and the various continuous predictors. Theory or graphical exploration might lead to refinement of those specifications.

    Now, this still hasn't homed in on a test hypothesis. If you wanted to contrast the wage difference in specific years, you could run -margins year, dydx(race) pwcompare- and then examine the specific year contrasts you are interested in.
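
    Because the model is linear, the white - black difference in a given year is the 1.race coefficient plus that year's interaction term, so a contrast between two years reduces to a difference of interaction coefficients. A minimal sketch, assuming the -xtreg- model above is still the active estimation result and that neither year is the base level of i.year (years 80 and 88 are picked purely for illustration):

    Code:
    * Joint test: does the white - black gap vary across years at all?
    testparm i.race#i.year
    
    * Single contrast: gap in 1988 minus gap in 1980
    * (assumes both years are in the estimation sample and neither is the base year)
    lincom 1.race#88.year - 1.race#80.year

    The joint test speaks to "does the gap change at all", while -lincom- targets one pre-specified comparison; which one is appropriate depends on the hypothesis chosen.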

    If you wanted to test a different hypothesis, such as that there was a decline in the difference before 1980 and then it leveled off, then that would require a different model where time is represented not by i.year but by a linear spline with a knot at 1980. Evidently one can dream up numerous hypotheses and craft an appropriate model for testing each of them. The choice of hypothesis is a matter of science. If there is no science to guide the choice of hypothesis, then you are doing exploratory research and should feel free to test many, but then present your results only as the tentative results of exploration needing independent confirmation.
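
    A spline version could be sketched roughly as follows. The variable names yr_pre and yr_post are hypothetical, the covariates from the model above are left out only to keep the example short, and the knot is written as 80 because year is stored as a two-digit value in nlswork:

    Code:
    * Linear spline in year with a single knot at 1980
    * yr_pre and yr_post are illustrative names, not part of the data set
    mkspline yr_pre 80 yr_post = year
    
    * Race main effect, the two spline segments, and race-specific slopes for each segment
    xtreg ln_wage ib2.race c.yr_pre c.yr_post ib2.race#c.yr_pre ib2.race#c.yr_post, re
    
    * Yearly change in the white - black gap before 1980 and from 1980 on
    lincom 1.race#c.yr_pre
    lincom 1.race#c.yr_post

    Under this specification, "declined and then leveled off" would correspond to a negative pre-1980 slope of the gap and a post-1980 slope that is close to zero.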



    • #3
      I think Roger Newson's parmby could help you a lot in achieving your goals. It is part of the package parmest, which you can download from SSC. With parmby you can easily run your regressions year by year and plot the estimates along with their confidence intervals over time.
      Code:
      clear all
      webuse nlswork, clear
      drop if race == 3
      
      * Run the regressions and keep the coefficient of interest
      qui parmby "regr ln_wage i.race", by(year) norestore
      keep if parm == "2.race"
      
      * Plot the estimated difference in log wages over time
      graph twoway (line estimate year, scheme(s2mono)) ///
      (line min95 year, lpattern(dash_dot)) ///
      (line max95 year, lpattern(dash_dot))
      To your questions:
      1. You can enclose the code I gave with preserve and restore. The restore command would bring you back to the stage before you set the preserve. Nothing will be destroyed.
      2. Presenting confidence intervals like I did in the code will circumvent your problem. One can easily see whether zero lies within the confidence interval. If it does not, we can say that the difference is significant at 5%. But you could easily adjust parmby to compute wider or narrower confidence intervals. The 95% CI is just the default.
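
      As a rough illustration of point 1, here is the collapse-based plot from #1 wrapped in preserve and restore. This is only a sketch and is meant to be run as one do-file block, since Stata restores preserved data automatically when a do-file finishes:

      Code:
      webuse nlswork, clear
      drop if race == 3
      
      * Keep a copy of the full panel before reducing it to yearly means
      preserve
      collapse (mean) ln_wage, by(race year)
      reshape wide ln_wage, i(year) j(race)
      * race 1 = white, race 2 = black, so diff is white minus black
      gen diff = ln_wage1 - ln_wage2
      twoway line diff year, yline(0)
      
      * Bring back the original person-year data
      restore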

      EDIT

      Seeing Clyde's answer, you should consider my post as a "way to achieve a certain graph" rather than as a solution to your statistical problem. Clyde's way is definitely the better one, and with it you could also take into account the dependence over time within individuals by using the cluster option.
      Last edited by Roberto Liebscher; 04 Mar 2017, 15:22.



      • #4
        Thank you both for the comments and suggestions!

        Clyde: I know that one should account for the differences in various other covariates if I wish to test the hypothesis. What I want is to first look at some sort of "unconditional means" (other than conditional on the given year and race, of course) as an exploratory analysis, and then continue on while also including in the model the differences in education, union membership, marital status, etc.
        I should also note that my research question is not the wage gap between blacks and whites, and the Stata data example is not the data I actually use. It is just a close enough example for my hypothesis about differences between groups across time in some outcome measure with repeated-measures (panel) data.
        I thought about using an interaction between years and race but wasn't sure about this. I see that this produces the same difference as t-testing each year, which assures me it is indeed "unconditional", as I wanted.
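
        A minimal sketch of that unconditional interaction model, with no covariates other than race and year. Clustering on idcode, the person identifier in nlswork, is an assumption added here to allow for repeated observations on the same person; it changes the standard errors but not the estimated yearly differences:

        Code:
        webuse nlswork, clear
        drop if race == 3
        
        * Saturated race-by-year model: point estimates match the yearly mean differences
        * vce(cluster idcode) is an assumed choice for handling repeated measures
        regress ln_wage ib2.race##i.year, vce(cluster idcode)
        
        * White - black difference in each year, plotted over time
        margins year, dydx(race)
        marginsplot, yline(0)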

        Roberto: Thanks for parmby, I did not know of it. It looks very useful indeed!
