
  • Trouble working with panel data and including too many dummy variables

    I am analyzing public and charter high schools in the state of California. I am interested in Charter school performance compared to non charter schools, controlling for other things like geographic area, class size, demographics. These are my variables:
    Independent

    CurrentExpensePerADA Expenditure per unit of average daily attendance (ADA)
    CurrentExpensePerADA_Sqr CurrentExpensePerADA^2
    CharterSchool =1 if charter, =0 if public
    Rural =1 if rural, =0 if other
    Town =1 if town, = 0 if other
    Suburb =1 if suburb, =0 if other
    Urban =1 if urban, =0 if other
    CharterRural interaction effect CharterSchool*Rural
    CharterTown interaction effect CharterSchool*Town
    CharterSuburb interaction effect CharterSchool*Suburb
    lnTotalStudentsAllGrades log transformed total student population
    FreeandReduced % of Free or Reduced lunch students at school
    AsianMajority =1 if >30% Asian, =0 if <30% Asian
    HispanicMajority =1 if >50% Hispanic, =0 if <50% Hispanic
    BlackMajority =1 if >30% Black, =0 if <30% Black
    WhiteMajority =1 if >50% White, =0 if <50% White
    CharterAsian interaction effect CharterSchool*AsianMajority
    CharterHispanic interaction effect CharterSchool*HispanicMajority
    CharterBlack interaction effect CharterSchool*BlackMajority
    CharterWhite interaction effect CharterSchool*WhiteMajority
    lnPupilTeacherRatio log transformed pupils per teacher ratio
    Dependent

    APIBaseScore California API base score, 200-1000 (mainly test scores)

    Previously, I had race demographics entered as percentages, but made them binary because of non-linearity problems. I log transformed my non-binary variables TotalStudents and PupilTeacherRatio, and that fixed my linearity issue, according to the scatterplot I ran. Luckily FreeandReduced was already linear with API, because a log transform wouldn't work with so many 0 values in this variable. Does this data set structure make sense, and is it okay that most of my right-hand side variables are binary? Here is an OLS regression I ran for just the year 2010:
    reg Y_Dep X_Ind, robust
    [Attached image: Linear Robust (1).png]

    Another question: Because my variable of interest, CharterSchool, is time invariant, I ran random effects so it doesn't drop out. I am hoping I can find something more interesting using panel data from years 2006-2009, because with a cross-section there are too many limitations in the model (e.g., self-selection bias) to determine causality. I am quite inexperienced working with panel data, but here is the model I ran. Is this sufficient given my set of variables?



    Note: I declared the panel with xtset School Year
    xtreg Y_Dep X_Ind i.Year, re vce(cluster School)
    [Attached image: Random Effects Regression.png]


    I know that my model has several limitations, but I was hoping to capture some interesting things with my interaction effects. I also included panel data to try to make my results more convincing. Is my direction sensible? I feel like there might be a fatal flaw in my methods or that I'm not on the right path, but maybe I've been staring at this for too long. Thank you so much in advance for your help. I hope this isn't too long and that my questions aren't too broad.

    Last edited by Logan Valdez; 05 Apr 2019, 21:05.

  • #2
    I would say that overall you're on the right track. The biggest worry with this kind of model is the appropriateness of linearity, and you seem to have paid considerable attention to that. I probably would not have handled the ethnicity variables the way you did: I dislike making dichotomies out of continuous variables, and I would have found another way to transform them. But the proof of the pudding is in the eating. Your OLS model shows a nice R2. I trust you will explore plots of predicted vs observed values from your random effects model. That's really what counts. You do have a large number of variables, which raises the worry of overfitting noise, but at a ratio of about 70 observations per predictor you're probably OK there as well.
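
    A predicted-vs-observed check after a random effects model could look something like the sketch below. It runs on simulated data so that it is self-contained; the data-generating values and the reduced predictor list are assumptions for illustration, not your actual data or model, so substitute your own variables.

```stata
* Simulated stand-in for the school panel; all values are hypothetical
clear
set seed 12345
set obs 200
generate School = _n
generate CharterSchool = runiform() < 0.2
generate u_school = rnormal(0, 30)          // school-level random effect
expand 4                                    // four yearly records per school
bysort School: generate Year = 2005 + _n    // years 2006-2009
generate FreeandReduced = 100 * runiform()
generate APIBaseScore = 800 - 1.5 * FreeandReduced + 10 * CharterSchool ///
    + u_school + rnormal(0, 20)

xtset School Year
xtreg APIBaseScore CharterSchool FreeandReduced i.Year, re vce(cluster School)

* xb  = linear prediction from the fixed part only
* xbu = fixed part plus the predicted school-level effect
predict yhat_xb, xb
predict yhat_xbu, xbu

* Observed vs predicted, with a 45-degree reference line
twoway (scatter APIBaseScore yhat_xb, msize(tiny)) ///
       (function y = x, range(yhat_xb)),           ///
       xtitle("Predicted API") ytitle("Observed API") legend(off)
```

    If the points bend systematically away from the reference line at low or high predicted values, that suggests some nonlinearity is still unaccounted for.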

    Looking at the output, the only thing that worries me is the huge coefficient for CharterAsian in both models. It's just so out of proportion to any of the main effects. Is there any good reason to expect such a large interaction effect there? I'm guessing it happened because there are very few charter schools that have an Asian majority. If I'm right about that, it would be another argument against dichotomizing the ethnicity variables and in favor of finding a good way to represent them continuously. If no simple transformation is effective in linearizing those relationships, consider using cubic splines. They make interpretation of the results more difficult, but if you are not trying to come up with a simple predictive model for ethnicity effects and are more interested in making qualitative statements about the effects of charter schools in certain groups, they can be very useful.
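
    To make those two suggestions concrete, here is a rough sketch: first tabulate the cell counts behind the interaction, then replace the dummy with a restricted cubic spline of the underlying percentage. The data are simulated, and PctAsian is a hypothetical name for the continuous percentage you said you had before dichotomizing; the choice of 4 knots is also just an assumption to start from.

```stata
* Simulated stand-in data; all values are hypothetical
clear
set seed 2019
set obs 1500
generate CharterSchool = runiform() < 0.15
generate PctAsian = 100 * rbeta(1, 6)        // skewed percentage, mostly low
generate AsianMajority = PctAsian > 30
generate APIBaseScore = 700 + 60 * ln(1 + PctAsian) ///
    + 10 * CharterSchool + rnormal(0, 40)

* Step 1: how many charter schools fall in the AsianMajority == 1 cell?
tabulate CharterSchool AsianMajority

* Step 2: restricted cubic spline basis (4 knots give 3 basis variables)
mkspline asian_s = PctAsian, cubic nknots(4)

* Let the shape of the spline differ by charter status
regress APIBaseScore i.CharterSchool##c.(asian_s1 asian_s2 asian_s3), vce(robust)

* Joint test of the charter-by-spline interaction terms
testparm i.CharterSchool#c.(asian_s1 asian_s2 asian_s3)
```

    A very small count in the charter-by-majority cell of the tabulation would support the sparse-cell explanation for the large coefficient. With the spline version, the individual coefficients are hard to read directly, so the joint test and plots of predicted values are usually more informative than the coefficient table.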

    Apart from statistical concerns, while I have no real expertise in this subject matter, as an actively involved citizen and parent of a child who has been educated in both public and private schools, this is an issue I have paid a bit of attention to. It is my impression that, at least on a national scale, the effectiveness of charter schools is highly dependent on the effectiveness with which they are regulated by government. Lax regulation leads to scamming and poor results, but careful oversight seems to promote good results. I gather your data are all from California, but my understanding is that California law governing charter schools leaves considerable discretion in these matters to local school districts. So I would expect a fair amount of regulatory variation, and I would expect that here, too, it would be one of the determinants of educational outcomes. So I'm wondering if there is some data you might use to represent that effect in your modeling. Just a thought.
