
  • How to estimate a multilevel model in Stata

    Dear all,

    I would like to investigate what determines price differences in housing across neighborhoods. To do this, I have a dataset with transaction prices, the location of each house, and some neighborhood specifics such as the crime rate, the number of inhabitants, and the density.

    A small example of my dataset is given in the table below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long price str22 local_authority long inhabitants_2001 double(density_per_hectare_2001 crime_rate_1000_2016)
     69950 "WANDSWORTH"             260380     76 192.92572394193104
    178000 "EALING"                 300948   54.2  77.48182410250276
    185000 "SOUTHWARK"              244866  84.86  45.11447077176905
    124300 "HAMMERSMITH AND FULHAM" 165242 100.73  170.9311192069813
    535000 "CITY OF WESTMINSTER"    181286  84.41  166.8689253444833
     54000 "CROYDON"                330587  38.21  84.81882227673805
     58500 "BARNET"                 314564  36.27  81.76396536157985
     81950 "KINGSTON UPON THAMES"   147273  39.54   241.205108879428
    195000 "CITY OF WESTMINSTER"    181286  84.41  166.8689253444833
     65000 "CAMDEN"                 198020  90.85 152.44924755075246
    end
    Since my experience with Stata is limited, I'm wondering if someone could help me. In another study I saw that someone performed a multilevel analysis to estimate whether, for example, the crime rate is a determinant of price differences across neighborhoods.

    The research is conducted on the basis of the following (simplified) model:
    y_ij = β0 + u_j + ε_ij

    where y_ij is the price of the ith house in the jth neighborhood, β0 is the overall mean, u_j is the neighborhood residual, and ε_ij is the error term.

    Could someone explain to me how to estimate this model in Stata with the dataset described above?

    Thanks in advance.

    Berend

  • #2
    First, I assume that the variable local_authority is what you mean when you refer to "neighborhood" in the model. Is that correct?

    Next, the example of the data you show doesn't really lend itself to this kind of analysis because only one neighborhood, City of Westminster, has more than one observation. To really use these models you need multiple observations within each neighborhood, and overall a much larger data set. I assume that your full data will be suitable.

    So, there are two steps. The neighborhood variable has to be encoded numerically: strings won't work as grouping variables in these analyses. Then it's a straightforward application of the -mixed- command.

    Code:
    encode local_authority, gen(neighborhood)
    
    mixed price || neighborhood:
    But multilevel modeling is a complex undertaking, and you will likely want to incorporate other information into your model. Do read the chapter on -mixed- in the [ME] volume of the PDF documentation that comes with your Stata installation. It is clearly written and packed with worked examples covering nearly all of the commonly encountered situations.
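
    For example, adding one of your neighborhood-level variables as a fixed effect would look like this (just a sketch using crime_rate_1000_2016 from your example data; choose whichever predictors matter for your research question):

    Code:
    * random-intercept model with one fixed-effect covariate
    mixed price crime_rate_1000_2016 || neighborhood: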



    • #3
      Dear Clyde,

      Indeed, the variable local_authority is what I referred to as "neighborhood" in my explanation. My dataset originally has 72,000 observations, so there should be enough observations.

      In my model I would like to include 10 such neighborhood specifics. After converting some variables from string to double, etc., I wrote the following code.

      Code:
      mixed price inner_or_outer inhabitants_2001 density_per_hectare_2001 crime_rate_1000_2016 travel_to_work_km average_household_size Employment_Rate Median_Household_Income Percentage_Greenspace Percentage_Social_housing || neighborhood:
      However, this code gives me an endless result of:

      Iteration 0: log likelihood = -1033009.9
      Iteration 1: log likelihood = -1033009.9
      Iteration 2: log likelihood = -1033009.9 (backed up)
      Iteration 3: log likelihood = -1033009.9 (backed up)
      Iteration 4: log likelihood = -1033009.9 (backed up)
      Iteration 5: log likelihood = -1033009.9 (backed up)
      Iteration 6: log likelihood = -1033009.9 (backed up)
      Iteration 7: log likelihood = -1033009.9 (backed up)
      And so on

      Is there something I'm not doing right?

      With kind regards,

      Berend



      • #4
        Well, this says that your model is not identifiable in the data, and you will need to simplify it. The question is which variables to eliminate. The first step is to get Stata to offer you some clues as to which variable(s) it is choking on. So rerun your command specifying the option -iterate(5)- (I chose 5 because you are reaching the failure point at around 5 iterations). Stata will run 5 iterations and then print out its results up to that point. Those results are not valid, but they are often useful for troubleshooting.

        Look for variables whose coefficients are outlandishly large or small, or whose standard errors are outlandishly large or small. Another possibility is that the variance component at the neighborhood level is actually negative or 0: this will also cause this kind of behavior, and in the interim output you will see that the estimate for that variance component is some ridiculously small number, very close to zero. This can happen, by the way, if the variables defined at the neighborhood level provide such a complete explanation of the average price at the neighborhood level that there is nothing left to be explained by the u_j term in your model.

        Then retry your model eliminating the variable(s) that the output shows to be suspicious. If it is the variance component at the neighborhood level that comes out suspicious, then just use ordinary -regress- and forget about the multilevel modeling, as the data are saying that you don't actually have anything going on at the neighborhood level.
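
        For instance, if the neighborhood variance component turns out to be essentially zero, the single-level version of your model is simply ordinary least squares (sketched here with just two of your covariates; substitute whichever predictors you retain):

        Code:
        * single-level model, ignoring the neighborhood level
        regress price density_per_hectare_2001 crime_rate_1000_2016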

        That will typically solve the problem. Occasionally the interim output doesn't give any clues. In that case, you have to just start with a very simple model with one predictor, and keep adding in predictors one at a time for as long as you can until you hit convergence failure again. Evidently, if you are forced to use this approach, you want to start with your most important variable(s) first, and add in the less important ones later.
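
        In practice this means building the model up one predictor at a time, for example (the variable order here is only illustrative; start with whichever predictors are most important to you):

        Code:
        * add predictors one at a time until convergence fails again
        mixed price density_per_hectare_2001 || neighborhood:
        mixed price density_per_hectare_2001 crime_rate_1000_2016 || neighborhood: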



        • #5
          Dear Clyde,

          When I rerun my model with the following specification, these are my results.

          Code:
           
          mixed price inner_or_outer inhabitants_2001 density_per_hectare_2001 crime_rate_1000_2016 travel_to_work_km average_household_size Employment_Rate Median_Household_Income Percentage_Greenspace Percentage_Social_housing || neighborhood:, iterate(5)
          
                            price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          --------------------------+----------------------------------------------------------------
                     inner_or_outer |   13789.55   11055.38     1.25   0.212    -7878.595    35457.69
                   inhabitants_2001 |  -.0912243    .089337    -1.02   0.307    -.2663216     .083873
           density_per_hectare_2001 |   1215.647   437.0021     2.78   0.005     359.1389    2072.156
               crime_rate_1000_2016 |  -148.4757   98.65218    -1.51   0.132    -341.8304    44.87899
                  travel_to_work_km |  -8426.691   4447.447    -1.89   0.058    -17143.53    290.1445
             average_household_size |  -32598.08   36654.53    -0.89   0.374    -104439.6    39243.47
                    Employment_Rate |  -3618.385   885.0264    -4.09   0.000    -5353.005   -1883.766
            Median_Household_Income |   11915.88   1855.767     6.42   0.000     8278.645    15553.12
              Percentage_Greenspace |   937.4832   661.8877     1.42   0.157    -359.7929    2234.759
          Percentage_Social_housing |  -1595.429   724.1227    -2.20   0.028    -3014.683   -176.1745
                              _cons |   158224.9   177413.9     0.89   0.372      -189500    505949.8
          However, when I try to estimate the model with only two variables, this is my result.

          Code:
          mixed price density_per_hectare_2001 travel_to_work_km || neighborhood:
          
          Iteration 0:   log likelihood = -1027785.4  
          Iteration 1:   log likelihood = -1027785.4  (backed up)
          Iteration 2:   log likelihood = -1027785.4  (backed up)
          Iteration 3:   log likelihood = -1027785.4  (backed up)
          Iteration 4:   log likelihood = -1027785.4  (backed up)
          Iteration 5:   log likelihood = -1027785.4  (backed up)
          Iteration 6:   log likelihood = -1027785.4  (backed up)
          Iteration 7:   log likelihood = -1027785.4  (backed up)
          Iteration 8:   log likelihood = -1027785.4  (backed up)
          So just eliminating some of the 'suspicious' variables doesn't solve the problem. However, by chance I found that when I eliminate 20,000 of the 72,000 observations, Stata is able to run the regression. This is not the way to perform a valid regression, but maybe it says something about what went wrong in the earlier regressions.

          Thanks a lot for your help so far.

          Berend

