
  • How to estimate a multilevel model in Stata

    Dear all,

    I would like to investigate what determines price differences in housing across neighborhoods. To do this, I have a dataset with transaction prices, the location of each house, and some neighborhood specifics such as the crime rate, the number of inhabitants, and the density.

    A small example of my dataset is given in the table below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long price str22 local_authority long inhabitants_2001 double(density_per_hectare_2001 crime_rate_1000_2016)
     69950 "WANDSWORTH"             260380     76 192.92572394193104
    178000 "EALING"                 300948   54.2  77.48182410250276
    185000 "SOUTHWARK"              244866  84.86  45.11447077176905
    124300 "HAMMERSMITH AND FULHAM" 165242 100.73  170.9311192069813
    535000 "CITY OF WESTMINSTER"    181286  84.41  166.8689253444833
     54000 "CROYDON"                330587  38.21  84.81882227673805
     58500 "BARNET"                 314564  36.27  81.76396536157985
     81950 "KINGSTON UPON THAMES"   147273  39.54   241.205108879428
    195000 "CITY OF WESTMINSTER"    181286  84.41  166.8689253444833
     65000 "CAMDEN"                 198020  90.85 152.44924755075246
    end
    Since my experience with Stata is limited, I'm wondering if someone could help me. In another study I saw that someone performed a multilevel analysis to estimate whether, for example, the crime rate is a determinant of price differences across neighborhoods.

    The research is conducted on the basis of the following (simplified) model:
    y_ij = β0 + u_j + ε_ij

    where y_ij is the price of the ith house in the jth neighborhood, β0 is the overall mean, u_j is the neighborhood residual, and ε_ij is the error term.

    Could someone explain to me how to estimate this model in Stata with the dataset described above?

    Thanks in advance.

    Berend

  • #2
    First, I assume that the variable local_authority is what you mean when you refer to "neighborhood" in the model. Is that correct?

    Next, the example of the data you show doesn't really lend itself to this kind of analysis because only one neighborhood, City of Westminster, has more than one observation. To really use these models you need multiple observations within each neighborhood, and overall a much larger data set. I assume that your full data will be suitable.

    So, there are two steps. The neighborhood variable has to be encoded numerically: strings won't work as grouping variables in these analyses. Then it's a straightforward application of the -mixed- command.

    Code:
    encode local_authority, gen(neighborhood)
    
    mixed price || neighborhood:
    But multilevel modeling is a complex undertaking, and you will likely want to incorporate other information into your model. Do read the chapter on -mixed- in the [ME] volume of the PDF documentation that comes with your Stata installation. It is clearly written and packed with worked examples covering nearly all of the commonly encountered situations.
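
    For example, adding one of your neighborhood-level variables as a fixed effect would look like this (just a sketch using crime_rate_1000_2016 from your example data; choose whichever predictors matter for your research question):

    Code:
    * random-intercept model with one fixed-effect covariate
    mixed price crime_rate_1000_2016 || neighborhood: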



    • #3
      Dear Clyde,

      Indeed, the variable local_authority is what I referred to as "neighborhood" in my explanation. My dataset originally has 72,000 observations, so there should be enough observations.

      In my model I would like to include 10 such neighborhood specifics. After converting some variables from string to double, etc., I wrote the following code.

      Code:
      mixed price inner_or_outer inhabitants_2001 density_per_hectare_2001 crime_rate_1000_2016 travel_to_work_km average_household_size Employment_Rate Median_Household_Income Percentage_Greenspace Percentage_Social_housing || neighborhood:
      However, this code gives me an endless result of:

      Iteration 0: log likelihood = -1033009.9
      Iteration 1: log likelihood = -1033009.9
      Iteration 2: log likelihood = -1033009.9 (backed up)
      Iteration 3: log likelihood = -1033009.9 (backed up)
      Iteration 4: log likelihood = -1033009.9 (backed up)
      Iteration 5: log likelihood = -1033009.9 (backed up)
      Iteration 6: log likelihood = -1033009.9 (backed up)
      Iteration 7: log likelihood = -1033009.9 (backed up)
      And so on

      Is there something I'm not doing right?

      With kind regards,

      Berend



      • #4
        Well, this says that your model is not identifiable in the data, and you will need to simplify it. The question is which variables to eliminate. The first step is to get Stata to offer you some clues as to which variable(s) it is choking on. So rerun your command specifying the option -iterate(5)- (I chose 5 because you are reaching the failure point at around 5 iterations). Stata will run 5 iterations and then print out its results up to that point. Those results are not valid, but they are often useful for troubleshooting.

        Look for variables whose coefficients are outlandishly large or small, or whose standard errors are outlandishly large or small. Another possibility is that the variance component at the neighborhood level is actually negative or 0: this will also cause this kind of behavior, and in the interim output you will see that the estimate for that variance component is some ridiculously small number, very close to zero. This can happen, by the way, if the variables defined at the neighborhood level provide such a complete explanation of the average price at the neighborhood level that there is nothing left to be explained by the u_j term in your model.

        Then retry your model eliminating the variable(s) that the output shows to be suspicious. If it is the variance component at the neighborhood level that comes out suspicious, then just use ordinary -regress- and forget about the multilevel modeling, as the data are saying that you don't actually have anything going on at the neighborhood level.
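
        For instance, if the neighborhood variance component turns out to be essentially zero, the single-level version of your model is simply ordinary least squares (sketched here with just two of your covariates; substitute whichever predictors you retain):

        Code:
        * single-level model, ignoring the neighborhood level
        regress price density_per_hectare_2001 crime_rate_1000_2016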

        That will typically solve the problem. Occasionally the interim output doesn't give any clues. In that case, you have to just start with a very simple model with one predictor, and keep adding in predictors one at a time for as long as you can until you hit convergence failure again. Evidently, if you are forced to use this approach, you want to start with your most important variable(s) first, and add in the less important ones later.
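
        In practice this means building the model up one predictor at a time, for example (the variable order here is only illustrative; start with whichever predictors are most important to you):

        Code:
        * add predictors one at a time until convergence fails again
        mixed price density_per_hectare_2001 || neighborhood:
        mixed price density_per_hectare_2001 crime_rate_1000_2016 || neighborhood: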



        • #5
          Dear Clyde,

          When I rerun my model with the following specification, these are my results.

          Code:
           
          mixed price inner_or_outer inhabitants_2001 density_per_hectare_2001 crime_rate_1000_2016 travel_to_work_km average_household_size Employment_Rate Median_Household_Income Percentage_Greenspace Percentage_Social_housing || neighborhood:, iterate(5)
          
                            price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          --------------------------+----------------------------------------------------------------
                     inner_or_outer |   13789.55   11055.38     1.25   0.212    -7878.595    35457.69
                   inhabitants_2001 |  -.0912243    .089337    -1.02   0.307    -.2663216     .083873
           density_per_hectare_2001 |   1215.647   437.0021     2.78   0.005     359.1389    2072.156
               crime_rate_1000_2016 |  -148.4757   98.65218    -1.51   0.132    -341.8304    44.87899
                  travel_to_work_km |  -8426.691   4447.447    -1.89   0.058    -17143.53    290.1445
             average_household_size |  -32598.08   36654.53    -0.89   0.374    -104439.6    39243.47
                    Employment_Rate |  -3618.385   885.0264    -4.09   0.000    -5353.005   -1883.766
            Median_Household_Income |   11915.88   1855.767     6.42   0.000     8278.645    15553.12
              Percentage_Greenspace |   937.4832   661.8877     1.42   0.157    -359.7929    2234.759
          Percentage_Social_housing |  -1595.429   724.1227    -2.20   0.028    -3014.683   -176.1745
                              _cons |   158224.9   177413.9     0.89   0.372      -189500    505949.8
          However, when I try to estimate the model with only two variables, this is my result.

          Code:
          mixed price density_per_hectare_2001 travel_to_work_km || neighborhood:
          
          Iteration 0:   log likelihood = -1027785.4  
          Iteration 1:   log likelihood = -1027785.4  (backed up)
          Iteration 2:   log likelihood = -1027785.4  (backed up)
          Iteration 3:   log likelihood = -1027785.4  (backed up)
          Iteration 4:   log likelihood = -1027785.4  (backed up)
          Iteration 5:   log likelihood = -1027785.4  (backed up)
          Iteration 6:   log likelihood = -1027785.4  (backed up)
          Iteration 7:   log likelihood = -1027785.4  (backed up)
          Iteration 8:   log likelihood = -1027785.4  (backed up)
          So just eliminating some of the 'suspicious' variables doesn't solve the problem. However, by chance I found that when I eliminate 20,000 of the 72,000 observations, Stata is able to run the regression. This is not the way to perform a valid regression, but maybe it says something about what went wrong in the earlier regressions.

          Thanks a lot for your help so far.

          Berend

