  • Three-tier Multilevel Model: violation of the assumptions of normality and homoscedasticity


    Dear colleagues, I am working with educational data, using the classic three-level hierarchical linear model (student, class, and school).
    I am using Stata 17 for the analyses. When I perform the residual analysis, the assumptions of homoscedasticity and normality are not met. Here is the fitted model:

    Code:
    xtmixed pt_ex_9mat gen_alun rep_alun comp_alun b4.educ_ee gen_prof b1.idad_prof nro_alun_turm b1.sase_esc b1.reg_esc area_esc || id_esc: || id_turm:, mle var

    Comments:
    1. The dependent variable pt_ex (exam score), despite being continuous, takes only discrete values (0 to 100).
    2. Regarding the independent variables, with the exception of nro_alun_turm (number of students in the class), all are nominal/binary categorical.
    I thought of using a GLM, namely a multilevel Poisson or negative binomial model, but these have infinite support. So, could I try a Gamma, since the two assumptions of the Gaussian model were not met?
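    (For reference, a multilevel Gamma model of the kind contemplated here could be sketched with -meglm-, using the covariates from the model above; the log link is an assumption on my part. Note that the Gamma family requires a strictly positive outcome, so any exact zeros in pt_ex_9mat would have to be dealt with first.)

    Code:
    * Sketch: three-level Gamma GLM with log link (requires pt_ex_9mat > 0)
    meglm pt_ex_9mat gen_alun rep_alun comp_alun b4.educ_ee gen_prof ///
        b1.idad_prof nro_alun_turm b1.sase_esc b1.reg_esc area_esc ///
        || id_esc: || id_turm:, family(gamma) link(log)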

    Can you help me with this? Any suggestions?

    Thanks in advance,



    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(gen_alun rep_alun comp_alun educ_ee) long gen_prof byte idad_prof float nro_alun_turm long sase_esc byte(reg_esc area_esc)
    1 0 0 2 1 1 20 1 2 1
    1 0 0 6 1 1 20 1 2 1
    1 0 0 6 1 1 20 1 2 1
    1 0 0 1 1 1 20 1 2 1
    1 1 0 1 1 1 20 1 2 1
    1 1 0 2 1 1 20 1 2 1
    0 1 0 3 1 1 20 1 2 1
    0 0 0 2 1 1 20 1 2 1
    0 0 0 6 1 1 20 1 2 1
    0 1 0 2 1 1 20 1 2 1
    1 0 0 2 1 1 26 1 2 1
    0 0 0 3 1 1 26 1 2 1
    1 1 0 6 1 1 26 1 2 1
    1 0 0 6 1 1 26 1 2 1
    1 0 0 1 1 1 26 1 2 1
    0 0 0 6 1 1 17 1 2 1
    0 1 0 6 1 1 17 1 2 1
    0 1 0 1 1 1 17 1 2 1
    0 1 0 1 1 1 17 1 2 1
    0 0 0 6 1 1 17 1 2 1
    1 0 0 6 1 1 17 1 2 1
    0 1 0 1 1 1 17 1 2 1
    1 1 0 1 1 1 17 1 2 1
    0 0 0 2 1 1 17 1 2 1
    1 0 0 6 1 1 17 1 2 1
    1 0 0 1 1 1 17 1 2 1
    0 0 0 2 1 1 17 1 2 1
    0 1 1 1 1 1 17 1 2 1
    1 0 0 2 1 1 17 1 2 1
    1 0 0 1 1 1 17 1 2 1
    0 1 0 4 1 1 17 1 2 1
    0 0 0 4 1 1 17 1 2 1
    0 1 0 3 1 1 26 1 2 1
    1 1 0 1 1 1 26 1 2 1
    1 1 0 4 1 1 26 1 2 1
    1 0 0 2 1 1 26 1 2 1
    0 1 0 5 1 1 26 1 2 1
    0 1 0 1 1 1 26 1 2 1
    1 1 0 6 1 1 26 1 2 1
    0 0 1 2 1 1 26 1 2 1
    1 0 0 4 1 1 26 1 2 1
    1 0 0 2 1 1 26 1 2 1
    0 0 0 6 1 1 26 1 2 1
    0 1 0 1 1 1 26 1 2 1
    1 1 0 6 1 1 26 1 2 1
    1 1 0 1 1 1 26 1 2 1
    1 0 0 1 1 1 26 1 2 1
    1 0 0 3 1 1 26 1 2 1
    1 0 0 3 1 1 26 1 2 1
    0 0 0 6 1 1 26 1 2 1
    0 1 0 1 1 1 26 1 2 1
    1 1 1 6 1 1 26 1 2 1
    1 0 0 1 1 2 20 2 2 1
    0 1 0 2 1 2 20 2 2 1
    1 1 0 1 1 2 20 2 2 1
    1 0 0 1 1 2 20 2 2 1
    1 0 0 6 1 2 20 2 2 1
    1 0 0 2 1 2 20 2 2 1
    1 0 0 1 1 2 20 2 2 1
    1 0 0 6 1 2 20 2 2 1
    1 0 0 1 1 2 20 2 2 1
    0 0 0 3 1 2 20 2 2 1
    0 1 0 3 1 2 20 2 2 1
    1 1 0 6 1 2 20 2 2 1
    0 0 0 6 1 2 20 2 2 1
    0 1 0 3 1 2 20 2 2 1
    1 0 0 3 1 2 20 2 2 1
    1 0 0 3 1 2 20 2 2 1
    1 0 0 6 1 2 20 2 2 1
    1 1 0 6 1 2 20 2 2 1
    1 0 0 1 1 2 20 2 2 1
    0 1 0 2 1 2 20 2 2 1
    0 1 0 1 0 1 21 1 2 1
    0 1 0 1 0 1 21 1 2 1
    0 1 1 2 0 1 21 1 2 1
    1 0 0 1 0 1 21 1 2 1
    0 1 0 1 0 1 21 1 2 1
    0 1 0 1 0 1 21 1 2 1
    1 0 0 6 0 1 21 1 2 1
    1 1 0 1 0 1 21 1 2 1
    1 1 0 6 0 1 21 1 2 1
    1 0 0 1 0 1 21 1 2 1
    1 0 1 1 0 1 21 1 2 1
    1 0 0 4 0 1 21 1 2 1
    0 0 0 3 0 1 21 1 2 1
    1 0 1 5 0 1 21 1 2 1
    1 0 0 6 0 1 21 1 2 1
    1 0 0 2 0 1 21 1 2 1
    1 0 0 1 0 1 21 1 2 1
    0 0 0 1 0 1 21 1 2 1
    1 0 0 6 0 1 21 1 2 1
    0 1 0 6 0 2 21 1 2 1
    0 0 0 2 0 2 21 1 2 1
    0 0 0 6 0 2 21 1 2 1
    0 0 0 2 0 2 21 1 2 1
    1 0 0 2 0 2 21 1 2 1
    1 1 0 4 0 2 21 1 2 1
    1 1 0 2 0 2 21 1 2 1
    0 0 0 3 0 2 21 1 2 1
    1 0 0 2 0 2 21 1 2 1
    end
    label values gen_alun nomegen_alun
    label def nomegen_alun 0 "Feminino", modify
    label def nomegen_alun 1 "Masculino", modify
    label values rep_alun rep_alun_2
    label def rep_alun_2 0 "Não", modify
    label def rep_alun_2 1 "Sim", modify
    label values comp_alun comp_alun_2
    label def comp_alun_2 0 "Não", modify
    label def comp_alun_2 1 "Sim", modify
    label values educ_ee educ_ee_nova_2
    label def educ_ee_nova_2 1 "Sem habilitação", modify
    label def educ_ee_nova_2 2 "1º ciclo", modify
    label def educ_ee_nova_2 3 "3º ciclo", modify
    label def educ_ee_nova_2 4 "Secundário", modify
    label def educ_ee_nova_2 5 "Ensino superior", modify
    label def educ_ee_nova_2 6 "Não sabe", modify
    label values gen_prof gen_prof_2
    label def gen_prof_2 0 "Feminino", modify
    label def gen_prof_2 1 "Masculino", modify
    label values idad_prof idad_prof_test_3
    label def idad_prof_test_3 1 "Até 40 anos", modify
    label def idad_prof_test_3 2 "De 40 a 50 anos", modify
    label values sase_esc sase_esc_2
    label def sase_esc_2 1 "Grupo 1", modify
    label def sase_esc_2 2 "Grupo 2", modify
    label values reg_esc nomeid_regiao
    label def nomeid_regiao 2 "Nordeste", modify
    label values area_esc nomeid_area
    label def nomeid_area 1 "Interior", modify


    Residual plots (four attached images: Imagem2.jpg, Imagem3.jpg, Imagem5.jpg, Imagem7.jpg).





  • #2
    If you use cluster robust standard errors, you don't have to worry about these distributional problems, and that will also deal with non-independence.

    As for the dependent variable being discrete and taking on integer values between 0 and 100, that, too, can be ignored. While your example data do not include the outcome variable, from the graphs you show I infer that it takes on many of those values, and that is good enough. I would only be deterred if only a small number of the possible values between 0 and 100 were actually instantiated: that would be a truly discrete variable. If you think about it, in any finite data sample only finitely many values are observed, but unless they are few in number and spaced very far apart, this does not prevent treating the variable as continuous.
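    (As a quick check of this, one can count how many distinct score values actually occur; a sketch, where -distinct- is a community-contributed command installable from SSC:)

    Code:
    * list the observed values of the exam score and their frequencies
    tab pt_ex_9mat
    * or simply count the distinct values (after: ssc install distinct)
    distinct pt_ex_9mat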

    As for the distributions in GLMs having infinite support, again, in any real-world data sample, only finitely many outcome values occur. This is just not an issue.

    The distributional issues you raise are those used in mathematical statistics to derive the sampling distributions of the coefficients and standard errors and justify the use of normal-theory inference in large samples. In real world data, it is extraordinarily rare that these assumptions hold up to a fine-grained level of exploration. But these procedures are robust to reasonable amounts of violation of those assumptions, and the use of robust (or, with nested data, cluster robust) standard errors deals with it adequately. As a practical matter, none of the departures from assumptions you show are sufficiently gross to be of concern.



    • #3

      Thank you, Clyde. Two questions: 1) Regarding the dependent variable being discrete or continuous: although it takes values between 0 and 100, these are multiples of 4 (0, 4, 8, 12, 16, ..., 100), since the grade (DV) is the score on a 25-question exam. So can I continue to treat it as a continuous variable? 2) Is "vce(robust)" the Stata code to get cluster-robust errors? I tried the two commands below, but the model is still heteroscedastic.

      Code:
      xtmixed en_ex_9mat gen_alun rep_alun comp_alun b4.educ_ee gen_prof b1.idad_prof nro_alun_turm b1.sase_esc b1.reg_esc area_esc || id_esc: || id_turm:, mle var cluster(id_esc)
      xtmixed pt_ex_9mat gen_alun rep_alun comp_alun b4.educ_ee gen_prof b1.idad_prof nro_alun_turm b1.sase_esc b1.reg_esc area_esc || id_esc: || id_turm:, mle var vce(robust)

      Note: I put the random effect at level 3. However, since the model has three levels, would I also have to cluster at level 2?



      • #4
        Originally posted by Ricardo Linhares View Post
        Clyde, what do you think?



        • #5
          Not -vce(robust)-. -vce(cluster id_esc)-.

          And using a (cluster-)robust standard error does not remove the heteroscedasticity and non-normality. It makes them ignorable: the standard errors are valid even in the face of those distributional issues, and also in the presence of intra-class correlation.

          Regarding the dependent variable being discrete or continuous, although it assumes values between 0 and 100, these are multiples of 4 (0, 4, 8, 12, 16, ..., 100), since the grade (DV) is the score on a 25-question exam.
          It's still fine to treat the dependent variable as if it were continuous.
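          (Concretely, applied to the model from #1, this suggestion amounts to something like the following sketch:)

          Code:
          * Cluster-robust standard errors at the top (school) level
          xtmixed pt_ex_9mat gen_alun rep_alun comp_alun b4.educ_ee gen_prof ///
              b1.idad_prof nro_alun_turm b1.sase_esc b1.reg_esc area_esc ///
              || id_esc: || id_turm:, mle var vce(cluster id_esc)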



          • #6
            Originally posted by Clyde Schechter View Post
            Good afternoon Clyde, thank you for your help. Let me see if I understand: using the robust standard errors "vce(cluster id_esc)", should I still be concerned about the problems of heteroscedasticity and normality? Another question: if the standard errors of level 2 (class) and level 3 (school) are correlated, would this imply problems in the diagnostics, specifically in the residual analysis? Thank you very much.



            • #7
              Let me see if I understand: using the robust standard errors "vce(cluster id_esc)", should I still be concerned about the problems of heteroscedasticity and normality?
              Using -vce(cluster id_esc)- you can ignore heteroscedasticity and normality; these standard errors will be correct regardless.

              If the standard errors of level 2 (class) and level 3 (school) are correlated, would this imply problems in the diagnostics, specifically in the residual analysis?
              I don't understand this question. What does it mean for the standard errors of levels 2 and 3 to be correlated? They are just single numbers, not variables.
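              (For the residual diagnostics under discussion, the level-1 residuals and the empirical Bayes estimates of the school- and class-level random intercepts can be obtained after estimation along these lines; a sketch, assuming the model from #1 has just been fit:)

              Code:
              predict rhat, residuals           // level-1 (student) residuals
              predict re_esc re_turm, reffects  // random intercepts: school, then class
              predict yfit, fitted              // fitted values including random effects
              qnorm rhat                        // normality check for level-1 residuals
              scatter rhat yfit                 // heteroscedasticity check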
