  • Addressing selection bias when matching individuals' school test scores

    I am investigating how Latino students' English skills at entry to university affect their attainment at completion. In Colombia, students sit standardised university entry and exit exams, each including English, maths and Spanish language components. My dataset covers 130,000 students across 383 universities.

    Using a common identifier, I matched students in my exit-exam cohort of interest to their entry scores. This resulted in a loss of observations (a matching success rate of 35.1%) because the testing agency did not manage to produce common keys for all students; it is not an issue with my matching method per se.
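
    For context, the linkage step looks roughly like this in Stata (studentid and entry_scores.dta are placeholder names, not the actual identifiers or files):

    Code:
    * link each exit-exam record to its entry score via the common identifier
    merge 1:1 studentid using entry_scores, keep(master match)
    gen byte matched = (_merge == 3)   // 1 = entry score found, 0 = unmatched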

    I want to address the potential selection bias this introduces. The difficulty is that the "treatment" (being matched to an entry score) is determined after the outcome variable (the exit exam score) is observed. I am considering different ways of addressing this, but I am unsure how appropriate they are in my case.

    - Propensity score matching or inverse probability weighting. However, my understanding is that I can't use these because the treatment occurs after the outcome variable is observed.

    - Heckman selection model. I would delete the exit scores of students who aren't matched to their entry scores, treating them as missing, as suggested in this thread; a rough sketch of what I have in mind follows below.
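
    For concreteness, here is a minimal sketch of that Heckman setup in Stata, with assumed variable names (exitscore, entry_english and idtype, the type of identity document, are illustrative):

    Code:
    * matched = 1 if the exit record was linked to an entry score
    gen byte matched = !missing(entry_english)

    * the selection equation uses only variables observed for every student;
    * idtype is a tentative exclusion restriction: it shifts the match
    * probability but arguably does not affect the exit score itself
    heckman exitscore entry_english, select(matched = i.idtype)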

    Any thoughts on the validity of these approaches or alternative methods would be greatly appreciated.

    Thank you in advance!
    Last edited by Efan Fisher; 08 Feb 2025, 03:46.

  • #2
    Data 1: StudentID_1 entryscore
    Data 2: StudentID_2 entryscore exitscore

    But StudentID_1 and StudentID_2 only match up 35.1% of the time? Is there reason to believe that the matched IDs are related to scores or something else, or is it just random sampling?



    • #3
      Hi George, after digging further (since writing this post), I found that the failure to match test scores arises because some students use different forms of personal identification between sitting the entry and exit exams. In Colombia there are separate identity cards for children and adults, so the ID a student uses depends on their age when they take the entry exam.

      Students who can't be matched directly via their ID number are instead matched by an algorithm using first name, surname and date of birth. It is this process that is imperfect and leaves some students unmatched.

      I have data on the type of ID each student used when sitting each exam. I am wondering about using this information to model the probability of being matched, via either a Heckman selection model or propensity score weighting; a sketch of the weighting idea follows below.
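
      For instance, an inverse-probability-weighting sketch along these lines, where idtype_exit (the ID type recorded at the exit exam, observed for everyone), universityid and the other variable names are placeholders:

      Code:
      * model the probability of a successful match using information
      * observed for the full exit cohort
      probit matched i.idtype_exit
      predict p_match, pr

      * reweight matched students by the inverse of their match probability
      * so the weighted sample resembles the full exit cohort
      gen ipw = 1/p_match
      regress exitscore entry_english [pweight=ipw] if matched, vce(cluster universityid)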

      Thank you for your help!



      • #4
        Sounds more like selection to me.

        You might think of this as a sort of "attrition bias", though the attrition here is mechanical rather than the result of respondent choices. Still, I'd think the statistical methods to address it are comparable. Random attrition seems plausible, since it stems from a change of ID document rather than from behaviour (though crossing the relevant age boundary might make a good instrument).
        https://www.cesifo.org/en/publicatio...counting-panel
        https://onlinelibrary.wiley.com/doi/....1002/hec.3206
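
        One quick diagnostic before reaching for a full correction: test whether observables from the exit exam predict the match indicator, since under random attrition they shouldn't (variable names as assumed above):

        Code:
        * under random attrition, exit-exam observables should not
        * predict whether a student was matched
        probit matched exitscore i.idtype_exit
        testparm exitscore i.idtype_exit   // joint significance test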



        • #5
          You might check this out too: https://www.statalist.org/forums/for...tribution-test
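
          That thread covers distribution tests; the analogous check here would be comparing the exit-score distributions of matched and unmatched students, e.g. (variable names as above):

          Code:
          * two-sample Kolmogorov-Smirnov test of exit scores,
          * matched vs. unmatched students
          ksmirnov exitscore, by(matched)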



          • #6
            Hi George,

            Thank you for the references. I will try these approaches and report back.

            All the best!
