  • Addressing selection bias when matching individuals' school test scores

    I am investigating how Latino students' English skills at entry to university affect their attainment at completion. In Colombia, students sit standardised university entry and exit exams, each including English, maths and Spanish language components. My dataset covers 130,000 students across 383 universities.

    Using a common identifier, I matched students in my exit-exam cohort of interest to their entry scores. This resulted in a loss of observations (a matching success rate of 35.1%) because the testing agency did not manage to produce common keys for all students; it is not an issue with my matching method per se.
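
    For context, the linkage step looks roughly like this in Stata (studentid and entry_scores.dta are placeholder names, not the actual identifiers or files):

    Code:
    * link each exit-exam record to its entry score via the common identifier
    merge 1:1 studentid using entry_scores, keep(master match)
    gen byte matched = (_merge == 3)   // 1 = entry score found, 0 = unmatched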

    I want to address the potential selection bias this introduces. The difficulty is that the "treatment" (being matched to an entry score) is determined after the outcome variable (the exit exam score) is observed. I am considering different ways of addressing this, but I am unsure how appropriate they are in my case.

    - Propensity score matching or inverse probability weighting. However, my understanding is that I can't use these because the treatment occurs after the outcome variable is observed.

    - Heckman selection model. I would delete the exit scores of students who aren't matched to their entry scores, treating them as missing, as suggested in this thread; a rough sketch of what I have in mind follows below.
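
    For concreteness, here is a minimal sketch of that Heckman setup in Stata, with assumed variable names (exitscore, entry_english and idtype, the type of identity document, are illustrative):

    Code:
    * matched = 1 if the exit record was linked to an entry score
    gen byte matched = !missing(entry_english)

    * the selection equation uses only variables observed for every student;
    * idtype is a tentative exclusion restriction: it shifts the match
    * probability but arguably does not affect the exit score itself
    heckman exitscore entry_english, select(matched = i.idtype)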

    Any thoughts on the validity of these approaches or alternative methods would be greatly appreciated.

    Thank you in advance!
    Last edited by Efan Fisher; 08 Feb 2025, 03:46.

  • #2
    Data 1: StudentID_1 entryscore
    Data 2: StudentID_2 entryscore exitscore

    But StudentID_1 and StudentID_2 only match up 35.1% of the time? Is there reason to believe that the matched IDs are related to scores or something else, or is it just random sampling?



    • #3
      Hi George, after digging further (since writing this post), I found that the failure to match test scores arises because some students use different forms of personal identification between sitting the entry and exit exams. In Colombia there are separate identity cards for children and adults, so the ID a student uses depends on their age when they take the entry exam.

      Students who can't be matched directly via their ID number are instead matched by an algorithm using first name, surname and date of birth. It is this process that is imperfect and leaves some students unmatched.

      I have data on the type of ID each student used when sitting each exam. I am wondering about using this information to model the probability of being matched, via either a Heckman selection model or propensity score weighting; a sketch of the weighting idea follows below.
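
      For instance, an inverse-probability-weighting sketch along these lines, where idtype_exit (the ID type recorded at the exit exam, observed for everyone), universityid and the other variable names are placeholders:

      Code:
      * model the probability of a successful match using information
      * observed for the full exit cohort
      probit matched i.idtype_exit
      predict p_match, pr

      * reweight matched students by the inverse of their match probability
      * so the weighted sample resembles the full exit cohort
      gen ipw = 1/p_match
      regress exitscore entry_english [pweight=ipw] if matched, vce(cluster universityid)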

      Thank you for your help!



      • #4
        Sounds more like selection to me.

        You might think of this as a sort of "attrition bias", though the attrition here is mechanical rather than the result of respondent choices. Still, I'd think the statistical methods to address it are comparable. Random attrition seems plausible, since it stems from a change of ID document rather than from behaviour (though crossing the relevant age boundary might make a good instrument).
        https://www.cesifo.org/en/publicatio...counting-panel
        https://onlinelibrary.wiley.com/doi/....1002/hec.3206
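
        One quick diagnostic before reaching for a full correction: test whether observables from the exit exam predict the match indicator, since under random attrition they shouldn't (variable names as assumed above):

        Code:
        * under random attrition, exit-exam observables should not
        * predict whether a student was matched
        probit matched exitscore i.idtype_exit
        testparm exitscore i.idtype_exit   // joint significance test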



        • #5
          You might check this out too: https://www.statalist.org/forums/for...tribution-test
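
          That thread covers distribution tests; the analogous check here would be comparing the exit-score distributions of matched and unmatched students, e.g. (variable names as above):

          Code:
          * two-sample Kolmogorov-Smirnov test of exit scores,
          * matched vs. unmatched students
          ksmirnov exitscore, by(matched)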



          • #6
            Hi George,

            Thank you for the references. I will try these approaches and report back.

            All the best!
