Comparing grades with different scales-zscore command; is it wise to use it in my case?

Neg Kha

Join Date: Jun 2022
Posts: 68

29 Dec 2023, 08:25

Hi,

I want to examine the relationship between access to housing and grades for a group of students. The issue is that these students study in different fields, and the grade scale is not the same across fields or even within fields.
Let's say I have the dataset below: (id=identifier, totalpoints=grade, housing=binary variable for access to housing which is missing in most cases)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float totalpoints str31 Course_code str10 exam_date float(id housing)
  15 "EEMN21" "2022-10-27"  1 .
  12 "EEMN21" "2022-10-27"  2 1
  23 "EEMN21" "2022-10-27"  3 0
  25 "EEMN21" "2022-10-27"  4 .
  20 "EEMN21" "2022-10-27"  5 .
  15 "EEMN21" "2022-10-27"  6 .
21.5 "EEMN21" "2022-10-27"  7 .
  21 "EEMN21" "2022-10-27"  8 .
26.5 "EEMN21" "2022-10-27"  9 .
   5 "BUSN47" "2023-03-10" 10 .
   8 "BUSN47" "2023-03-10" 11 .
   5 "BUSN47" "2023-03-10" 12 .
   3 "BUSN47" "2023-03-10" 13 .
   6 "BUSN47" "2023-03-03" 14 1
   5 "BUSN47" "2023-03-10" 15 .
   5 "BUSN47" "2023-03-10" 16 .
   4 "BUSN47" "2023-03-10" 17 .
   3 "BUSN47" "2023-03-03" 18 .
   5 "BUSN47" "2023-03-03" 19 .
   5 "BUSN47" "2023-03-03" 20 .
   5 "BUSN47" "2023-03-03" 21 .
   7 "BUSN47" "2023-03-03" 22 .
   7 "BUSN47" "2023-03-10" 23 .
   6 "BUSN47" "2023-03-10" 24 .
   7 "BUSN47" "2023-03-10" 25 .
   6 "BUSN47" "2023-03-10" 26 .
   5 "BUSN47" "2023-03-10" 27 .
   6 "BUSN47" "2023-03-10" 28 .
   6 "BUSN47" "2023-03-10" 29 .
   4 "BUSN47" "2023-03-10" 30 .
   5 "BUSN47" "2023-03-10" 31 .
   7 "BUSN47" "2023-03-03" 32 .
   4 "BUSN47" "2023-03-03" 33 .
   5 "BUSN47" "2023-03-03" 34 .
   4 "BUSN47" "2023-03-10" 35 .
   4 "BUSN47" "2023-03-10" 36 .
   6 "BUSN47" "2023-03-10" 37 .
   6 "BUSN47" "2023-03-10" 38 .
   5 "BUSN47" "2023-03-10" 39 .
   6 "BUSN47" "2023-03-10" 40 .
   5 "BUSN47" "2023-03-10" 41 .
   2 "BUSN47" "2023-03-10" 42 .
   6 "BUSN47" "2023-03-10" 43 .
   7 "BUSN47" "2023-03-10" 44 .
   4 "BUSN47" "2023-03-10" 45 .
   5 "BUSN47" "2023-03-10" 46 .
   5 "BUSN47" "2023-03-10" 47 .
   5 "BUSN47" "2023-03-10" 48 .
   6 "BUSN47" "2023-03-10" 49 .
   5 "BUSN47" "2023-03-10" 50 .
   6 "BUSN47" "2023-03-10" 51 .
   5 "BUSN47" "2023-03-10" 52 .
   5 "BUSN47" "2023-03-10" 53 .
   7 "BUSN47" "2023-03-10" 54 .
   5 "BUSN47" "2023-03-10" 55 .
   5 "BUSN47" "2023-03-10" 56 .
   3 "BUSN47" "2023-03-10" 57 .
   5 "BUSN47" "2023-03-10" 58 .
   5 "BUSN47" "2023-03-10" 59 .
   7 "BUSN47" "2023-03-10" 60 .
end

Note that each course could have more than one exam date.
I was wondering what my outcome variable should be. Should I standardize the grade variable first (using the zscore command) and then use it as an outcome? Can I use "difference from course average" as an outcome?

Best


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17719
#2

29 Dec 2023, 09:35

Neg:
I would go:

Code:

. encode Course_code, g(num_Course_code)

. logit housing i.num_Course_code##c.totalpoints

note: 1.num_Course_code != 0 predicts success perfectly;
1.num_Course_code omitted and 1 obs not used.

outcome = totalpoints <= 12 predicts data perfectly
r(2000);

The disappointing results are due to the limited number of observations in the excerpt (unavoidably so) and the massive missingness of the regressand (-housing-).

Last edited by Carlo Lazzaro; 29 Dec 2023, 09:47.

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30141
#3

29 Dec 2023, 09:50

I would not use "difference from course average" as an outcome because the variation of scores within courses may also differ, not just the mean values. I do think this is an instance where using a fully standardized variable is appropriate, even necessary.
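A minimal sketch of the difference between the two candidate outcomes, assuming every student's grade in a course is available (as in the first -dataex- example). The six rows are a small subset of that example, so the means and SDs are not the real course values; the new variable names (course_mean, course_sd, diff_from_mean, z_grade) are illustrative. Standardizing by course rather than by course and exam date is also an assumption.

```stata
* Sketch only: a few rows taken from the first -dataex- example
clear
input float totalpoints str6 Course_code float id
15 "EEMN21"  1
12 "EEMN21"  2
23 "EEMN21"  3
 5 "BUSN47" 10
 8 "BUSN47" 11
 3 "BUSN47" 13
end

* Course-level mean and standard deviation of the grades
bysort Course_code: egen course_mean = mean(totalpoints)
bysort Course_code: egen course_sd   = sd(totalpoints)

* "Difference from course average": removes the level, keeps each course's spread
generate diff_from_mean = totalpoints - course_mean

* Fully standardized grade: removes both the level and the spread
generate z_grade = (totalpoints - course_mean) / course_sd

* Both have mean 0 within a course; only z_grade also has SD 1 in every course
tabstat diff_from_mean z_grade, by(Course_code) statistics(mean sd)
```

The SD of diff_from_mean still differs between the two courses, which is the reason given above for preferring the fully standardized variable.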

Now, perhaps a more ideal approach, assuming that there are enough people who take multiple courses in different fields, it might be possible to find a way to "equate" scores in different courses, although I think that relies on an implicit assumption that the same person will show equivalent levels of ability in all courses, which seems implausible to me. The details of doing that, if it is possible, are well beyond my amateurish psychometric abilities. So I won't go on about that, but if you have a psychometrician available to talk to, he or she might be able to offer a better approach than just standardization.

But you have another gigantic problem: all those missing values on your focal predictor variable, housing. If the example you show is at all like the full data set, it is missing in the vast majority of observations. Unless you have a real handle on the real-world mechanism that causes all that missingness, there is no way you can have confidence that your available data isn't severely biased, yet have no way of even guessing in which direction, let alone by how much. This data set seems very unfit for purpose.
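One way to at least describe the missingness before modelling, sketched on a few rows from the first -dataex- example. The indicator name housing_missing is illustrative. This only shows where the gaps are; it cannot reveal the mechanism behind them.

```stata
* Sketch only: a few rows taken from the first -dataex- example
clear
input float totalpoints str6 Course_code float(id housing)
15 "EEMN21"  1 .
12 "EEMN21"  2 1
23 "EEMN21"  3 0
25 "EEMN21"  4 .
 6 "BUSN47" 14 1
 5 "BUSN47" 15 .
 5 "BUSN47" 16 .
end

* Indicator: 1 if housing is missing, 0 otherwise
generate byte housing_missing = missing(housing)

* How much of each course has no housing information?
tabulate Course_code housing_missing, row

* Within each course, do grades differ by whether housing is observed?
* (Done by course because the grade scales are not comparable across courses.)
bysort Course_code housing_missing: summarize totalpoints
```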

Added: Crossed with #2.

Neg Kha

Join Date: Jun 2022
Posts: 68

29 Dec 2023, 09:58

Originally posted by Carlo Lazzaro

Neg:
I would go:

Code:

. encode Course_code, g(num_Course_code)

. logit housing i.num_Course_code##c.totalpoints

note: 1.num_Course_code != 0 predicts success perfectly;
1.num_Course_code omitted and 1 obs not used.

outcome = totalpoints <= 12 predicts data perfectly
r(2000);

Hi! Thanks for the response. Maybe my question was confusing, but grades are actually the outcome.
Also, in practice, I do not have the grades for everyone in a class. I only have the course's average grade, standard deviation, maximum, and minimum, so I cannot really use course fixed effects (FE).

Let's put my problem in the following form. As the data below show, the scales are completely different. How should I deal with the outcome grade to make it comparable across students?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id byte totalpoints str6 course_code str10 exam_date float(course_mean course_sd max min housing)
1 90 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 0
2 80 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 1
3 66 "NEKG33" "2022-11-28"  66.07692 22.473917 90 21 1
4 23 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
5  6 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
6  5 "EDAA10" "2023-01-09" 30.659575  11.60974 41  0 0
7 35 "MGTO51" "2023-01-13"   38.9375 4.6183724 45 30 1
8 38 "MGTO51" "2023-05-15"   38.9375 4.6183724 45 30 0
end

Last edited by Neg Kha; 29 Dec 2023, 10:02.


Neg Kha

Join Date: Jun 2022

Posts: 68
#5

29 Dec 2023, 10:03

Originally posted by Clyde Schechter

I would not use "difference from course average" as an outcome because the variation of scores within courses may also differ, not just the mean values. I do think this is an instance where using a fully standardized variable is appropriate, even necessary.

Now, perhaps a more ideal approach, assuming that there are enough people who take multiple courses in different fields, it might be possible to find a way to "equate" scores in different courses, although I think that relies on an implicit assumption that the same person will show equivalent levels of ability in all courses, which seems implausible to me. The details of doing that, if it is possible, are well beyond my amateurish psychometric abilities. So I won't go on about that, but if you have a psychometrician available to talk to, he or she might be able to offer a better approach than just standardization.

But you have another gigantic problem: all those missing values on your focal predictor variable, housing. If the example you show is at all like the full data set, it is missing in the vast majority of observations. Unless you have a real handle on the real-world mechanism that causes all that missingness, there is no way you can have confidence that your available data isn't severely biased, yet have no way of even guessing in which direction, let alone by how much. This data set seems very unfit for purpose.

Added: Crossed with #2.

Thank you Clyde!

Yes, it makes sense. My sample consists of students who participated in a lottery to get some sort of housing. I want to look at the LATE, that is, whether housing is associated with grades among the lottery participants. I hope it now makes more sense why I am doing this.
I have updated the example to show what my dataset actually looks like (see below). I only have the class average grade, SD, maximum, and minimum, and I was hoping to make a sensible comparison based on those.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id byte totalpoints str6 course_code str10 exam_date float(course_mean course_sd max min housing)
1 90 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 0
2 80 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 1
3 66 "NEKG33" "2022-11-28"  66.07692 22.473917 90 21 1
4 23 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
5  6 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
6  5 "EDAA10" "2023-01-09" 30.659575  11.60974 41  0 0
7 35 "MGTO51" "2023-01-13"   38.9375 4.6183724 45 30 1
8 38 "MGTO51" "2023-05-15"   38.9375 4.6183724 45 30 0
end
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17719
#6

29 Dec 2023, 10:05

Neg:
my bad.
However, as is often the case, I share the main point that Clyde made: the real plague here is the massive missingness of your main predictor (which I previously mistook for the dependent variable).
As you know, Stata's estimation commands apply listwise deletion; therefore, every observation with a missing value in any variable of the model will be omitted from the analysis you coded.
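A small sketch of what listwise deletion does to the estimation sample, using ten rows from the first -dataex- example. The regression is run only to read off e(N); it mixes two grade scales and is not meant as an analysis.

```stata
* Sketch only: ten rows taken from the first -dataex- example
clear
input float totalpoints str6 Course_code float(id housing)
  15 "EEMN21"  1 .
  12 "EEMN21"  2 1
  23 "EEMN21"  3 0
  25 "EEMN21"  4 .
  20 "EEMN21"  5 .
  15 "EEMN21"  6 .
21.5 "EEMN21"  7 .
  21 "EEMN21"  8 .
26.5 "EEMN21"  9 .
   6 "BUSN47" 14 1
end

* Observations in memory versus complete cases on the model variables
count
count if !missing(totalpoints, housing)

* The estimation command silently keeps only the complete cases
quietly regress totalpoints i.housing
display "Observations used by regress: " e(N)
```

Here only the three students with a non-missing housing value enter the regression; the other seven are dropped without any warning in the output header beyond the reported number of observations.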

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30141
#7

29 Dec 2023, 10:09

OK, that looks much more suitable for analysis. I would "standardize" the grade with -gen score = (totalpoints-course_mean)/course_sd- and use that as the outcome variable.
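Applied to the example data from #4, the suggestion could look like the sketch below. The regression at the end is an assumption added for illustration (the thread does not specify a model), and with eight observations it cannot say anything substantive. It also assumes that course_mean and course_sd describe the whole course rather than a single exam date, as the example data suggest.

```stata
* Example data from #4
clear
input float id byte totalpoints str6 course_code str10 exam_date float(course_mean course_sd max min housing)
1 90 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 0
2 80 "NEKG33" "2022-10-20"  66.07692 22.473917 90 21 1
3 66 "NEKG33" "2022-11-28"  66.07692 22.473917 90 21 1
4 23 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
5  6 "FEKG51" "2022-11-08" 11.035433  6.622293 30  4 0
6  5 "EDAA10" "2023-01-09" 30.659575  11.60974 41  0 0
7 35 "MGTO51" "2023-01-13"   38.9375 4.6183724 45 30 1
8 38 "MGTO51" "2023-05-15"   38.9375 4.6183724 45 30 0
end

* Standardized grade: distance from the course mean in course-SD units
generate score = (totalpoints - course_mean) / course_sd

list id course_code totalpoints course_mean course_sd score, noobs

* Illustrative only: standardized grade on the housing indicator
regress score i.housing, vce(robust)
```

The coefficient on housing is then expressed in course standard deviations, which is what makes it comparable across courses with different grade scales.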
