Can anyone help me for the scaling procedure of PISA WLE scores?

Doris Lin

Join Date: Sep 2019

Posts: 2
#1

08 Sep 2019, 06:54

Dear Sir/Madam,

I am interested in the scaling procedure used for the WLE scores in the PISA 2015 database. The PISA 2015 technical report describes how the WLE scores are computed as follows:

"For the regular scales, international item and person parameters were obtained from a GPCM (see formula 16.3) in a single analysis based on data from all persons in all countries using the mdltm software (von Davier, 2008). For each scale, only persons with a minimum number of three valid responses were included. Students were weighted using the final student weight (W_FSTUWT), and all countries contributed equally to the estimation. Additional analyses on the invariance of item parameters across countries and languages were conducted and unique parameters were assigned if necessary (see section “Cross-country comparability” in this chapter). Once this process was finished, weighted likelihood estimates (WLE; Warm, 1989) were used as individual participant scores and transformed to an international metric with an OECD mean of zero and an OECD standard deviation of one."

I tried to compute the WLE scores in Stata 14.2 but have not been able to reproduce them.
The command I used is:

irt gpcm varlists [weight=w_fstuwt]

The results are shown in the first picture below, but the "Discrim" values do not match those in the PISA technical report (shown in the second picture below).
Can anyone help me? What is going wrong? And, following this procedure, how do I obtain the WLE scores in the next step?
May I use the "predict" command?

Thank you very much for help.

Sincerely,
Doris
Dany Shakeel

Join Date: Aug 2019

Posts: 24
#2

18 Dec 2019, 07:15

Doris, were you able to resolve the issue in Stata? I am also working with PISA data and want to construct WLE scores.
Dany
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

19 Dec 2019, 12:50

I don't have access to the data. However, do note what Doris said:

"Additional analyses on the invariance of item parameters across countries and languages were conducted and unique parameters were assigned if necessary (see section “Cross-country comparability” in this chapter). Once this process was finished, weighted likelihood estimates (WLE; Warm, 1989) were used as individual participant scores and transformed to an international metric with an OECD mean of zero and an OECD standard deviation of one."

That is, the PISA authors carried out a differential item functioning (DIF) analysis. That section may specify which items they treated as having DIF, and of what type (uniform or non-uniform). You would need that section of the report to replicate their analysis. Without a sample of the actual Stata output, we also can't tell what, if anything, went wrong.

Also, the scores are predicted using WLE. I am not sure if Stata can do this; the options include empirical Bayes means (default) and modes. In the documentation for the R package mirt, I see options corresponding to these two, as well as one for WLE, and one for ML. This leads me to believe that WLE is somehow different from EB means. I'm not sure how. It may not make a substantive difference in the end, but we can't tell.
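For what it is worth, the estimators can be written side by side. This is a sketch from the general IRT literature, not taken from the PISA report or the Stata manual. The symbols are my own: $L(\theta)$ is the likelihood of one person's responses given the item parameters, $\phi(\theta)$ is the assumed population density, and $I(\theta)$ is the test information.

$$
\hat\theta_{\mathrm{EB\,mean}} = \frac{\int \theta\, L(\theta)\,\phi(\theta)\,d\theta}{\int L(\theta)\,\phi(\theta)\,d\theta},
\qquad
\hat\theta_{\mathrm{ML}} = \arg\max_{\theta} L(\theta),
\qquad
\hat\theta_{\mathrm{WLE}}:\ \frac{\partial \log L(\theta)}{\partial \theta} + \frac{J(\theta)}{2\,I(\theta)} = 0
$$

Here $J(\theta)$ is Warm's correction term. For models such as the 2PL, and as far as I can tell the GPCM, $J(\theta) = I'(\theta)$, so the WLE is the value that maximizes $L(\theta)\sqrt{I(\theta)}$. The practical difference is that EB means are pulled toward the population mean, more so for people with few responses, while WLE uses no population prior.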

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Doris Lin

Join Date: Sep 2019

Posts: 2
#4

05 Jan 2020, 04:06

First of all, I am sorry for my late reply. Let me explain my question more clearly.

I used the steps below to generate my WLE scale.
The scale I am interested in is IBTEACH, which is derived from the PISA student questionnaire item "ST098" by fitting an IRT GPCM model and then predicting scores in postestimation.
The PISA 2015 database uses 8 sub-items of ST098 (st098q01ta, st098q02ta, st098q03na, st098q05ta, st098q06ta, st098q07ta, st098q08na, st098q09ta) to create the "IBTEACH" scale.
The sample PISA used includes all persons from all countries, but only persons with a minimum of three valid responses.
(There are 519,334 observations in total in the PISA 2015 database.)

Steps:
1.
The ST098 items, which students answered on a four-point Likert scale, were reverse-coded by PISA, so I first have to recode the answers.
For example, I generate a new variable that equals 1 if st098q01ta is 4, 4 if st098q01ta is 1, and so on.
This reverse-coding has to be done for all 8 sub-items of ST098, so I generated 8 new variables, named q1-q9 (without q4).
Missing values are kept as missing.

To avoid any misunderstanding caused by my English, I post the commands for one item as an example.

. gen q9=0

. replace q9=4 if st098q09ta==1
(102957 real changes made)

. replace q9=3 if st098q09ta==2
(130499 real changes made)

. replace q9=2 if st098q09ta==3
(142211 real changes made)

. replace q9=1 if st098q09ta==4
(58492 real changes made)

. replace q9=.n if st098q09ta==.n
(2107 real changes made, 2107 to missing)

. replace q9=.m if st098q09ta==.m
(19636 real changes made, 19636 to missing)
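The same recoding could probably be written as one loop over all eight items. This is only a sketch: it assumes the items take the values 1 to 4 plus missing codes, it is meant to be run from a do-file (because of the `///` line continuation), and it turns the extended missing codes .n and .m into plain missing rather than preserving them. It also avoids starting from `gen q9=0`, which would leave a 0 for any value that none of the `replace` lines catches.

```stata
* Sketch: reverse-code the eight ST098 items in one loop.
* The names q1-q9 (no q4) follow the post; 5 - x maps 1<->4 and 2<->3.
local items st098q01ta st098q02ta st098q03na st098q05ta ///
            st098q06ta st098q07ta st098q08na st098q09ta

foreach v of local items {
    * characters 7-8 of the variable name hold the item number, e.g. "09"
    local n = real(substr("`v'", 7, 2))
    * anything outside 1-4 (including .n and .m) becomes system missing
    generate byte q`n' = 5 - `v' if inrange(`v', 1, 4)
}
```

A quick `tab st098q09ta q9, missing` afterwards would show whether every original value ended up where it should.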

2.
Because the PISA technical report says "For each scale, only persons with a minimum number of three valid responses were included",
I generated a variable that counts each individual's missing answers among q1 to q9 (without q4).

. egen nmis=rmiss(q1 q2 q3 q5 q6 q7 q8 q9)

. tab nmis

nmis | Freq. Percent Cum.
------------+-----------------------------------
0 | 411,541 79.24 79.24
1 | 22,818 4.39 83.64
2 | 2,932 0.56 84.20
3 | 1,190 0.23 84.43
4 | 1,158 0.22 84.65
5 | 1,017 0.20 84.85
6 | 1,040 0.20 85.05
7 | 1,744 0.34 85.39
8 | 75,894 14.61 100.00
------------+-----------------------------------
Total | 519,334 100.00

. sum ibteach

Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
ibteach | 440,606 .0938612 1.012682 -3.3405 3.1829

The number of observations with nmis<=5 is 440,656 (the N that irt reports in step 3), which is within 50 of the 440,606 observations that have an "IBTEACH" value.
So only these observations are used to fit the IRT GPCM.
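The inclusion rule can also be stated directly in terms of valid responses, which may be easier to check against the wording of the report. A minimal sketch, assuming the recoded items q1-q9 (without q4) from step 1; `nvalid` and `insample` are names made up for this example.

```stata
* Sketch: count valid (non-missing) responses per student
egen byte nvalid = rownonmiss(q1 q2 q3 q5 q6 q7 q8 q9)

* With 8 items, "at least three valid responses" is the same as nmis <= 5
generate byte insample = nvalid >= 3

* How many students qualify, and how many of them have a PISA IBTEACH value
count if insample
count if insample & !missing(ibteach)
```

Comparing the two counts would show which students account for any difference between the two sample sizes.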

3.
Fit the IRT GPCM, weighted by the student weight (variable name: w_fstuwt), to obtain the item parameters.
This step takes a long time, ten minutes or more.

. irt gpcm q1-q9 [iweight= w_fstuwt] if nmis<=5

Fitting fixed-effects model:

Iteration 0: log likelihood = -2.289e+08
Iteration 1: log likelihood = -2.289e+08

Fitting full model:

Iteration 0: log likelihood = -2.107e+08
Iteration 1: log likelihood = -2.036e+08
Iteration 2: log likelihood = -2.032e+08
Iteration 3: log likelihood = -2.032e+08
Iteration 4: log likelihood = -2.032e+08
Iteration 5: log likelihood = -2.032e+08

Generalized partial credit model Number of obs = 440,656
Log likelihood = -2.032e+08
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 |
Discrim | .8882858 .0004929 1802.33 0.000 .8873198 .8892518
Diff |
2 vs 1 | -2.264344 .0012302 -1840.57 0.000 -2.266755 -2.261933
3 vs 2 | -.3883437 .0006823 -569.13 0.000 -.3896811 -.3870063
4 vs 3 | .2314781 .0006642 348.49 0.000 .2301763 .23278
-------------+----------------------------------------------------------------
q2 |
Discrim | 1.004779 .0005474 1835.65 0.000 1.003707 1.005852
Diff |
2 vs 1 | -.6638453 .0005817 -1141.21 0.000 -.6649854 -.6627052
3 vs 2 | 1.557733 .0009308 1673.55 0.000 1.555909 1.559557
4 vs 3 | 1.626717 .0010788 1507.92 0.000 1.624603 1.628832
-------------+----------------------------------------------------------------
q3 |
Discrim | 1.428847 .0007516 1900.97 0.000 1.427374 1.43032
Diff |
2 vs 1 | -.790332 .0004943 -1598.98 0.000 -.7913008 -.7893633
3 vs 2 | .6300767 .0005276 1194.28 0.000 .6290427 .6311107
4 vs 3 | 1.137401 .0006364 1787.14 0.000 1.136154 1.138648
-------------+----------------------------------------------------------------
q5 |
Discrim | 1.448833 .0007651 1893.64 0.000 1.447334 1.450333
Diff |
2 vs 1 | -1.210142 .0005826 -2077.02 0.000 -1.211284 -1.209
3 vs 2 | .3131231 .0004716 663.97 0.000 .3121988 .3140473
4 vs 3 | 1.036768 .0005772 1796.12 0.000 1.035636 1.037899
-------------+----------------------------------------------------------------
q6 |
Discrim | 1.17227 .0006268 1870.35 0.000 1.171041 1.173498
Diff |
2 vs 1 | -1.911364 .000904 -2114.41 0.000 -1.913136 -1.909592
3 vs 2 | -.2076377 .000526 -394.77 0.000 -.2086686 -.2066068
4 vs 3 | .6297037 .0005603 1123.84 0.000 .6286055 .6308019
-------------+----------------------------------------------------------------
q7 |
Discrim | 1.219602 .0006705 1818.90 0.000 1.218288 1.220916
Diff |
2 vs 1 | .0369019 .0005004 73.75 0.000 .0359211 .0378826
3 vs 2 | 1.20278 .0007448 1614.81 0.000 1.20132 1.20424
4 vs 3 | 1.346852 .0008739 1541.16 0.000 1.345139 1.348565
-------------+----------------------------------------------------------------
q8 |
Discrim | 1.332424 .000721 1847.96 0.000 1.331011 1.333837
Diff |
2 vs 1 | -.2280479 .0004643 -491.16 0.000 -.2289579 -.2271379
3 vs 2 | .9579274 .0006303 1519.70 0.000 .9566919 .9591628
4 vs 3 | 1.335969 .0007685 1738.35 0.000 1.334462 1.337475
-------------+----------------------------------------------------------------
q9 |
Discrim | 1.035103 .0005591 1851.30 0.000 1.034007 1.036199
Diff |
2 vs 1 | -1.577677 .0008405 -1876.99 0.000 -1.579324 -1.576029
3 vs 2 | -.0015175 .0005948 -2.55 0.011 -.0026832 -.0003518
4 vs 3 | .5782089 .0006332 913.10 0.000 .5769678 .57945
------------------------------------------------------------------------------
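One possible source of the difference in parameters is the weighting. The report says that all countries contributed equally to the estimation, while raw final student weights give large countries much more influence. A hedged sketch of rescaling the weights so that each country sums to the same total is below. The country identifier `cntryid` is an assumption about the variable name in the student file, `w_equal` and `wsum` are made-up names, and the constant 1000 is arbitrary.

```stata
* Sketch: equal-contribution ("senate"-type) weights within the estimation sample
* Assumption: cntryid identifies the country; nmis comes from step 2
bysort cntryid: egen double wsum = total(w_fstuwt) if nmis <= 5
generate double w_equal = 1000 * w_fstuwt / wsum

* Refit with the rescaled weights; items listed explicitly instead of q1-q9
irt gpcm q1 q2 q3 q5 q6 q7 q8 q9 [iweight = w_equal] if nmis <= 5
```

With importance weights the standard errors depend on the total of the weights, so the very small standard errors and very large z statistics in the output above should probably not be read as real precision.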

4.
Predict the new scale.

. predict nibteach, latent
(option ebmeans assumed)
(using 7 quadrature points)

The correlation table shows that the new IBTEACH I created has a correlation coefficient of 0.9771 with the original PISA 2015 IBTEACH.

. pwcorr nibteach ibteach,star(.01)

| nibteach ibteach
-------------+------------------
nibteach | 1.0000
ibteach | 0.9771* 1.0000
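The last part of the procedure in the report, the transformation to an OECD mean of zero and an OECD standard deviation of one, could be sketched as follows. The names `oecd` (1 for OECD countries) and `cntryid` are assumptions about the variable names in the student file, the new variable names are made up, and giving each OECD country equal weight is my guess at the weighting, not something the quoted passage states.

```stata
* Sketch: standardize the predicted score to OECD mean 0 and SD 1
* Assumptions: oecd == 1 flags OECD countries; cntryid identifies the country
bysort cntryid: egen double wsum_oecd = total(w_fstuwt) if oecd == 1 & !missing(nibteach)
generate double w_oecd = w_fstuwt / wsum_oecd

* w_oecd is missing outside the OECD, so only OECD students enter the summary
summarize nibteach [aweight = w_oecd]
generate double nibteach_z = (nibteach - r(mean)) / r(sd)
```

This is a linear transformation, so it does not change the correlation with ibteach. It only matters when comparing the level and spread of the two scales.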

With the steps and results above, I could not get exactly the same parameter values as those shown on page 314 of the PISA technical report.
And my new IBTEACH scale has a correlation of only 0.9771 with the PISA IBTEACH scale.
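One thing that may be worth checking before comparing the numbers is the parameterization. As far as I know, irt gpcm reports a discrimination $a_i$ and step difficulties $b_{ik}$ in the following form for an item with categories $0,\dots,K$, with $\theta$ fixed to mean 0 and variance 1 in the estimation sample:

$$
\Pr(Y_i = k \mid \theta) = \frac{\exp\left\{\sum_{t=1}^{k} a_i(\theta - b_{it})\right\}}{1 + \sum_{s=1}^{K}\exp\left\{\sum_{t=1}^{s} a_i(\theta - b_{it})\right\}}
$$

If another program identifies the scale differently, so that its latent variable is $\theta^{*} = (\theta - \mu)/\sigma$ for some $\mu$ and $\sigma$ (symbols assumed here for illustration), the same model has the parameters

$$
a_i^{*} = \sigma\, a_i, \qquad b_{ik}^{*} = \frac{b_{ik} - \mu}{\sigma}
$$

so two tables can differ by a linear rescaling even when the fit is identical. Whether the report tabulates step difficulties directly or an item location plus step deviations is something I have not checked.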

I thought I had followed the steps described in the PISA 2015 technical report to create the WLE scale, but I still failed to reproduce it.
If you want to try the PISA 2015 data, they can be downloaded from https://www.oecd.org/pisa/data/2015database/.
The dataset is large, so please download it from the URL above.

Thank you all.

lin wei min

Join Date: Nov 2019
Posts: 3
#5

05 Jan 2020, 07:45

Hi, I think it is much easier for others to understand what you plan to do if it is formatted as follows, but you still need to copy the output from your Stata.
1.

Code:

. gen q9=0

. replace q9=4 if st098q09ta==1
(102957 real changes made)

. replace q9=3 if st098q09ta==2
(130499 real changes made)

. replace q9=2 if st098q09ta==3
(142211 real changes made)

. replace q9=1 if st098q09ta==4
(58492 real changes made)

. replace q9=.n if st098q09ta==.n
(2107 real changes made, 2107 to missing)

. replace q9=.m if st098q09ta==.m
(19636 real changes made, 19636 to missing)

2.

Code:

. tab nmis

nmis | Freq. Percent Cum.
------------+-----------------------------------
0 | 411,541 79.24 79.24
1 | 22,818 4.39 83.64
2 | 2,932 0.56 84.20
3 | 1,190 0.23 84.43
4 | 1,158 0.22 84.65
5 | 1,017 0.20 84.85
6 | 1,040 0.20 85.05
7 | 1,744 0.34 85.39
8 | 75,894 14.61 100.00
------------+-----------------------------------
Total | 519,334 100.00

3.

Code:

. irt gpcm q1-q9 [iweight= w_fstuwt] if nmis<=5

Fitting fixed-effects model:

Iteration 0: log likelihood = -2.289e+08
Iteration 1: log likelihood = -2.289e+08

Fitting full model:

Iteration 0: log likelihood = -2.107e+08
Iteration 1: log likelihood = -2.036e+08
Iteration 2: log likelihood = -2.032e+08
Iteration 3: log likelihood = -2.032e+08
Iteration 4: log likelihood = -2.032e+08
Iteration 5: log likelihood = -2.032e+08

Generalized partial credit model Number of obs = 440,656
Log likelihood = -2.032e+08
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 |
Discrim | .8882858 .0004929 1802.33 0.000 .8873198 .8892518
Diff |
2 vs 1 | -2.264344 .0012302 -1840.57 0.000 -2.266755 -2.261933
3 vs 2 | -.3883437 .0006823 -569.13 0.000 -.3896811 -.3870063
4 vs 3 | .2314781 .0006642 348.49 0.000 .2301763 .23278
-------------+----------------------------------------------------------------
q2 |
Discrim | 1.004779 .0005474 1835.65 0.000 1.003707 1.005852
Diff |
2 vs 1 | -.6638453 .0005817 -1141.21 0.000 -.6649854 -.6627052
3 vs 2 | 1.557733 .0009308 1673.55 0.000 1.555909 1.559557
4 vs 3 | 1.626717 .0010788 1507.92 0.000 1.624603 1.628832
-------------+----------------------------------------------------------------
q3 |
Discrim | 1.428847 .0007516 1900.97 0.000 1.427374 1.43032
Diff |
2 vs 1 | -.790332 .0004943 -1598.98 0.000 -.7913008 -.7893633
3 vs 2 | .6300767 .0005276 1194.28 0.000 .6290427 .6311107
4 vs 3 | 1.137401 .0006364 1787.14 0.000 1.136154 1.138648
-------------+----------------------------------------------------------------
q5 |
Discrim | 1.448833 .0007651 1893.64 0.000 1.447334 1.450333
Diff |
2 vs 1 | -1.210142 .0005826 -2077.02 0.000 -1.211284 -1.209
3 vs 2 | .3131231 .0004716 663.97 0.000 .3121988 .3140473
4 vs 3 | 1.036768 .0005772 1796.12 0.000 1.035636 1.037899
-------------+----------------------------------------------------------------
q6 |
Discrim | 1.17227 .0006268 1870.35 0.000 1.171041 1.173498
Diff |
2 vs 1 | -1.911364 .000904 -2114.41 0.000 -1.913136 -1.909592
3 vs 2 | -.2076377 .000526 -394.77 0.000 -.2086686 -.2066068
4 vs 3 | .6297037 .0005603 1123.84 0.000 .6286055 .6308019
-------------+----------------------------------------------------------------
q7 |
Discrim | 1.219602 .0006705 1818.90 0.000 1.218288 1.220916
Diff |
2 vs 1 | .0369019 .0005004 73.75 0.000 .0359211 .0378826
3 vs 2 | 1.20278 .0007448 1614.81 0.000 1.20132 1.20424
4 vs 3 | 1.346852 .0008739 1541.16 0.000 1.345139 1.348565
-------------+----------------------------------------------------------------
q8 |
Discrim | 1.332424 .000721 1847.96 0.000 1.331011 1.333837
Diff |
2 vs 1 | -.2280479 .0004643 -491.16 0.000 -.2289579 -.2271379
3 vs 2 | .9579274 .0006303 1519.70 0.000 .9566919 .9591628
4 vs 3 | 1.335969 .0007685 1738.35 0.000 1.334462 1.337475
-------------+----------------------------------------------------------------
q9 |
Discrim | 1.035103 .0005591 1851.30 0.000 1.034007 1.036199
Diff |
2 vs 1 | -1.577677 .0008405 -1876.99 0.000 -1.579324 -1.576029
3 vs 2 | -.0015175 .0005948 -2.55 0.011 -.0026832 -.0003518
4 vs 3 | .5782089 .0006332 913.10 0.000 .5769678 .57945
------------------------------------------------------------------------------

4.

Code:

. pwcorr nibteach ibteach,star(.01)

| nibteach ibteach
-------------+------------------
nibteach | 1.0000
ibteach | 0.9771* 1.0000

Last edited by lin wei min; 05 Jan 2020, 07:48.


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#6

13 Jan 2020, 13:08

Originally posted by Doris Lin

...
Predict the new scale.

. predict nibteach, latent
(option ebmeans assumed)
(using 7 quadrature points)

Run the correlation table and find the new-IBTEACH I created has 0.9771 correlation coefficient with the original PISA 2015 IBTEACH.

. pwcorr nibteach ibteach,star(.01)

| nibteach ibteach
-------------+------------------
nibteach | 1.0000
ibteach | 0.9771* 1.0000

As the above steps and results, I couldn't get exactly the same parameters values with PISA technical report showed at page 314.
And my new-IBTEACH scale get only 97.71% correlation with the PISA IBTEACH scale.

I thought I have followed the steps described in PISA 2015 technical report to create WLE scale, but I still failed to get the same scales.
...

Doris,

Pages 295-296 of the technical report say this:

Invariance of item parameters. PISA 2015 implemented an innovative approach to test whether equal (invariant) item parameters can be assumed across groups of participating countries and language groups therein. In a first step, groups were defined whereas every country or multiple, sufficiently large samples of examinees taking the same questionnaire language version within the country formed one group each. For regular scales, groups are based on country-by-language combinations; for trend scales, groups are based on cycle-by-country-by-language combinations. A senate-weighted sample size of at least 300 cases was considered sufficiently large to form one group. In a second step, international item and person parameters were estimated based on all examinees across all groups (see section “Scaling procedures”).

Based on this estimation, the root mean square deviance (RMSD) item-fit statistic was calculated for each group and item by:

$$
\mathrm{RMSD} = \sqrt{\int \left(P_o(\theta) - P_e(\theta)\right)^2 f(\theta)\, d\theta} \tag{16.9}
$$

quantifying the difference between the observed item characteristic curve (ICC, $P_o(\theta)$) with the model-based ICC ($P_e(\theta)$). The RMSD statistic is sensitive to the group-specific deviations of both the item difficulty parameters and item slope parameters from the international parameters. Values close to zero indicate good item fit, meaning that the model with international item parameters describes the responses in this group very well. A value of 0.3 was set as a cut-off criterion, with larger values indicating that the international item parameters are not appropriate for this group. Instead, a flagged group was allowed to receive group-specific (unique) item parameters and steps 2 and 3 were repeated until all items exhibited RMSD values smaller than 0.3. The final distribution of RMSD values across groups will be reported for each item along with each of the scales. (For an explanation of the graphical representation, see section “Evaluating cross-country comparability” below.)

In other words, this confirms what I said in post #3. PISA conducted differential item functioning analysis in countries where more than one language was used (e.g. in the US, we might have administered the items in English and Spanish; in Singapore, it might be English, Mandarin, Malay, and Tamil; substitute other countries and languages as needed). I'm not familiar with this technique and there may not be an obvious way to do it in Stata.

I see that the scale you referred to uses ordinal items. You can test for DIF using ordered logistic regression, as I outlined in this post. If you have Stata 16, you could use the likelihood-based method (i.e. the group() option of the irt commands).

In many cases, DIF in some items may not make a huge difference in overall trait estimates (i.e. it might not cause significant differential test functioning). Given the high correlation you found between your estimate of this trait and the one provided by PISA, you may be able to safely ignore DIF.
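A minimal sketch of one common version of the ordered-logit DIF check, for a single item, is below. It is not necessarily the exact procedure from the linked post. The grouping variable `grp` (a country-by-language identifier) is hypothetical, `nibteach` is the trait estimate from the earlier post, and the models are unweighted so that likelihood-ratio tests are available.

```stata
* Sketch: logistic-regression DIF check for item q1
* Hypothetical: grp is a country-by-language group identifier
preserve
keep if !missing(q1, nibteach, grp)    // same sample for all three models

ologit q1 c.nibteach                   // trait only
estimates store base

ologit q1 c.nibteach i.grp             // adds group main effects
estimates store uniform

ologit q1 c.nibteach##i.grp            // adds trait-by-group interactions
estimates store nonuniform

lrtest base uniform                    // evidence of uniform DIF
lrtest uniform nonuniform              // evidence of non-uniform DIF
restore
```

With several hundred thousand observations almost any difference will be statistically significant, so the size of the group coefficients is likely to be more informative than the p-values. With many groups the interaction model may also be slow to fit.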

