create index from likert scaled variables

Lucy Block

Join Date: Aug 2022

Posts: 6
#1

15 Aug 2022, 13:19

Hello everyone,

I have very little experience with Stata; so far I have only familiarized myself with the basics and know a few commands. Maybe someone can help me with this:
For my paper, I want to analyze how populist opinions influence voting behavior. I'm replicating a study of the 2017 German election using new data from the 2021 election.
The data set I'm using contains 15 Likert-scaled variables measuring populist opinions, with answers ranging from 1 (fully agree) to 5 (fully disagree).
First, I'd like to create an index that gives one average value per person. If I understand correctly, I first have to recode the missing values via mvdecode q51c, mv(-99 = .) and then generate the index by adding all the variables and dividing the sum by 15.
I tried this with three variables first to check whether I'm doing it correctly. The screenshot below shows the results I got, but I'm not sure how to interpret them or whether I'm on the right track.

Thanks.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

15 Aug 2022, 13:40

"generate an index by adding all variables"

In Stata, the asterisk * denotes multiplication and the plus sign + denotes addition, so your code needs to use + rather than *.

The average of a collection of 1-5 variables should be between 1 and 5.
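
For illustration, here is a minimal sketch of the corrected three-item version. It assumes the items are named q51a, q51b and q51c, that -99 is the only missing-value code, and it uses a hypothetical variable name, index3_manual:

```stata
* Assumption: the three items are q51a, q51b, q51c and -99 is the only missing code
mvdecode q51a q51b q51c, mv(-99=.)

* Add the items with +, then divide by the number of items
generate index3_manual = (q51a + q51b + q51c) / 3

* Sanity check: the minimum and maximum should lie between 1 and 5
summarize index3_manual
```

If -summarize- reports values outside 1 to 5, either a missing-value code is still in the data or the formula is not an average.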
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

15 Aug 2022, 14:15

You need to think a bit about dealing with the missing values beyond just commendably changing -99 to system missing value (.). If somebody has responded, say, to only 4 of the 15 items, do you really want to use the average of just those four as an index value for that person? On the other hand, calculating that mean by adding up the items and dividing by 3 as you have done takes the other extreme position: if a person has failed to respond to even one item, you will get missing value for your index variable and lose all the information in the 14 items that were answered. It might be more reasonable to set some cutoff, such as perhaps 12 items, and say that you will calculate the index if the person has answered 12 or more items, but not otherwise. That can be done as follows:

Code:

egen nmcount = rownonmiss(q51*)
egen index = rowmean(q51*) if nmcount >= 12

(12 is not a magic number here. I chose it because it is 80% of 15, and an 80% response rule of thumb is commonly used in some fields. You may prefer a different cutoff.)
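
As a hedged sketch of the whole workflow with a few checks added: the lines below assume that q51* matches exactly the 15 numeric items and nothing else, and that -99 is the only missing-value code. Neither assumption is confirmed by the original post, so inspect the -describe- output before going further.

```stata
* Assumptions: q51* matches exactly the 15 items; -99 is the only missing code
describe q51*
mvdecode q51*, mv(-99=.)

* Start fresh in case nmcount or index already exist from an earlier attempt
capture drop nmcount
capture drop index

* Count answered items per respondent, then average only if 12 or more were answered
egen nmcount = rownonmiss(q51*)
egen index = rowmean(q51*) if nmcount >= 12

* How many items did respondents answer, and how many have no index value?
tabulate nmcount
count if missing(index)

* The index should lie between 1 and 5
summarize index
```

The -tabulate- output shows how many respondents a given cutoff would exclude, which may help in choosing one.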

Combining items into an index like this is usually reserved for situations where the items demonstrably measure indicators or aspects of the same construct, i.e. they show high internal consistency. Assuming this is your situation, a statistically better approach might be to use multiple imputation to deal with non-response. But as you are new to Stata and multiple imputation is pretty complicated, probably just put that idea on your to-do list for when you are more experienced.
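
One common way to check how well the items hang together is Cronbach's alpha, available in Stata through the -alpha- command. A minimal sketch, assuming q51* matches exactly the 15 items and that -99 has already been recoded to missing:

```stata
* Assumption: q51* matches exactly the 15 items, with -99 already recoded to missing
* The item option adds item-test and item-rest correlations and
* the alpha obtained when each item is left out
alpha q51*, item
```

What counts as high enough is a judgment call that varies by field.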
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1396
#4

15 Aug 2022, 14:18

You may find it easier to use Stata's -egen- command, which accepts a varlist:

Code:

egen index3 = rowmean(q51a - q51c)

This will be more useful with larger numbers of variables.
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#5

15 Aug 2022, 14:18

If you really want the row mean of 15 variables, you are probably best off using the rowmean() function of the -egen- command. However, read the help file carefully to check whether its handling of missing values is what you want. And yes, you need to deal with the missing-value codes first, as you said. See:

Code:

h egen
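
To see the missing-value behavior directly, here is a small sketch on toy data. The values and the names id, a, b, c, m_egen and m_manual are made up for illustration and are not from the survey; note that -clear- discards the data currently in memory.

```stata
* Toy data (hypothetical values), not the survey data
* Warning: clear drops whatever data set is currently in memory
clear
input id a b c
1 1 2 3
2 4 . 5
3 . . .
end

* rowmean() averages the nonmissing values in each row
egen m_egen = rowmean(a b c)

* Plain arithmetic returns missing as soon as one item is missing
generate m_manual = (a + b + c) / 3

list
```

For the second respondent, rowmean() averages the two answered items while the manual formula yields missing; for the third, with no answers at all, both are missing.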
Lucy Block

Join Date: Aug 2022

Posts: 6
#6

15 Aug 2022, 22:24

Originally posted by Clyde Schechter (#3)

Thank you for the precise answer, Clyde. I'll look into it!
Lucy Block

Join Date: Aug 2022

Posts: 6
#7

15 Aug 2022, 22:25

Originally posted by Hemanshu Kumar (#4)

Thank you, I'll try using it!
Lucy Block

Join Date: Aug 2022

Posts: 6
#8

15 Aug 2022, 22:29

Originally posted by William Lisowski (#2)

Oh you're right, I somehow mixed that up. Thanks!
Lucy Block

Join Date: Aug 2022

Posts: 6
#9

15 Aug 2022, 22:30

Originally posted by Rich Goldstein (#5)

Thank you, I'll look into it!