  • Create index from Likert-scaled variables

    Hello everyone,

    I have barely any experience with Stata; so far I've only familiarized myself with the basics and know a few commands. Maybe someone can help me with this:
    For my paper, I want to analyze how populist opinions influence voting behavior. I'm replicating a study of the German election in 2017 with new data for the election in 2021.
    The data set I'm using has 15 Likert-scaled variables measuring populist opinions, with answers ranging from 1 (fully agree) to 5 (fully disagree).
    First, I'd like to create an index that gives me one average value per person. If I understand correctly, I first have to recode the missing-value codes via mvdecode q51c, mv(-99 = .) and then generate the index by adding up all the variables and dividing the sum by 15.
    I've tried doing this with three variables first, to check if I'm doing it correctly or not. In the screenshot below, you see the results I got, but I'm not exactly sure how to interpret them and know if I'm on the right path or not.

    [Screenshot: Bildschirmfoto 2022-08-15 um 21.02.43.png]

    Thanks.

  • #2
    generate an index by adding all variables
    In Stata, the asterisk * denotes multiplication, the plus sign + denotes addition. So that needs to be corrected in your code.

    The average of a collection of 1-5 variables should be between 1 and 5.
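
    A minimal sketch of the corrected three-item calculation. It assumes the test items are q51a, q51b and q51c (the names used later in this thread) and that -99 is the only missing-value code; the variable name index3_sum is made up for illustration.

    ```stata
    * Sketch only: item names and the -99 code are assumptions.
    * Recode -99 to system missing (.) in the three items.
    mvdecode q51a q51b q51c, mv(-99)

    * Add the items with +, then divide by the number of items.
    generate index3_sum = (q51a + q51b + q51c) / 3

    * Check: minimum and maximum should lie between 1 and 5.
    summarize index3_sum
    ```

    Because generate propagates missing values, index3_sum is missing for anyone who skipped at least one of the three items.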



    • #3


      You need to think a bit about dealing with the missing values beyond just commendably changing -99 to system missing value (.). If somebody has responded to, say, only 4 of the 15 items, do you really want to use the average of just those four as the index value for that person? On the other hand, calculating the mean by adding up the items and dividing by their number, as you have done with three items, takes the opposite extreme: applied to all 15 items, a person who failed to respond to even one item gets a missing value for the index, and you lose all the information in the 14 items that were answered. It might be more reasonable to set some cutoff, such as perhaps 12 items, and calculate the index only if the person has answered 12 or more items. That can be done as follows:
      Code:
      egen nmcount = rownonmiss(q51*)
      egen index = rowmean(q51*) if nmcount >= 12
      (12 is not a magic number here. I chose it because it is 80% of 15, and an 80% response rule of thumb is commonly used in some fields. You may prefer a different cutoff.)
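
      One point about ordering may be worth spelling out: rownonmiss() treats -99 as a valid answer, so the recoding has to come before the count. A sketch of the full sequence with two quick checks, assuming q51* matches exactly the 15 items and -99 is the only missing-value code:

      ```stata
      * Sketch only: the q51* pattern and the -99 code are assumptions.
      * Recode first, otherwise -99 is counted as an answered item.
      mvdecode q51*, mv(-99)

      * Number of answered items per respondent.
      egen nmcount = rownonmiss(q51*)

      * Mean of the answered items, only for those with 12 or more answers.
      egen index = rowmean(q51*) if nmcount >= 12

      * How many items were answered, and how many respondents get no index?
      tabulate nmcount
      count if missing(index)
      ```

      The tabulate output shows how sensitive the sample size is to the cutoff before committing to a particular one.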

      Combining items into an index like this is usually reserved for situations where the items demonstrably measure indicators or aspects of the same construct, i.e. they show high internal consistency. Assuming this is your situation, a statistically better approach might be to use multiple imputation to deal with non-response. But as you are new to Stata and multiple imputation is pretty complicated, probably just put that idea on your to-do list for when you are more experienced.
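
      Whether the items hang together well enough to be averaged is commonly checked with Cronbach's alpha, which Stata provides through the alpha command. A sketch, again assuming q51* matches exactly the 15 items and that -99 has already been recoded to missing:

      ```stata
      * Sketch only: the q51* pattern is an assumption.
      * Overall alpha plus item-level statistics for the 15 items.
      alpha q51*, item
      ```

      The item option adds a row per item, which helps spot items that fit poorly with the rest or are scored in the opposite direction.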



      • #4
        You may find it easier to use Stata's -egen- command, which accepts a varlist:

        Code:
        egen index3 = rowmean(q51a - q51c)
        This will be more useful with larger numbers of variables.



        • #5
          If you really want the row mean of 15 variables, you are probably best off using the "rowmean" function of the -egen- command. However, read the help file carefully to see whether what it does with missing values is what you want done. And yes, you need to deal with the missing-value codes first (as you said). See:
          Code:
          h egen
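
          To see what rowmean() does with missing values before applying it to real data, a toy example can help. The data and the variable names a, b, c and m below are made up for illustration and are not from this thread; note that clear drops whatever data are in memory, so run this in a fresh session.

          ```stata
          * Toy data: three hypothetical items, with some answers missing.
          clear
          input a b c
          1 2 3
          1 . 3
          . . .
          end

          * rowmean() averages over the non-missing values in each row.
          egen m = rowmean(a b c)
          list
          ```

          The first two rows both get m = 2, because the missing answer in row 2 is simply ignored; only the row with no answers at all gets a missing m.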



          • #6
            Originally posted by Clyde Schechter (#3)
            Thank you for the precise answer, Clyde. I'll look into it!



            • #7
              Originally posted by Hemanshu Kumar (#4)
              Thank you, I'll try using it!



              • #8
                Originally posted by William Lisowski (#2)
                Oh you're right, I somehow mixed that up. Thanks!



                • #9
                  Originally posted by Rich Goldstein (#5)
                  Thank you, I'll look into it!
