  • Advice on the most efficient way to collapse a Likert scale?

    My data are from a questionnaire, and I need to do a lot of -label define- and -label values-.
    I would like to ask for advice on how to do this more efficiently.

    For example, I have made the following
    label define likert_agree ///
        1 "Strongly disagree"    2 "Somewhat disagree" ///
        3 "Neutral"             4 "Somewhat agree" ///    
        5 "Strongly agree"
    Now I want to collapse 1, 2 into one, and 4, 5 into another one, to make a total of 3 categories for variables using this scale.
    I have been using -gen- and -replace- to combine, but found it very inefficient.

    Please advise. Thanks!

  • #2
    How about something like

    webuse nhanes2f,clear
    recode health (1/2 = 1 "below average") (3 = 2 "average") (4/5 = 3 "above average"), gen(xhealth)
    tab1 xhealth
    See the help for recode.
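
    For several items that share the same scale, a minimal sketch along these lines may help. It assumes hypothetical item names q1-q3 and simulated data (neither comes from this thread), and uses the prefix() and label() options of recode so that every collapsed variable shares one value label:

    ```stata
    * Simulated 5-point items; q1-q3 are hypothetical names, not from the thread
    clear
    set obs 100
    set seed 12345
    foreach v in q1 q2 q3 {
        gen byte `v' = 1 + floor(5*runiform())
    }

    * Collapse 1/2 and 4/5 for all items at once; new variables are c_q1, c_q2, c_q3
    recode q1 q2 q3 (1/2 = 1 "Disagree") (3 = 2 "Neutral") (4/5 = 3 "Agree"), ///
        prefix(c_) label(likert3)
    tab1 c_q1 c_q2 c_q3
    ```

    The originals are left untouched, so the full 5-category detail remains available.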
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)



    • #3
      that's great! Thank you very much.


      • #4
        A related question:

        I would like to combine the results of this into one, with the coding ovob=1, ob=2.
        Essentially, I will be doing analysis only on these two population groups (ovob = overweight and obese population; ob = obese population).

        The best way I have come up with is the following:
        tab bmiiotfz, gen(bmidum)
        gen ovob= (bmidum6+bmidum5)
        gen ob = bmidum6
        Please advise if there's a better way to help improve management/analysis.

        Thank you!


        • #5
          We cannot possibly tell you what will be better for a project and dataset that we cannot know or see. The advice can at best be Delphic: that depends.

          But you are throwing away information in your data step by step. Why do that? Keep as much detail as you are given. A 5-category scale is already a coarse and crude scale.


          • #6
            Thank you, Nick, for your comments. I'm sorry that I didn't explain about my data analysis clearly. I am doing a cross-sectional analysis using the China Health and Nutrition Survey data to examine individual factors that may have contributed to the sex difference in overweight and obese outcomes. To correspond to some of my hypotheses, I am showing only overweight/obese children for some analyses, thus thought about creating variables with only overweight/obese population.

            I'm new to statistics, learned Stata on my own, and am not coding savvy. I truly appreciate all your time to help. Thanks.


            • #7
              Thanks for the extra detail. I am in no sense a biostatistician or medical statistician but my very broad advice remains as above.


              • #8
                I was very much intrigued by the "most efficient" part of the question.

                For all real-life situations Richard Williams's approach with recode is most suitable: it is brief, readable, and understandable. Coming back to the program after a few years will not be a problem.
                However, it is hardly the most efficient approach in terms of performance.

                I have looked at a few alternatives (recode is #1, Stata 13.0 on Windows):

                   1:     49.27 /       10 =       4.9273
                   2:      9.53 /       10 =       0.9531
                   3:      7.23 /       10 =       0.7228
                   4:      7.09 /       10 =       0.7092
                   5:      6.60 /       10 =       0.6600
                   6:      6.34 /       10 =       0.6345
                   7:      4.91 /       10 =       0.4912
                   8:      5.04 /       10 =       0.5038
                   9:      5.11 /       10 =       0.5111
                  10:      4.90 /       10 =       0.4898
                I was surprised to see:
                • inlist function performing faster than the simple or operator;
                • floats faster than both doubles and bytes;
                • creating a new variable works faster than replacing values in existing one (perhaps the biggest shocker).
                Code is here:
                This was really done in a rush, so if there is any problem with the code, let me know.

                That said, the original approach of Vivian H Wang with "gen and replace" might not be as bad after all.
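
                The benchmark code itself is not reproduced in this thread. As a rough sketch only (not the code that produced the numbers above), a comparison of this kind could be set up with Stata's timer commands, whose output has the same "total / runs = average" layout as the table. The variable names x, r1, and r2, the data size, and the two methods compared are assumptions for illustration:

                ```stata
                * Simulated 5-point variable; size and names are illustrative assumptions
                clear
                set obs 1000000
                set seed 2024
                gen byte x = 1 + floor(5*runiform())

                timer clear
                forvalues i = 1/10 {
                    * Method 1: recode into a new variable
                    timer on 1
                    quietly recode x (1/2 = 1) (3 = 2) (4/5 = 3), gen(r1)
                    timer off 1

                    * Method 2: generate, then replace using inlist()
                    timer on 2
                    quietly gen byte r2 = 2
                    quietly replace r2 = 1 if inlist(x, 1, 2)
                    quietly replace r2 = 3 if inlist(x, 4, 5)
                    timer off 2

                    * Both methods must agree before the results are discarded
                    assert r1 == r2
                    drop r1 r2
                }
                timer list
                ```

                Timings will vary by machine, Stata version, and data size, so any ranking from such a sketch should be treated as indicative only.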

                Now I wouldn't be surprised if Bill Gould (StataCorp) came out with yet another solution that beats #10 tenfold on speed, but it would be most interesting to hear comments on the above paradoxes, since they all depend on the internal implementation of Stata, which is known only to the developers.

                Best, Sergiy Radyakin


                • #9
                  I know that many people look down on inefficient code such as recode. However, unless you've got a monstrous data set or will do this over and over, it often isn't worth the time to come up with something better. Mostly you want something that works. (Nonetheless, I often do waste my time on efficient code just because I get obsessed with taking on some challenge!)
                  Richard Williams, Notre Dame Dept of Sociology


                  • #10
                    Dear all, thanks a lot for your great inputs!
                    Sergiy: Thank you for your help! Since I'm very close to my deadline, I will try this later, when I have time to explore further. With my very basic understanding of Stata, I will need to do more research to fully comprehend your code :)

                    For the thesis I'm working on, I used several different ways to manage my variables, shown below in case they are of interest:
                    recode bmiiotfz ///
                        (-3/-1=1 "Underweight")(0=2 "Normal weight")(1=3 "Overweight") ///
                        (2=4 "Obese"), gen(wt_sta)
                    tab bmiiotfz, gen(bmidum)
                    gen ovob= (bmidum6+bmidum5)
                    gen ob = bmidum6
                    *central obesity*
                        gen whtr= wc/ht
                        egen centrob= cut(whtr), at(0,0.5,2) icodes
                    *b. SES 
                        *parental education 
                        label define ed_level ///
                            0 "Up to high school degree"     1 "college or higher" 
                        label values medu fedu ed_level
                    Any comments/advice are welcome.

                    Thank you!
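
                    As a hedged alternative to the tab, gen() step above, the two indicators might be created directly from bmiiotfz. This sketch assumes the coding implied by the recode rules in this post (integers from -3 to 2, with 1 = overweight and 2 = obese) and uses a small made-up dataset so that it runs on its own:

                    ```stata
                    * Toy data: values -3 to 2 plus one missing; not the real survey data
                    clear
                    set obs 7
                    gen bmiiotfz = _n - 4
                    replace bmiiotfz = . in 7

                    * Assumed coding: 1 = overweight, 2 = obese
                    * The if qualifier keeps missing BMI categories missing in the indicators
                    gen byte ovob = inlist(bmiiotfz, 1, 2) if !missing(bmiiotfz)
                    gen byte ob   = (bmiiotfz == 2)        if !missing(bmiiotfz)
                    list
                    ```

                    Analyses could then be restricted with a qualifier such as if ovob == 1, which leaves the full sample and the original categories in the dataset.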