
  • Creating a categorical variable from multiple continuous variables

    Hi everyone,

    New Stata user and first-time poster here. I'm trying to create a new categorical variable, catvar, based on 4 continuous variables: avar, bvar, cvar, and dvar.

    Basically, catvar = 0 if avar, bvar, cvar, and dvar are all < 70;

    catvar = 1 if exactly one of the four continuous variables > 70;

    catvar = 2 if exactly two of the four are > 70 and within 10 units of one another;

    catvar = 3 if exactly three of the four are > 70 and within 10 units of one another; and

    catvar = 4 if all four variables are > 70 and within 10 units of one another.

    I'm having some trouble getting the logic/operators to work, and would appreciate any/all tips, readings, and/or suggestions -- thanks so much!

  • #2
    What do you want to do if more than one of avar, bvar, cvar, and dvar are > 70 but they are not within 10 units of each other? And what do you do if all four are = 70?



    • #3
      Hi Clyde!

      If all four = 70, then catvar should = 4.

      If avar, bvar, cvar, and dvar are all > 70 but *not* within 10 units of one another, then it depends on how many are within the appropriate range. For example:

      avar = 89, bvar = 83, cvar = 90, dvar = 79

      There is an 11-unit difference between the lowest value (dvar = 79) and the highest value (cvar = 90), so I wouldn't assign this entry a four. However, avar, bvar, and cvar are all above 70 and within 10 units of each other, so this entry would be assigned a 3.

      Thanks so much!



      • #4
        Sanjana, As Clyde implied, your definition of the categorical variable may be incomplete. Based on your further description in #3, please double-check if the following is what you want.

        1. The four variables are all < 70 --> catvar = 0

        2. Exactly one variable >= 70 --> catvar = 1

        3. Exactly two variables >= 70
        3.1 Distance of the two <= 10 --> catvar = 2
        3.2 Distance of the two > 10 --> catvar = 1?

        4. Exactly three variables >= 70
        4.1 Distance of all pairs (three pairs in total) from the three <= 10 --> catvar = 3
        4.2 If 4.1 doesn't hold, should catvar be the largest number of those three variables that are all within 10 units of one another? For example, for 71, 76, 83, 50, catvar = 2? For 71, 76, 90, 50, catvar = 2 again? For 71, 82, 93, 50, catvar = 1?

        5. The four variables are all >= 70
        5.1 Distance of all pairs (six pairs in total) from the four <= 10 --> catvar = 4
        5.2 If 5.1 doesn't hold, should catvar be the largest number of those four variables that are all within 10 units of one another, as in 4? In your example in #3, not only avar, bvar, cvar but also avar, bvar, dvar satisfy the condition. And if avar = 89, bvar = 83, cvar = 100 and dvar = 71, is catvar = 2?
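
        The cases raised in #3 and in this post can be collected into a small test dataset to check any proposed code against. This is only a sketch: the variable name expected is hypothetical, its values assume every question above is answered "yes" (as #5 later confirms), and the last three rows are made-up values illustrating all four < 70, exactly one >= 70, and two >= 70 that are more than 10 apart.

        ```stata
        * Test cases for the catvar rule; "expected" is a hypothetical helper variable
        clear
        input float(avar bvar cvar dvar) byte expected
        70 70  70 70 4
        89 83  90 79 3
        71 76  83 50 2
        71 76  90 50 2
        71 82  93 50 1
        89 83 100 71 2
        50 60  65 69 0
        72 50  60 65 1
        72 95  50 60 1
        end
        list, noobs

        * After creating catvar with the code under test, this should run silently:
        * assert catvar == expected
        ```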
        Last edited by Fei Wang; 05 Nov 2021, 20:13.



        • #5
          Hi, Fei - yes to everything you wrote in #4. Thank you for presenting it so clearly! Any suggestions for what commands I should try?

          Thanks again!



          • #6
            With the clarification from Fei Wang I propose
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear*
            input float(avar bvar cvar dvar)
            53 53 65 98
            61 75 64 94
            55 67 85 58
            98 62 85 80
            57 69 69 77
            80 70 66 56
            58 50 94 79
            98 90 56 62
            52 50 74 84
            54 75 70 52
            63 85 59 71
            92 81 79 78
            62 96 81 95
            69 93 52 54
            55 71 72 57
            92 90 98 63
            90 75 99 82
            51 57 76 70
            63 78 57 87
            99 79 75 73
            50 75 95 96
            86 56 86 97
            91 98 56 98
            84 61 73 66
            75 59 89 54
            72 93 85 99
            96 70 80 56
            87 88 51 90
            64 73 77 73
            98 69 92 61
            70 60 80 75
            72 70 66 96
            70 92 78 78
            53 67 70 77
            94 77 55 72
            94 59 51 90
            51 98 56 59
            96 84 57 54
            60 62 100 81
            65 77 54 83
            69 51 82 84
            56 97 75 76
            63 55 73 51
            87 98 88 68
            87 90 50 85
            91 92 74 75
            77 61 58 84
            95 68 93 85
            54 85 50 85
            80 87 85 56
            end
            
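            * Tag each original observation so the result can be linked back later
            gen long obs_no = _n
            * Work on a copy in a separate frame; the default frame stays intact
            frame put _all, into(working)
            frame change working
            * Go to long layout: one row per observation-variable pair
            rename *var var#, addnumber
            reshape long var, i(obs_no)
            * Keep only nonmissing values >= 70, sorted within observation
            keep if var >= 70 & !missing(var)
            sort obs_no var
            gen catvar = .
            * Within obs_no, var[k] is the k-th smallest kept value. A subscript past
            * the end of the group gives missing, and min() ignores missing arguments.
            by obs_no: replace catvar = 4 if var[4] - var[1] <= 10
            by obs_no: replace catvar = 3 if missing(catvar) ///
            & min(var[3]-var[1], var[4]-var[2]) <= 10
            by obs_no: replace catvar = 2 if missing(catvar) ///
            & min(var[2]-var[1], var[3]-var[2], var[4]-var[3]) <= 10
            by obs_no: replace catvar = 1 if missing(catvar)
            * Reduce to one row per observation and bring catvar back to the default frame
            keep obs_no catvar
            by obs_no: keep if _n == 1
            frame change default
            frlink 1:1 obs_no, frame(working)
            frget catvar, from(working)
            * Observations with no value >= 70 have no match in the working frame
            replace catvar = 0 if missing(working)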
            The data were made up with random integers between 50 and 100. As is so common, this task, difficult with the data in wide layout, is not hard in long layout. Note also the usefulness of frames here: by putting the data in a new frame, we can remove the numbers < 70, rather than having to complicate the code to exclude them from the counts, and yet, in the original frame, all of the original information is preserved.
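
            Frames were introduced in Stata 16. For earlier versions, the same long-layout logic can be sketched with preserve, a temporary file, and merge instead. This is a hedged variant, not part of the original answer: it assumes avar, bvar, cvar, and dvar are in memory, that obs_no and catvar do not exist yet, and that the block is run as a whole from a do-file (the tempfile name results is a local macro and is hypothetical).

            ```stata
            * Variant without frames: preserve + tempfile + merge
            gen long obs_no = _n
            preserve
            keep obs_no avar bvar cvar dvar
            rename (avar bvar cvar dvar) (var1 var2 var3 var4)
            reshape long var, i(obs_no)
            keep if var >= 70 & !missing(var)
            sort obs_no var
            gen catvar = .
            by obs_no: replace catvar = 4 if var[4] - var[1] <= 10
            by obs_no: replace catvar = 3 if missing(catvar) ///
                & min(var[3]-var[1], var[4]-var[2]) <= 10
            by obs_no: replace catvar = 2 if missing(catvar) ///
                & min(var[2]-var[1], var[3]-var[2], var[4]-var[3]) <= 10
            by obs_no: replace catvar = 1 if missing(catvar)
            by obs_no: keep if _n == 1
            keep obs_no catvar
            tempfile results
            save `results', emptyok    // emptyok: allow the case where no value is >= 70
            restore
            // Observations with no value >= 70 are unmatched and get catvar = 0
            merge 1:1 obs_no using `results', nogenerate
            replace catvar = 0 if missing(catvar)
            ```

            The original variables are untouched after restore, just as they are in the default frame in the frames version.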

            In the future, when asking for code, please provide example data, using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

            When asking for help with code, always provide sample data. When providing sample data, always use -dataex-.
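
            As a minimal illustration of the advice above, the following would produce a pasteable example of the first ten observations of the four variables in this thread. The choice of ten observations is arbitrary.

            ```stata
            * Read the instructions once
            help dataex
            * Then export a small example and paste the Results-window output into the post
            dataex avar bvar cvar dvar in 1/10
            ```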

            Added: Code modified to properly handle missing values.
            Last edited by Clyde Schechter; 05 Nov 2021, 22:44.



            • #7
              EDIT: Crossed with Clyde's response in #6

              I don't know if this helps, but here are some ideas that might help you get started. I created some toy data.
              You will need to install egenmore (ssc install egenmore) if you don't already have it. It adds additional functionality to the egen command.



              Code:
              * Example generated by -dataex-. For more info, type help dataex
              * (produced by typing: dataex id avar bvar cvar dvar)
              clear
              input byte id int avar byte(bvar cvar dvar)
               1  89 83 90 79
               2  61 89 92 60
               3 101 86 68 77
               4  74 79 60 65
               5  64 95 69 94
               6  60 69 62 65
               7  84 69 83 76
               8  75 90 70 69
               9  66 67 86 91
              10  88 64 91 83
              end
              
              
              . list, noobs
              
                +--------------------------------+
                | id   avar   bvar   cvar   dvar |
                |--------------------------------|
                |  1     89     83     90     79 |
                |  2     61     89     92     60 |
                |  3    101     86     68     77 |
                |  4     74     79     60     65 |
                |  5     64     95     69     94 |
                |--------------------------------|
                |  6     60     69     62     65 |
                |  7     84     69     83     76 |
                |  8     75     90     70     69 |
                |  9     66     67     86     91 |
                | 10     88     64     91     83 |
                +--------------------------------+
              
              ssc install egenmore  // if not already installed
              egen below70 = rcount( avar bvar cvar dvar), cond(@ < 70)    //  counts how many in the varlist are less than 70
              //  @ is what it uses as the abbrev for the variables.  As it loops over the variables, @ gets replaced with avar, bvar, cvar, and dvar
              egen rmin = rowmin(avar bvar cvar dvar)
              egen rmax = rowmax(avar bvar cvar dvar)
              gen range = rmax - rmin
              
              egen rstd_dev = rowsd(avar bvar cvar dvar)
              format rstd_dev %8.1fc
              
              egen within10 = rcount(bvar cvar dvar), cond(abs(avar - @) <= 10)   // counts how many are within 10 units of avar  (abs is absolute value)
              order within10, after(below70)  // just places the within10 variable after below70
              
              list, noobs
              
                +-------------------------------------------------------------------------------------+
                | id   avar   bvar   cvar   dvar    below70   within10   rmin   rmax   range   rstd_dev |
                |-------------------------------------------------------------------------------------|
                |  1     89     83     90     79        0          3     79     90      11        5.2 |
                |  2     61     89     92     60        2          1     60     92      32       17.4 |
                |  3    101     86     68     77        1          0     68    101      33       14.1 |
                |  4     74     79     60     65        2          2     60     79      19        8.6 |
                |  5     64     95     69     94        2          1     64     95      31       16.3 |
                |-------------------------------------------------------------------------------------|
                |  6     60     69     62     65        4          3     60     69       9        3.9 |
                |  7     84     69     83     76        1          2     69     84      15        7.0 |
                |  8     75     90     70     69        1          2     69     90      21        9.7 |
                |  9     66     67     86     91        2          1     66     91      25       12.9 |
                | 10     88     64     91     83        1          2     64     91      27       12.1 |
                +-------------------------------------------------------------------------------------+
              This gets you part of the way. You now have the number of variables that are < 70. Note that the within10 variable counts how many of bvar, cvar, and dvar are within 10 units of avar, whether or not they are >= 70. Because avar isn't compared with itself, the maximum here is 3.
              You could equally count how many are within 10 units of bvar, rmin, or rmax.
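
              One way to carry the "within 10 units of an anchor" idea through to catvar, offered as a sketch rather than as the method used in this post: treat each variable in turn as the smallest member of a group, count how many values lie between it and itself plus 10, and keep the largest count. This assumes the rule agreed in #4 and #5 (values >= 70 count, and catvar is the largest group whose members are all within 10 units of one another). The names catvar_wide and n_near are hypothetical.

              ```stata
              * Wide-layout sketch: no reshape and no egenmore needed
              gen byte catvar_wide = 0
              foreach v of varlist avar bvar cvar dvar {
                  // The anchor must itself be a nonmissing value >= 70
                  gen byte n_near = 0
                  foreach w of varlist avar bvar cvar dvar {
                      // Count values in [anchor, anchor + 10]; the anchor counts itself
                      replace n_near = n_near + 1 if `v' >= 70 & !missing(`v', `w') ///
                          & inrange(`w', `v', `v' + 10)
                  }
                  // Keep the largest group found so far
                  replace catvar_wide = max(catvar_wide, n_near)
                  drop n_near
              }
              list id avar bvar cvar dvar catvar_wide, noobs
              ```

              For id 1 in the toy data (89, 83, 90, 79) this gives 3, matching the answer in #3: anchoring on 79 picks up 79, 83, and 89, while 90 is 11 units away.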
              Last edited by David Benson; 05 Nov 2021, 23:13.



              • #8
                This is incredibly helpful -- thank you all for your help!

