  • How do you create two variables as one in a single dataset?

    Dear all,

    I would like to create a single variable from two existing ones in my dataset, taking into account (of course) the missing values in the two existing ones.

    How could I do this in the cleanest, most idiomatic way in Stata?

    The two variables I am interested in are each coded from 1 to 5, but they measure different things (one the Bachelor's degree, the other the Master's). Here is the code that creates them, in case it helps:

    Code:
    gen ba_hes_heu =  cond(inlist(1, bachelor_hes),1,        ///
                      cond(inlist(1, bachelor_heu),2,        ///
                      cond(inlist(2, bachelor_hes),3,        ///
                      cond(inlist(2, bachelor_heu),4,        ///
                      cond(inlist(3, bachelor_hes),5,.)))))
    *
    *
    * ba_hes_heu == 1 if HES-SO; == 2 if HEU without GE; ==3 if HESGE;
    * == 4 if UNIGE; == 5 if Others BA HES; "." otherwise
    *
    label define bache_hes_heu 1 "BA HES-SO hors GE" 2 "BA Autres HEU" ///
                               3 "BA HES-GE" 4 "BA HEU-GE" 5 "BA Autres HES"
    label value                ba_hes_heu bache_hes_heu 
    label var                    ba_hes_heu "Bachelor HES/HEU Confondus"
    And also:
    Code:
    gen ma_hes_heu =  cond(inlist(1, master_hes),1,            ///
                      cond(inlist(1, master_heu),2,            ///
                      cond(inlist(2, master_hes),3,            ///
                      cond(inlist(2, master_heu),4,            ///
                      cond(inlist(3, master_hes),5,.)))))
    *
    *
    label define master_hes_heu 1 "MA HES-SO hors GE" 2 "MA Autres HEU" ///
                                3 "MA HES-GE" 4 "MA HEU-GE" 5 "MA Autres HES"
    label value                    ma_hes_heu master_hes_heu 
    label var                        ma_hes_heu "Master HES/HEU Confondus"
    * ma_hes_heu == 1 if HES-SO; == 2 if HEU without GE; == 3 if HESGE;
    * == 4 if UNIGE; == 5 if Others MA HES; "." otherwise
    Thanks in advance for your help.

    --
    Michael

  • #2
    The idea is to obtain a categorical variable, with the same numbering, regardless of the degree obtained (either Bachelor or Master).

    For example, the new variable new_var would be equal to 1 if the diploma was obtained at the HES-SO, regardless of whether it is a bachelor or master. Equal to 2 if the degree was obtained at the HEU without GE, regardless of whether it is a bachelor or master, and so on.

    I tried the following code, which is interesting, but gives me only values of 1 or 0:

    Code:
    egen new_var=  anycount(ba_hes_heu ma_hes_heu ), values(1 2 3 4 5)
    Thanks for your help.

    Michael
    Last edited by Michael Duarte Goncalves; 16 Nov 2022, 03:56.
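
    A small sketch of why anycount() yields only 0 or 1 here, using made-up data rather than the real dataset: anycount() counts how many of the listed variables take one of the given values, so with at most one non-missing variable per observation the count cannot exceed 1. The name n_match is hypothetical.

    ```stata
    * Toy data (an assumption): one degree variable filled in per row, or none
    clear
    input float(ba_hes_heu ma_hes_heu)
    1 .
    . 4
    . .
    end

    * Counts, per observation, how many of the two variables equal 1, 2, 3, 4 or 5
    egen byte n_match = anycount(ba_hes_heu ma_hes_heu), values(1 2 3 4 5)

    * n_match is a count of matching variables, not the category code itself
    list
    ```

    So anycount() answers "how many variables match?", whereas the goal in this thread is to carry over the category code.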

    • #3
      You are not doing well on providing data examples. See FAQ Advice #12. https://www.statalist.org/forums/help#stata

      Also, the code seems to presuppose that no one changes category between Bachelor's and Master's, as someone who did would be assigned different values under your two sets of rules.

      There are 10 different variables here and I think we need to know more to offer good advice. The question seems to be about combining 10 variables, not 2!

      • #4
        Nick Cox:
        I apologise for not putting in an example of data and for the misinterpretation.
        Thanks for the feedback and patience.

        Here is an example from my data:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ba_hes_heu ma_hes_heu diploma_all)
        1 . 1
        2 . 1
        3 . 1
        4 . 1
        5 . 1
        . 1 1
        . 2 1
        . 3 1
        . 4 1
        . 5 1
        end
        label values ba_hes_heu bache_hes_heu
        label def bache_hes_heu 1 "BA HES-SO hors GE", modify
        label def bache_hes_heu 2 "BA Autres HEU", modify
        label def bache_hes_heu 3 "BA HES-GE", modify
        label def bache_hes_heu 4 "BA HEU-GE", modify
        label def bache_hes_heu 5 "BA Autres HES", modify
        label values ma_hes_heu master_hes_heu
        label def master_hes_heu 1 "MA HES-SO hors GE", modify
        label def master_hes_heu 2 "MA Autres HEU", modify
        label def master_hes_heu 3 "MA HES-GE", modify
        label def master_hes_heu 4 "MA HEU-GE", modify
        label def master_hes_heu 5 "MA Autres HES", modify
        Last edited by Michael Duarte Goncalves; 16 Nov 2022, 06:09.

        • #5
          I think this would suffice:
          Code:
          egen byte new_var = rowmax(ba_hes_heu ma_hes_heu)
          which assumes that only one of the ba_hes_heu and ma_hes_heu variables is non-missing for any observation.
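
          A minimal sketch of how that assumption could be checked before combining, using toy data rather than the real dataset. The names n_nonmiss and hes_heu, and the degree-neutral label texts, are assumptions made for illustration.

          ```stata
          * Toy data (an assumption): at most one of the two variables is filled in
          clear
          input float(ba_hes_heu ma_hes_heu)
          1 .
          . 3
          5 .
          . .
          end

          * Count how many of the two variables are non-missing in each observation
          egen byte n_nonmiss = rownonmiss(ba_hes_heu ma_hes_heu)

          * Stops with an error if any observation has both variables filled in
          assert n_nonmiss <= 1

          * Safe to combine: rowmax() ignores missings and is missing only if both are
          egen byte new_var = rowmax(ba_hes_heu ma_hes_heu)

          * Hypothetical label set without the BA/MA prefixes
          label define hes_heu 1 "HES-SO hors GE" 2 "Autres HEU" 3 "HES-GE" ///
                               4 "HEU-GE" 5 "Autres HES"
          label values new_var hes_heu
          tabulate new_var, missing
          ```

          If the assert fails, the observations with n_nonmiss == 2 are the ones that need a decision about which degree should win.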

          • #6
            Hemanshu Kumar: You're absolutely right!

            It's exactly what I was looking for.

            I think it's a silly question, so please forgive me:
            • Why do you often use byte when generating a variable? Is it to reduce the storage the variable takes up, compared to long for example?
            • If so, would you advise using byte wherever possible when generating variables?
            Thanks a lot.
            --
            Michael

            P.S.: Nick Cox, I apologise again for not using dataex in the first place.
            Last edited by Michael Duarte Goncalves; 16 Nov 2022, 06:43.

            • #7
              Yes. Since the default numeric data type is float, specifying byte when it suffices for your needs saves some space. That is probably not important for small datasets, but it can make a difference when you have very large data. More often (in my usage), specifying a high enough data type (like double) is important to ensure the variable has the requisite precision (say, if it is an ID variable with a large number of digits, or if it stores times). I have started specifying the type almost every time I generate a numeric variable as a matter of habit, so that I don't forget to do it when it is important.
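
              A short sketch of the precision point, with a made-up nine-digit ID (the variable names are hypothetical). A float holds only about seven significant digits, so a value like this is rounded on storage, while long and double keep it exactly.

              ```stata
              clear
              set obs 1

              * Same nine-digit value stored in three different types
              generate float  id_float  = 123456789
              generate long   id_long   = 123456789
              generate double id_double = 123456789

              * Show all digits; id_float is stored as 123456792, the others as 123456789
              format id_* %12.0f
              list

              * compress demotes variables to smaller types only where no information is lost
              compress
              describe
              ```

              In the other direction, byte holds integers from -127 to 100, which is ample for a five-category code.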

              • #8
                Okay, great!

                Thanks for the comprehensive explanation!

                Michael

                • #9
                  (I thought I posted this earlier, but it remained in draft while I rushed off to a series of meetings. Now it is just a footnote to Hemanshu Kumar's helpful answer.)

                  Thanks for the detail. What happens if someone does a BA of one kind and an MA of another kind? Does this never happen? If it never happens, then

                  Code:
                  gen wanted = max(ba_hes_heu, ma_hes_heu)
                  So max() is a function to use. Or min() for that matter.

                  (Before there was egen, rowmax() there was max().)
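
                  A brief sketch of how max() treats missing values, and of one way to flag the BA/MA mismatch case raised above. The data are invented, and wanted2 and conflict are hypothetical names.

                  ```stata
                  * Toy data (an assumption): the third row has a BA and an MA of different kinds
                  clear
                  input float(ba_hes_heu ma_hes_heu)
                  1 .
                  . 3
                  2 4
                  . .
                  end

                  * max() ignores missing arguments and is missing only if all are missing
                  gen byte wanted = max(ba_hes_heu, ma_hes_heu)

                  * The egen equivalent gives the same result row by row
                  egen byte wanted2 = rowmax(ba_hes_heu ma_hes_heu)
                  assert wanted == wanted2

                  * Flag observations where both codes are present and disagree
                  gen byte conflict = !missing(ba_hes_heu, ma_hes_heu) & ba_hes_heu != ma_hes_heu
                  list
                  ```

                  Where conflict is 1, max() and min() give different answers, so the choice between them matters only for those observations.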

                  • #10
                    Thanks Nick Cox for the further details!

                    Michael
