Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • egen numeric variable

    Dear Stata users,

    Please take a look at this problem: I wish to generate a unique identifier for a panel data set. The id is supposed to be composed of values drawn from three existing variables. Therefore I used this code: egen [float] ID = concat(var1 var2 var3)

    The result is almost correct. Unfortunately, the new variable ID is created as a string variable, despite the float specification in my code. Moreover, the values all look like this: 6.70e+071020 which makes it impossible to use destring as this format contains non-numeric characters. I have tried other type specifications but none of them work.

    Does anyone know how to do in order to get an output that looks like this: 670007181020 ?

    Thank you!

  • #2
    Code:
    egen ID = concat(var1 var2 var3), format("%13.0f")
    See

    Code:
    help format

    EDIT: To convert to numeric, it's better to use the -group()- function of egen. This will be more efficient than directly destringing.

    Code:
    egen NID= group(ID), lab
    Last edited by Andrew Musau; 22 Apr 2020, 03:01.

    Comment


    • #3
      https://www.stata-journal.com/sjpdf....iclenum=dm0034 will be found to support Andrew Musau's advice.

      The help documents that concat() produces a string variable and as the original author (about 1998) I endorse that. In fact looking at the code will show that type declarations are ignored.

      Comment


      • #4
        Thank you both for your advice!

        Although the type is now float, unfortunately the group function does not seem to solve the format issue: variable contents include e+07 rather than the correct numbers as from the original variable var1 of the concat argument. It seems that this part of information is lost when the concat function is applied.

        Perhaps there is another function to compose a numeric variable from multiple other variables that I do not know yet. I will see what I can find and thank you much for your help!


        Comment


        • #5
          There is no data example here to focus minds. Please note https://www.statalist.org/forums/help#stata

          In mathematics abstraction is often the key to making progress but in statistics concreteness is often helpful to make plain what you are trying to do.

          var1 var2 var3 is too abstract to make a problem clear but an example like 670007181020 does help a little.

          Are the original variables numeric or string? Why do you want a numeric result? How many distinct values are there for the combination of three variables?

          Somewhat contrary to advice given here you have produced a float somehow -- with code you do not show us -- but what experiment will make clear is that

          1. There are not enough bits in a float to hold numbers like 670007181020 accurately.

          2. The default display format won't help. You need to specify the display format too.


          Code:
          .  clear
          
          . set obs 3
          number of observations (_N) was 0, now 3
          
          . gen whatever = 670007181020 + _n
          
          . l
          
               +----------+
               | whatever |
               |----------|
            1. | 6.70e+11 |
            2. | 6.70e+11 |
            3. | 6.70e+11 |
               +----------+
          
          . format %12.0f whatever
          
          . l
          
               +--------------+
               |     whatever |
               |--------------|
            1. | 670007164928 |
            2. | 670007164928 |
            3. | 670007164928 |
               +--------------+
          
          . gen somethingelse = 670007181020
          
          . format somethingelse %12.0f
          
          . l
          
               +-----------------------------+
               |     whatever   somethinge~e |
               |-----------------------------|
            1. | 670007164928   670007164928 |
            2. | 670007164928   670007164928 |
            3. | 670007164928   670007164928 |
               +-----------------------------+
          
          . gen double betteridea = 670007181020
          
          . format betteridea %12.0f
          
          . list
          
               +--------------------------------------------+
               |     whatever   somethinge~e     betteridea |
               |--------------------------------------------|
            1. | 670007164928   670007164928   670007181020 |
            2. | 670007164928   670007164928   670007181020 |
            3. | 670007164928   670007164928   670007181020 |
               +--------------------------------------------+

          For numeric identifiers, I still favour group() with labels and there should be automatic promotion to whatever storage type is needed.

          Comment


          • #6
            Very sorry for the lack of explanation, I thought is was just a matter of choosing the right function. I have tried providing more space via format but it did not resolve the e+07 problem. If float is not appropriate I do not mind it being double.

            I need a repeated identifier for my panel that repeats every year, therefore a random id generation does not help. The result needs to be numeric for in order to declare the panel via xtset.

            For now I have var1 over 8 digits, ex. 67000718, var2 over up to 2 digits, ex. 36 and var3 over up to 2 digits, ex. 2. However, var 1 appears several times per year as does var 2 and var 3 and the observation can only be distinguished by the combination of var1, var2 and var3, which is a unique combination per year.

            So ideally I would like to obtain this result:

            var1 var2 var3 ID
            35046150 30 1 350461503001
            35046150 31 1 350461503101
            67000718 1 2 670007180102

            code:

            egen ID= group(var1 var2 var3), lab
            recast double ID
            format %22.0g ID



            Last edited by Anna Holm; 23 Apr 2020, 04:16.

            Comment


            • #7
              Does the value "350461503001" mean anything to you? If you are simply interested in a unique identifier, the replies emphasize that

              Code:
              egen ID= group(var1 var2 var3)
              is sufficient. You do not need any labels. If on the other hand the values mean something, you can get the unique identifier as above with labels

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double(var1 var2 var3)
              35046150 30 1
              35046150 31 1
              67000718  1 2
              end
              
              egen SID= concat(var1 var2 var3), format("%12.0f")
              egen ID= group(SID), lab
              Finally, it is also possible to use egen concat + destring, but you will be holding the ID variable as double when there is no need to do that.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double(var1 var2 var3)
              35046150 30 1
              35046150 31 1
              67000718  1 2
              end
              
              
              egen SID= concat(var1 var2 var3), format("%12.0f")
              gen double ID = real(SID)
              format ID %12.0f

              Comment


              • #8
                Thank you, your second version of the code works perfectly well!

                Comment


                • #9
                  Thanks for the extra detail, which helps a lot, although you please really should read and act on the link in #5. I still have no idea whether these variables are numeric or string.

                  Your example shows up a problem with egen, group() and label that I had not realised. It does mangle the labels it produces if any argument is a large number. That's fixable by StataCorp but there is, or could be, an immediate work-around.

                  Your desire to use xtset which drives the need for a numeric identifier is perfectly consistent with using group() otherwise. Consider this combination.

                  Code:
                  clear
                  input long var1 var2 var3  
                  35046150 30 1
                  35046150 31 1
                  67000718 1 2
                  end
                  
                  egen ID = group(var1 var2 var3)
                  
                  list
                  
                       +-----------------------------+
                       |     var1   var2   var3   ID |
                       |-----------------------------|
                    1. | 35046150     30      1    1 |
                    2. | 35046150     31      1    2 |
                    3. | 67000718      1      2    3 |
                       +-----------------------------+
                  
                  egen label = concat(var1 var2 var3), format(%13.0f) p(" ")
                  
                  labmask ID, values(label)
                  
                  list
                  
                       +--------------------------------------------------------+
                       |     var1   var2   var3              ID           label |
                       |--------------------------------------------------------|
                    1. | 35046150     30      1   35046150 30 1   35046150 30 1 |
                    2. | 35046150     31      1   35046150 31 1   35046150 31 1 |
                    3. | 67000718      1      2    67000718 1 2    67000718 1 2 |
                       +--------------------------------------------------------+

                  In essence,

                  1. group() alone is enough to produce a numeric identifier. xtset will be happy with that.

                  2. If you want that labelled, one approach is to produce string concatenations and then call up labmask to copy its values to value labels. You must install labmask before you can use it after

                  Code:
                  search labmask, sj
                  That won't work (in Stata 16) if you have more than 65,536 distinct values, in which case

                  3. You can always use the label variable produced in the code above, or the original identifiers directly, to see which panel is which.

                  Note that recast double is legal but only useful looking forwards to the future in extending the scope of a variable. recast double is futile retrospectively insofar as digits lost or mangled in previous calculations will never be restored.

                  NOTE: I spent a long time drafting this given other commitments, so #7 and #8 were not visible until I posted. The message is similar either way.
                  Last edited by Nick Cox; 23 Apr 2020, 05:55.

                  Comment


                  • #10
                    Thank you so much, Nick. I'm sorry for the incomplete information, all initial variables are numeric. And yes, the solution works perfectly! Many thanks again for your time.

                    Comment

                    Working...
                    X