  • Generating unique ID variable from numeric and categorical variables

    Hello all, I am new to the forum and I have a question about generating new IDs. Thanks for your patience as I get started.

    In my dataset, observations are for infrastructure projects of "infra" type (A = building, B = bridge, etc.) in "vID" villages. Each village has its own unique ID. My goal is to create a new variable with unique infrastructure IDs, based on "vID" and "infra". As shown below, a single village can have multiple infrastructure projects of the same type.

    Code:
    ----------------------------------------
    vID          duple  infra    desired
    ----------------------------------------
    1118010006     0      A     1118010006A1
    3203150004     0      A     3203150004A1
    6110020012     2      A     6110020012A1
    6110020012     2      A     6110020012A2
    1118010002     3      A     1118010002A1
    1118010002     3      A     1118010002A2
    1118010002     3      A     1118010002A3

    I believe one step in this process may involve concatenation, but I would first need to generate a running count of the projects of the same type in the same village. This is the tricky part for me. I have tried strategies such as the one in the code box below. However, this gives a value of 2 for both of two projects in the same village, whereas I want 1 for the first project and 2 for the second.

    Code:
    egen infraID = count(vID), by (vID)

    If concatenation is a good way forward, I also have trouble formatting the new string variable to display the full numeric segment. I would like "1118010006A1" displayed; instead, Stata displays something like "1.23e+09A1". I have had no luck digging through the format guides for categorical variables. For the record, I am using Stata 15.1 on a Mac.

    Many thanks for your attention!
    Last edited by Matthew Borden; 30 May 2019, 01:05.

  • #2
    This sounds like

    Code:
    bysort vID (infra) : gen digit = _n 
    egen wanted = concat(vID infra digit)
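
    If the counter should restart for each infrastructure type within a village, a possible variation is sketched here. It assumes that infra is a string variable (for a labelled numeric variable, the decode option of concat() may be needed) and that vID is held as a double. The rows are illustrative: the vID values come from #1, and the "B" row is hypothetical.

    ```stata
    clear
    * vID is read as double because some IDs exceed the largest value a long can hold
    input double vID str1 infra
    6110020012 "A"
    6110020012 "A"
    6110020012 "B"
    1118010006 "A"
    end
    format vID %10.0f

    * Restart the counter within each village-by-type group
    bysort vID infra: generate digit = _n

    * format() is needed because vID is numeric (see #5)
    egen wanted = concat(vID infra digit), format(%10.0f)

    list, sepby(vID) noobs
    ```

    With this grouping, the hypothetical "B" project gets its own counter starting at 1 rather than continuing the count of the "A" projects in that village.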


    • #3
      Nick, that does the trick nicely. Thank you. And do you happen to have any input on displaying the new string variable IDs as "1118010006A1" instead of "1.23e+09A1" ?


      • #4
        That shouldn't happen. As said, it's a string variable, and numeric display formats don't apply to strings.

        Code:
        . clear
        
        . set obs 1
        number of observations (_N) was 0, now 1
        
        . gen whatever = "1118010002A3"
        
        . list
        
             +--------------+
             |     whatever |
             |--------------|
          1. | 1118010002A3 |
             +--------------+


        • #5
          A minute reviewing the output of help egen for the concat() function shows us a format() option for that function.
          Code:
          . clear
          
          . set obs 1
          number of observations (_N) was 0, now 1
          
          . generate long a = 1118010006
          
          . format a %10.0f
          
          . generate str1 A = "A"
          
          . generate int  t = 3
          
          . egen notwanted = concat(a A t)
          
          . list, clean noobs
          
                       a   A   t    notwanted  
              1118010006   A   3   1.12e+09A3  
          
          . egen wanted = concat(a A t), format(%10.0f)
          
          . list, clean noobs
          
                       a   A   t    notwanted         wanted  
              1118010006   A   3   1.12e+09A3   1118010006A3
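
          An alternative that skips egen altogether, sketched under the same assumptions as the example above (the variable name wanted2 is made up for illustration): the string() function accepts a display format as its second argument, so the pieces can be joined with the + operator.

          ```stata
          clear
          set obs 1
          generate long a = 1118010006
          generate str1 A = "A"
          generate int  t = 3

          * An explicit %10.0f format writes out all ten digits of a
          generate wanted2 = string(a, "%10.0f") + A + string(t)

          list, clean noobs
          ```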


          • #6
            I see what's happened from William Lisowski's post: you didn't specify a format() option when calling egen, concat(). Sorry, I was forgetting that vID is a numeric variable, although that is implied by #1. I was unduly influenced by my own habit of holding such identifiers as strings.

            My fault, but it is also true that giving a data example in #1 using dataex would have allowed easy experiment and shown up this issue.


            • #7
              Thank you, William and Nick.

              I realized that my new ID values were displaying in the form of "1.23e+09A1" because I had generated [unique] vID values from component province, district, and sub-district variables, in numeric form. I came back to Stata and ran the code in #2 after converting vID to a string and I now have the ID values in the form of "1118010006A1". The code in #5 is good to keep in mind too.
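
              For anyone following the same route, here is a rough sketch of the string conversion, not necessarily the exact commands I ran. The name vID_str is made up for illustration, infra is assumed to be a string variable, and the rows are a few vID values from #1.

              ```stata
              clear
              input double vID str1 infra
              6110020012 "A"
              6110020012 "A"
              1118010002 "A"
              end

              * Convert the numeric ID to a string; format() keeps all ten digits
              tostring vID, generate(vID_str) format(%10.0f)

              * Number the projects within each village and type, then join the pieces
              bysort vID_str infra: generate digit = _n
              egen wanted = concat(vID_str infra digit)

              list vID_str infra wanted, sepby(vID_str) noobs
              ```

              Because the ID is already a string here, concat() only has to convert the small counter, so no format() option is needed.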
