  • Egen concat without scientific notation

    Dear Stata users,
    I am struggling with the following issue:

    I need to combine two personal IDs to create a unique dyadic-ID (an ID that uniquely identifies the relationship between person A and person B).
    I tried the command - egen concat -, and it seems to work. The problem is that, because my IDs are very long, the resulting dyadic-ID is not shown in full but in scientific notation with the letter e. So I end up with several dyadic-IDs with the same value, since their real number of digits is greater than the number of digits shown by Stata (at least, that is how I interpreted what happened).

    Do you have any idea how to solve this?

    Here is the data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(pidp apidp)
     476594335  476594339
    1156943847 1156943851
     748322335  748322339
    1428284927 1428284931
    1089950939 1089950943
    end
    Thanks a lot, G

  • #2
    Perhaps this will point you in a useful direction.
    Code:
    . generate id = trim(strofreal(pidp,"%10.0f")) + ":" + trim(strofreal(apidp,"%10.0f"))
    
    . list, clean
    
                 pidp        apidp                      id  
      1.    476594335    476594339     476594335:476594339  
      2.   1156943847   1156943851   1156943847:1156943851  
      3.    748322335    748322339     748322335:748322339  
      4.   1428284927   1428284931   1428284927:1428284931  
      5.   1089950939   1089950943   1089950939:1089950943
    Or, after reading the documentation for the egen function concat() in the output of help egen
    Code:
    . egen id = concat(pidp apidp), format(%10.0f) punct(:)
    
    . list, clean
    
                 pidp        apidp                      id  
      1.    476594335    476594339     476594335:476594339  
      2.   1156943847   1156943851   1156943847:1156943851  
      3.    748322335    748322339     748322335:748322339  
      4.   1428284927   1428284931   1428284927:1428284931  
      5.   1089950939   1089950943   1089950939:1089950943
    Last edited by William Lisowski; 30 Oct 2018, 08:47.



    • #3
      Dear William,
      thanks a lot for your answer.

      With your command I succeeded in obtaining a unique dyadic-ID (with all the digits), but something strange happens: when I try to use this new variable for some analysis (for instance, calculating the mean of X over id) Stata says: "no observations".

      Actually, when I run -sum id- Stata reports 0 cases: basically it is as if this variable is not recognized by Stata.

      Do you have any idea why?

      Thanks again, g



      • #4
        egen, concat() creates string variables, so no go there for summarize. In this context, "no observations" really means that the variable doesn't contain relevant data.

        In any case how would you (could you) possibly average values like "476594335:476594339" and "1156943847:1156943851"? An average identifier is not usually helpful, but an average string identifier isn't even defined.





        • #5
          Again, the documentation for the egen function concat() in the output of help egen points the way, telling us it

          concatenates varlist to produce a string variable.
          Both commands in post #2 produce string variables. You cannot concatenate two 10-digit numbers into a 20-digit number stored in a numeric variable, because no numeric variable can store all 20 digits of a 20-digit number.
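
          To make the precision limit concrete, here is a small sketch (not from the original posts), assuming Stata's usual double-precision arithmetic. A double represents every integer exactly only up to 2^53 = 9,007,199,254,740,992, which has 16 digits, so a 20-digit concatenation cannot keep all of its digits.

          ```stata
          * 2^53 is the last point up to which a double holds every integer exactly
          display %20.0f 2^53

          * 2^53 + 1 is not representable, so this prints the same value as the line above
          display %20.0f 2^53 + 1

          * Gluing two 10-digit IDs from the example into one number goes far past 2^53,
          * so the low-order digits of the displayed result cannot all be right
          display %21.0f 1156943847 * 1e10 + 1156943851
          ```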

          The summarize command does not summarize character variables. I'm not sure what the mean of values like "476594335:476594339" might be. The describe and codebook commands will give some basic information about any variable, be it string or numeric.

          Here is a technique that uses encode to generate a numeric variable whose values correspond to the string values and whose value labels reproduce the string variable values.
          Code:
          . encode id, generate(idnum)
          
          . list, clean
          
                       pidp        apidp                      id                   idnum  
            1.    476594335    476594339     476594335:476594339     476594335:476594339  
            2.   1156943847   1156943851   1156943847:1156943851   1156943847:1156943851  
            3.    748322335    748322339     748322335:748322339     748322335:748322339  
            4.   1428284927   1428284931   1428284927:1428284931   1428284927:1428284931  
            5.   1089950939   1089950943   1089950939:1089950943   1089950939:1089950943  
          
          . list, clean nolabel
          
                       pidp        apidp                      id   idnum  
            1.    476594335    476594339     476594335:476594339       4  
            2.   1156943847   1156943851   1156943847:1156943851       2  
            3.    748322335    748322339     748322335:748322339       5  
            4.   1428284927   1428284931   1428284927:1428284931       3  
            5.   1089950939   1089950943   1089950939:1089950943       1  
          
          . generate x = runiform()
          
          . mean x, over(idnum)
          
          Mean estimation                   Number of obs   =          5
          
              _subpop_1: idnum = 1089950939:1089950943
              _subpop_2: idnum = 1156943847:1156943851
              _subpop_3: idnum = 1428284927:1428284931
              _subpop_4: idnum = 476594335:476594339
              _subpop_5: idnum = 748322335:748322339
          
          --------------------------------------------------------------
                  Over |       Mean   Std. Err.     [95% Conf. Interval]
          -------------+------------------------------------------------
          x            |
             _subpop_1 |   .6866152          .             .           .
             _subpop_2 |   .1196613          .             .           .
             _subpop_3 |   .6950234          .             .           .
             _subpop_4 |   .3913819          .             .           .
             _subpop_5 |   .7542434          .             .           .
          --------------------------------------------------------------



          • #6
            Dear Nick, William, thanks for your answers.

            Regarding #4, of course I do not want to average the IDs, but to look at the average of a variable of interest (say, health) in each dyad.
            If ID is a string, I cannot do this.

            Regarding #5, I tried - encode id, gen(idnum) - but Stata does not proceed, saying - too many values -.

            I also tried - destring id, gen(idnum) -, but this does not work either: Stata says - id contains nonnumeric characters; no generate -.

            Basically, I need Stata to recognize each dyadic-ID as a case on which I can operate (of course, not numerically, but using it as the unit of analysis).

            Is that possible anyway?

            Thanks a lot, G



            • #7
              If your goal is to generate a unique identifier, simply use

              Code:
              egen newid= group(pidp apidp)



              • #8
                You asked

                Actually, when I run -sum id- Stata reports 0 cases: basically it is as if this variable is not recognized by Stata.

                Do you have any idea why?
                and so we answered the question!

                I don't see what other command you are using to get means; you don't give exact syntax. The mean command will object to a string variable as a group variable, but as you have more than 65,536 distinct identifiers, you would be unlikely to find a table of that many means practical. You can go

                Code:
                egen mean = mean(whatever), by(id)
                and then the means will be in a new variable. In practice, looking at many thousand means may be impractical too, but be sure to avoid multiple counting, so that

                Code:
                egen tag = tag(id) 
                su mean if tag 
                histogram mean if tag
                would be the kind of syntax you could use.



                • #9
                  The full documentation for encode (found in the Stata Data Management Reference Manual PDF included with your Stata installation and accessible through Stata's Help menu) tells us that the limit on the number of values is 65,536. Apparently the number of distinct dyads in your dataset is greater than that.

                  The destring command has nothing to offer you: as I wrote earlier, you cannot accurately store 20 digits of dyad identifier in a numeric value.

                  It seems to me your example of what you need the dyads for is unrealistic: it would take a while to review values for the mean of health in 65,536 different dyads.

                  If the actual dyad identifiers are unimportant, you can use the egen function group() to create numeric values lacking value labels.
                  Code:
                  . egen long idnum = group(pidp apidp)
                  
                  . list, clean
                  
                               pidp        apidp   idnum  
                    1.    476594335    476594339       1  
                    2.   1156943847   1156943851       4  
                    3.    748322335    748322339       2  
                    4.   1428284927   1428284931       5  
                    5.   1089950939   1089950943       3  
                  
                  . generate x = runiform()
                  
                  . mean x, over(idnum)
                  
                  Mean estimation                   Number of obs   =          5
                  
                              1: idnum = 1
                              2: idnum = 2
                              3: idnum = 3
                              4: idnum = 4
                              5: idnum = 5
                  
                  --------------------------------------------------------------
                          Over |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                  x            |
                             1 |   .3913819          .             .           .
                             2 |   .7542434          .             .           .
                             3 |   .6866152          .             .           .
                             4 |   .1196613          .             .           .
                             5 |   .6950234          .             .           .
                  --------------------------------------------------------------
                  But to get better advice on this problem, you should provide concrete examples of the sorts of uses you hope to put these dyads to: examples that make sense in the context of 65,000+ dyads. Perhaps there are better approaches.



                  • #10
                    Dear all, thanks for your answers, but I did not get the point yet.

                    What I would like to obtain is a unique dyadic-ID that identifies each pair of persons across my waves (I am using panel data over 7 waves). The command - egen newid= group(pidp apidp) - works, but if I use it I cannot trace the same couple in different waves (each person of course has the same ID over waves, so by combining two persons' IDs in a dyadic-ID I can trace the dyad along the waves).

                    I am not specifically interested in calculating means or anything in particular; mine was only an example to show that I would like to have this dyadic-ID as the unit of analysis, and then play around with it.

                    Sorry if I was not clear enough.



                    • #11
                      The command - egen newid= group(pidp apidp) - works, but if I use it I cannot trace the same couple in different waves
                      You can use labmask (Stata Journal; Nick Cox) to label the variable generated by egen's group() function with the string variable that you got from William's code.

                      Code:
                      labmask newid, values(id)
                      The value labels should allow you to identify group combinations.



                      • #12
                        In response to post #11, since labmask is creating a value label, it too will be subject to the limit of 65,536 distinct values that caused encode to fail. See help limits to see that this is an overall property of value labels, as well as a limit to encode.
                        Last edited by William Lisowski; 30 Oct 2018, 12:34.



                        • #13
                          In response to post #10, two comments.

                          1) If your seven waves of panels are in different datasets, you will need to append them into a single dataset in order to ensure that the same numeric ID is assigned to a given dyad across each of the 7 waves. Neither encode nor egen group() will necessarily create the same number in different datasets, and there is no way to combine two 10-digit numbers into a 20-digit number.

                          2) In the absence of

                          examples of the sorts of uses you hope to put these dyads to
                          as I recommended in post #9, there is little further advice to offer.

                          You assert what you want, with no realistic examples of actual commands that you feel they are required for.

                          I assert that there are approaches to accomplishing what you need in other ways, but lacking examples of what you need, cannot proceed further.
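
                          As an illustration of point 1), here is a minimal sketch, assuming the waves have already been appended into one dataset with a wave variable called year. The four rows are toy data made up for the sketch, and the name dyad is hypothetical. Because egen's group() function is run once on the combined data, each (pidp, apidp) pair gets the same numeric code in every wave, and that code can then be passed to xtset.

                          ```stata
                          * Toy appended data: two dyads, each observed in two waves (illustrative values)
                          clear
                          input long(pidp apidp) int year
                          68006135 68006127 2009
                          68006139 68006127 2009
                          68006135 68006127 2010
                          68006139 68006127 2010
                          end

                          * One numeric code per distinct (pidp, apidp) pair, constant across waves
                          egen long dyad = group(pidp apidp)

                          * The numeric code can serve as the panel identifier
                          xtset dyad year
                          ```

                          The codes themselves are arbitrary; pidp and apidp stay in the dataset, so the two persons behind any dyad code can always be looked up.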



                          • #14
                            In response to post #11, since labmask is creating a value label, it too will be subject to the limit of 65,536 distinct values that caused encode to fail. See help limits to see that this is an overall property of value labels, as well as a limit to encode.
                            You are correct, William Lisowski. I agree that the best approach will be to append the data sets and use egen's group() function to create a unique identifier. However, I highly doubt that 20-digit precision is needed in this context. I think you can lose the last 4 digits in each of your identifiers and still create a unique couple identifier.

                            Code:
                            gen double newid=real(substr(string(pidp, "%10.0f") , 1, strlen(string(pidp, "%10.0f")) - 4) + substr(string(apidp, "%10.0f") , 1, strlen(string(apidp, "%10.0f")) - 4))
                            format newid %12.0f
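
                            Whether the shortened identifier really stays unique depends on the data, so it may be worth checking. The following is only a sketch of one way to do that, using the example data from post #1: it rebuilds newid in two steps (the helper variables p, a and g are hypothetical names) and compares it against egen's group(), which is exact by construction.

                            ```stata
                            * Example data from post #1
                            clear
                            input long(pidp apidp)
                             476594335  476594339
                            1156943847 1156943851
                             748322335  748322339
                            1428284927 1428284931
                            1089950939 1089950943
                            end

                            * Shortened identifier: drop the last 4 digits of each ID, then join
                            generate p = string(pidp, "%10.0f")
                            generate a = string(apidp, "%10.0f")
                            generate double newid = real(substr(p, 1, strlen(p) - 4) + substr(a, 1, strlen(a) - 4))
                            format newid %12.0f

                            * Reference identifier: one code per distinct (pidp, apidp) pair
                            egen long g = group(pidp apidp)

                            * Each newid should belong to exactly one pair; assert stops with an error if not
                            bysort newid (g): assert g[1] == g[_N]
                            ```

                            If the assert fails on the real data, two different dyads share a shortened identifier and fewer digits should be dropped.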



                            • #15
                              Dear all,
                              thanks for all your responses.

                              Let me try to explain more clearly: I appended 7 waves of panel data, and I would like to do some panel analyses with the dyadic-ID as the unit of analysis.

                              I used the command - generate id = trim(strofreal(pidp,"%10.0f")) + ":" + trim(strofreal(apidp,"%10.0f")) - to create a unique dyadic-ID across waves, and it seems to work.
                              The problem is that when I give Stata the command - xtset id year - it replies: varlist: id: string variable not allowed.

                              How should I create a unique, non-string dyadic-ID that I can use as the unit of analysis in a panel?
                              Here is my data:
                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input str21 id float year
                              "68006135:68006127" 2009
                              "68006139:68006127" 2009
                              "68006139:68006131"    .
                              "68006135:68006131" 2009
                              "68007495:68007487" 2009
                              end

                              Thanks, G

