Egen concat without scientific notation

Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#1


30 Oct 2018, 08:27

Dear Stata users,
I am struggling with the following issue:

I need to combine two personal IDs in order to create a unique dyadic ID (an ID that uniquely identifies a relationship between person A and person B).
I tried the command - egen concat -, and it seems to work. The problem is that, since my IDs are very long, the resulting dyadic ID is not shown at its full length but in scientific notation with the letter e. So I end up with several dyadic IDs that have the same value, since their real number of digits is higher than the number of digits shown by Stata (at least, that is how I interpreted what happened).

Do you have any idea how to solve this?

Here are the data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(pidp apidp)
 476594335  476594339
1156943847 1156943851
 748322335  748322339
1428284927 1428284931
1089950939 1089950943
end

Thanks a lot, G

William Lisowski

Join Date: Dec 2014
Posts: 10150

30 Oct 2018, 08:41

Perhaps this will point you in a useful direction.

Code:

. generate id = trim(strofreal(pidp,"%10.0f")) + ":" + trim(strofreal(apidp,"%10.0f"))

. list, clean

             pidp        apidp                      id  
  1.    476594335    476594339     476594335:476594339  
  2.   1156943847   1156943851   1156943847:1156943851  
  3.    748322335    748322339     748322335:748322339  
  4.   1428284927   1428284931   1428284927:1428284931  
  5.   1089950939   1089950943   1089950939:1089950943

Or, after reading the documentation for the egen function concat() in the output of help egen

Code:

. egen id = concat(pidp apidp), format(%10.0f) punct(:)

. list, clean

             pidp        apidp                      id  
  1.    476594335    476594339     476594335:476594339  
  2.   1156943847   1156943851   1156943847:1156943851  
  3.    748322335    748322339     748322335:748322339  
  4.   1428284927   1428284931   1428284927:1428284931  
  5.   1089950939   1089950943   1089950939:1089950943

Last edited by William Lisowski; 30 Oct 2018, 08:47.


Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#3

30 Oct 2018, 08:50

Dear William,
thanks a lot for your answer.

With your command I succeed in obtaining a unique dyadic ID (with all the digits), but something strange happens: when I try to use this new variable in an analysis (for instance, calculating the mean of X over id), Stata says: "no observations".

Actually, when I run -sum id- Stata signals 0 cases: basically it is as if this variable is not recognized by Stata.

Do you have any clue why?

Thanks again, g
Nick Cox

Join Date: Mar 2014

Posts: 35721
#4

30 Oct 2018, 09:13

egen, concat() creates string variables, so no go there for summarize. In this context, "no observations" really means that the variable doesn't contain relevant data.

In any case how would you (could you) possibly average values like "476594335:476594339" and "1156943847:1156943851"? An average identifier is not usually helpful, but an average string identifier isn't even defined.

William Lisowski

Join Date: Dec 2014
Posts: 10150

30 Oct 2018, 09:16

Again, the documentation for the egen function concat() in the output of help egen points the way, telling us it

concatenates varlist to produce a string variable.

Both commands in post #2 produce string variables. You cannot concatenate two 10-digit numbers into a 20-digit number stored in a numeric variable, because no numeric variable can store all 20 digits of a 20-digit number.
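As a rough illustration of that limit (a sketch only, not output from the original session; the variable name joined is made up for the example), the following builds the 20-digit value arithmetically from the first pair of IDs in post #2 and lists it. A double stores integers exactly only up to 2^53, which is about 9.0e15, so the low-order digits cannot be preserved.

```stata
* Sketch: a double holds integers exactly only up to 2^53 (about 9.0e15),
* so a 20-digit concatenation cannot be stored without loss.
clear
set obs 1
* "joined" is an illustrative name, not a variable from the thread
generate double joined = 1156943847 * 1e10 + 1156943851
format joined %21.0f
list joined
* Compare the listed value with the intended 11569438471156943851.

* The same limit seen directly: 2^53 + 1 is not representable as a double
display %21.0f 2^53
display %21.0f 2^53 + 1
```

Comparing the listed value digit by digit with the intended concatenation shows where the precision runs out.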

The summarize command does not summarize character variables. I'm not sure what the mean of values like "476594335:476594339" might be. The describe and codebook commands will give some basic information about any variable, be it string or numeric.

Here is a technique for using encode to generate a numeric variable whose values correspond to the string values and whose value labels reproduce the string variable values.

Code:

. encode id, generate(idnum)

. list, clean

             pidp        apidp                      id                   idnum  
  1.    476594335    476594339     476594335:476594339     476594335:476594339  
  2.   1156943847   1156943851   1156943847:1156943851   1156943847:1156943851  
  3.    748322335    748322339     748322335:748322339     748322335:748322339  
  4.   1428284927   1428284931   1428284927:1428284931   1428284927:1428284931  
  5.   1089950939   1089950943   1089950939:1089950943   1089950939:1089950943  

. list, clean nolabel

             pidp        apidp                      id   idnum  
  1.    476594335    476594339     476594335:476594339       4  
  2.   1156943847   1156943851   1156943847:1156943851       2  
  3.    748322335    748322339     748322335:748322339       5  
  4.   1428284927   1428284931   1428284927:1428284931       3  
  5.   1089950939   1089950943   1089950939:1089950943       1  

. generate x = runiform()

. mean x, over(idnum)

Mean estimation                   Number of obs   =          5

    _subpop_1: idnum = 1089950939:1089950943
    _subpop_2: idnum = 1156943847:1156943851
    _subpop_3: idnum = 1428284927:1428284931
    _subpop_4: idnum = 476594335:476594339
    _subpop_5: idnum = 748322335:748322339

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
x            |
   _subpop_1 |   .6866152          .             .           .
   _subpop_2 |   .1196613          .             .           .
   _subpop_3 |   .6950234          .             .           .
   _subpop_4 |   .3913819          .             .           .
   _subpop_5 |   .7542434          .             .           .
--------------------------------------------------------------


Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#6

30 Oct 2018, 09:35

Dear Nick and William, thanks for your answers.

Regarding #4, of course I do not want to average the IDs; I want to look at the average of a variable of interest (say, health) within each dyad.
If the ID is a string, I cannot do this.

Regarding #5, I tried - encode id, gen(idnum) - but Stata does not proceed, saying - too many values -.

I also tried - destring id, gen(idnum) -, but this does not work either: Stata says - id contains nonnumeric characters; no generate -.

Basically, I need Stata to recognize each dyadic ID as a case on which I can operate (of course, not numerically, but using it as the unit of analysis).

Is that possible?

Thanks a lot, G
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#7

30 Oct 2018, 09:58

If your goal is to generate a unique identifier, simply use

Code:

egen newid= group(pidp apidp)
Nick Cox

Join Date: Mar 2014

Posts: 35721
#8

30 Oct 2018, 10:05

You asked

Actually, when I run -sum id- Stata signals 0 cases: basically it is as if this variable is not recognized by Stata.

Do you have any clue why?

and so we answered the question!

I don't see what other command you are using to get means. You don't give exact syntax. The mean command will object to a string variable as a group variable; in any case, since you have more than 65536 distinct identifiers, you would be unlikely to find a table of that many means practical. You can go

Code:

egen mean = mean(whatever), by(id)

and then the means will be in a new variable. In practice, looking at many thousands of means may be impractical too, but be sure to avoid multiple counting, so that

Code:

egen tag = tag(id)
su mean if tag
histogram mean if tag

would be the kind of syntax you could use.

William Lisowski

Join Date: Dec 2014
Posts: 10150

30 Oct 2018, 10:07

The full documentation for encode (found in the Stata Data Management Reference Manual PDF included with your Stata installation and accessible through Stata's Help menu) tells us that the limit on the number of values is 65,536. Apparently the number of distinct dyads in your dataset is greater than that.
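One way to check how many distinct dyads there are, and so whether the 65,536 limit is really the obstacle, might look like the sketch below. It uses the example data from post #1; the variable name firstobs is only illustrative.

```stata
* Sketch using the example data from post #1
clear
input long(pidp apidp)
 476594335  476594339
1156943847 1156943851
 748322335  748322339
1428284927 1428284931
1089950939 1089950943
end

* firstobs (an illustrative name) is 1 for exactly one observation per
* distinct (pidp, apidp) pair and 0 otherwise
egen byte firstobs = tag(pidp apidp)

* The count reported here is the number of distinct dyads
count if firstobs
```

If the count on the full dataset exceeds 65,536, neither encode nor any other approach based on value labels can cover every dyad.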

The destring command has nothing to offer you: as I wrote earlier, you cannot accurately store 20 digits of dyad identifier in a numeric value.

It seems to me your example of what you need the dyads for is unrealistic: it would take a while to review values for the mean of health in 65,536 different dyads.

If the actual dyad identifiers are unimportant, you can use the egen function group() to create numeric values lacking value labels.

Code:

. egen long idnum = group(pidp apidp)

. list, clean

             pidp        apidp   idnum  
  1.    476594335    476594339       1  
  2.   1156943847   1156943851       4  
  3.    748322335    748322339       2  
  4.   1428284927   1428284931       5  
  5.   1089950939   1089950943       3  

. generate x = runiform()

. mean x, over(idnum)

Mean estimation                   Number of obs   =          5

            1: idnum = 1
            2: idnum = 2
            3: idnum = 3
            4: idnum = 4
            5: idnum = 5

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
x            |
           1 |   .3913819          .             .           .
           2 |   .7542434          .             .           .
           3 |   .6866152          .             .           .
           4 |   .1196613          .             .           .
           5 |   .6950234          .             .           .
--------------------------------------------------------------

But to get better advice on this problem, you should provide concrete examples of the sorts of uses you hope to put these dyads to: examples that make sense in the context of 65,000+ dyads. Perhaps there are better approaches.


Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#10

30 Oct 2018, 10:15

Dear all, thanks for your answers, but I have not solved the problem yet.

What I would like to obtain is a unique dyadic ID that identifies each couple in my waves (I am using panel data over 7 waves). The command - egen newid= group(pidp apidp) - works, but if I use it I cannot trace the same couple in different waves (each person of course has the same ID in every wave, so by combining the two persons' IDs into a dyadic ID I can trace the dyad across the waves).

I am not specifically interested in calculating means or anything of the sort; that was only an example to show that I would like to use this dyadic ID as the unit of analysis and then work with it.

Sorry if I was not clear enough.
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#11

30 Oct 2018, 10:35

The command - egen newid= group(pidp apidp) - works, but if I use it I cannot trace the same couple in different waves

You can use labmask (Stata Journal; Nick Cox) to label the variable generated by egen group() with the string variable that you got from William's code.

Code:

labmask newid, values(id)

The value labels should allow you to identify group combinations.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

30 Oct 2018, 12:32

In response to post #11, since labmask is creating a value label, it too will be subject to the limit of 65,536 distinct values that caused encode to fail. See help limits to see that this is an overall property of value labels, as well as a limit to encode.

Last edited by William Lisowski; 30 Oct 2018, 12:34.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

30 Oct 2018, 12:45

In response to post #10, two comments.

1) If your seven waves of panels are in different datasets, you will need to append them into a single dataset in order to ensure that the same numeric ID is assigned to a given dyad across each of the 7 waves. Neither encode nor egen group() will necessarily create the same number in different datasets, and there is no way to combine two 10-digit numbers into a 20-digit number.

2) In the absence of

examples of the sorts of uses you hope to put these dyads to

as I recommended in post #9, there is little further advice to offer.

You assert what you want, with no realistic examples of actual commands that you feel they are required for.

I assert that there are approaches to accomplishing what you need in other ways, but lacking examples of what you need, cannot proceed further.
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#14

30 Oct 2018, 16:39

In response to post #11, since labmask is creating a value label, it too will be subject to the limit of 65,536 distinct values that caused encode to fail. See help limits to see that this is an overall property of value labels, as well as a limit to encode.

You are correct, William Lisowski. I agree that the best approach will be to append the data sets and use - egen group() - to create a unique identifier. However, I highly doubt that 20-digit precision is needed in this context. I think you can drop the last 4 digits of each of your identifiers and still create a unique couple identifier.

Code:

gen double newid = real(substr(string(pidp, "%10.0f"), 1, strlen(string(pidp, "%10.0f")) - 4) + substr(string(apidp, "%10.0f"), 1, strlen(string(apidp, "%10.0f")) - 4))
format newid %12.0f
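
The append-then-group approach mentioned above could be sketched as follows. This is only an illustration: the year values and the variable name dyad are made up, and the small input block stands in for the real appended waves. Because egen group() numbers the distinct (pidp, apidp) pairs in the combined data, the same pair receives the same number in every wave, and the result is numeric, so it can be passed to xtset.

```stata
* Sketch with made-up wave years; in practice this would be the
* dataset obtained by appending all waves.
clear
input long(pidp apidp) int year
 476594335  476594339 2009
 476594335  476594339 2010
1156943847 1156943851 2009
1156943847 1156943851 2010
end

* dyad (an illustrative name) takes the same value for the same pair
* of IDs in every wave; long leaves room for more than 65,536 groups
egen long dyad = group(pidp apidp)

* A numeric panel identifier is what xtset requires
xtset dyad year
```

Note that group() treats (A, B) and (B, A) as different pairs, so if the order of the two persons can vary between waves, the IDs would need to be put in a consistent order first.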
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#15

05 Nov 2018, 05:23

Dear all,
thanks for all your responses.

Let me try to explain myself better: I appended 7 waves of panel data, and I would like to do some panel analyses with the dyadic ID as the unit of analysis.

I used the command - generate id = trim(strofreal(pidp,"%10.0f")) + ":" + trim(strofreal(apidp,"%10.0f")) - to create a unique dyadic ID across waves, and it seems to work.
The problem is that when I give Stata the command - xtset id year - it replies: varlist: id: string variable not allowed.

How can I create a unique, non-string dyadic ID that I can use as the panel unit of analysis?
My data follow:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str21 id float year
"68006135:68006127" 2009
"68006139:68006127" 2009
"68006139:68006131"    .
"68006135:68006131" 2009
"68007495:68007487" 2009
end

Thanks, best, G