Duplicates across multiple variables

Panos Markou

Join Date: May 2014
Posts: 19

Duplicates across multiple variables

04 Oct 2017, 07:48

Hello everyone,

I have a data set of unique customers (rows) and up to 10 of their suppliers (columns). My data set is in wide format because I am taking averages across years (columns). I am trying to calculate the average size of the suppliers in my data set per year, but this is complicated by the fact that two customers can have the same supplier. When taking an average, thus, I do not want to double-count suppliers.

Additionally, it is not necessary that a customer has 10 suppliers, it can be less (see C4, Supplier 2 does not exist below).

An example of my data set is as follows.

Customer	Assets (2009)	Assets (2008)	Supplier 1	Assets (2009)	Assets (2008)	Supplier 2	Assets (2009)	Assets (2008)	... Supplier 10
C1			AAA			BBB
C2			CCC			DDD
C3			AAA			EEE
C4			FFF			.
C5			GGG			AAA

So for example, how can I get the average assets of the suppliers in 2009? Is there a way to remove duplicates across multiple variables?

Thank you!

Panos

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

04 Oct 2017, 08:49

Giving us a general impression of the layout of your data is nice, but it's not a substitute for an actual Stata (not spreadsheet) data example prepared with the -dataex- command. So before asking for help with this, you should import your data to Stata and then select an example to show with -dataex-. (If you are not familiar with -dataex-, see FAQ #12.)

Next, your question is confusing in several ways.

1. You refer to supplier size, but there is no variable for size in what you show.

2. What is the average you are trying to calculate here? Why does it matter if two different customers can have the same supplier if you are calculating the average across rows? Or do you mean that the same supplier can appear twice in a single row--that would lead to double counting. Or are you trying to simply identify a census of all suppliers to all customers and then calculate the average for them, counting each one once no matter how many customers they serve?
Comment

Panos Markou

Join Date: May 2014
Posts: 19

04 Oct 2017, 09:30

Dear Clyde, thanks for the reply. Let me try to clear up what I am trying to do.

I want the average assets in 2009 (and then 2008) of all (the population/census) of suppliers in the data set, counting each supplier only once no matter how many customers they serve. I am not looking for an average across rows. I am looking for the average across rows and columns, i.e. all the suppliers.

Ideally, I will end up with two columns called AvgAssetsAllSupps09 and AvgAssetsAllSupps08. Within each column, each row will be populated with the exact same number, since I am looking for the average assets of the population of suppliers.

In the extract of my data set below, I have used -sort Supp1ID- to show that a supplier can serve multiple customers. Supplier 5650189, for example, suppliers Customer 386650303 and Customer 410617005. The ID variables are identifier numbers (similar to gvkey) and although they are numeric, they have no meaning as such.

Please let me know if this clears things up.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(CustID C_Assets09 C_Assets08 Supp1ID S1_Assets09 S1_Assets08 Supp2ID S2_Assets09 S2_Assets08)
386650303   366   375  5650189  8110  7780         .     .     .
410617005  1030   956  5650189  8110  7780         .     .     .
349767582     .   756  6080170  2116  2185         .     .     .
330097395  7533  6251  6180194  1137  1554         .     .     .
304351943  6175  6173  6580195 45023 54932 389044140  9055 11247
437687601   669   684  6850234  3726  3598         .     .     .
343117370   992  1037  6850259  1283  1325         .     .     .
428773733 18813 19594  6850259  1283  1325         .     .     .
453485229   387   678  7080773  6512  7196 418101689  3893  4135
430075796   378   429  7280365  6867  7141         .     .     .
380622589 81213 75329  7320294  4686  4301 433972130  1935  1734
331212209  1350  1677 15550601     .  7285 572192045  3206  4205
352401731  6421  4030 15550882 58046 75065 413449448  9900 12692
411447840  5559 12037 15550882 58046 75065         .     .     .
300031622  2481  2524 15550882 58046 75065         .     .     .
324180603  5707  5816 15551450 34372 34518 632052734 21866 22212
380498923  1042  1638 15551666  1576  1624         .     .     .
314547290   399   356 15650039 12601 12178         .     .     .
326615556  4990  4905 15650039 12601 12178  45750981  5176  5202
429433808   992  1146 15650617  6935  8139         .     .     .
end

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35734
#4

04 Oct 2017, 09:36

See the tag() function of egen
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30122

04 Oct 2017, 09:58

If you are going to work successfully in Stata you will need to break out of spreadsheet-think. Stata data sets do not have rows and columns. They have observations and variables. I make a point of this not to be pedantic but because the more you think about Stata as if it were a spreadsheet, the more mistakes you will make and the more difficulty you will have finding solutions to problems.

As I alluded to in my earlier response, this task is much easier in long layout, as are most things in Stata. So reshaping the supplier data to long layout is the very first step. Once that is done, as Nick notes in #4, the -egen, tag()- function will enable you to mark a single observation for each supplier. There is another point to think about here: since the same supplier reoccurs many times in the data, it is possible that there are inconsistent evaluations of Assets08 or Assets09 among those many references. Even if your data comes from a highly reliable source, this kind of mistake happens often in real life. You should never trust that multiple references to the same thing will be consistent: you should always verify that.

The following code does, mostly, what you require:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(CustID C_Assets09 C_Assets08 Supp1ID S1_Assets09 S1_Assets08 Supp2ID S2_Assets09 S2_Assets08)
386650303   366   375  5650189  8110  7780         .     .     .
410617005  1030   956  5650189  8110  7780         .     .     .
349767582     .   756  6080170  2116  2185         .     .     .
330097395  7533  6251  6180194  1137  1554         .     .     .
304351943  6175  6173  6580195 45023 54932 389044140  9055 11247
437687601   669   684  6850234  3726  3598         .     .     .
343117370   992  1037  6850259  1283  1325         .     .     .
428773733 18813 19594  6850259  1283  1325         .     .     .
453485229   387   678  7080773  6512  7196 418101689  3893  4135
430075796   378   429  7280365  6867  7141         .     .     .
380622589 81213 75329  7320294  4686  4301 433972130  1935  1734
331212209  1350  1677 15550601     .  7285 572192045  3206  4205
352401731  6421  4030 15550882 58046 75065 413449448  9900 12692
411447840  5559 12037 15550882 58046 75065         .     .     .
300031622  2481  2524 15550882 58046 75065         .     .     .
324180603  5707  5816 15551450 34372 34518 632052734 21866 22212
380498923  1042  1638 15551666  1576  1624         .     .     .
314547290   399   356 15650039 12601 12178         .     .     .
326615556  4990  4905 15650039 12601 12178  45750981  5176  5202
429433808   992  1146 15650617  6935  8139         .     .     .
end

reshape long Supp@ID S@_Assets08 S@_Assets09, i(CustID) j(supplier_num)

//    VERIFY THAT ALL REFERENCES TO ASSETS IN 08 OR 09 ARE
//    CONSISTENT WITHIN SUPPLIER ID
foreach v of varlist S_Assets* {
    by SuppID (`v'), sort: assert `v'[1] == `v'[_N]
}

//    IDENTIFY ONE OBSERVATION PER SUPPLIER
egen flag = tag(SuppID)
foreach x in 08 09 {
    summ S_Assets`x if flag', meanonly
    local avg_assets_all_supps_`x' `r(mean)'
    display `avg_assets_all_supps_`x''
}

Where I depart from your request is at the final step. In Stata it is rarely useful to create a variable that has the same value in all observations. There are exceptions, but I'm not going to assume this is one of them. When you have a single value to store for future use, usually it is better to store it as either a local macro (as I have done here) or as a scalar. If there really is a good reason to make variables out of these two constants, you can just replace my -local avg_assets_all_supps_`x'...- command with a -gen avg_assets_all_supps_`x' = ...- command.

Comment

Panos Markou

Join Date: May 2014

Posts: 19
#6

04 Oct 2017, 10:45

Dear Nick and Clyde, thank you for the help.

Not to seem ungrateful Clyde, but it seems the closing apostrophe in your code should be after the x as such:

Code:

summ S_Assets`x' if flag, meanonly

Thank you so much!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#7

04 Oct 2017, 11:20

but it seems the closing apostrophe in your code should be after the x as such

You are quite right. The strange thing is that in the do-file I worked with to test this code, that closing apostrophe is there. I'm can't figure out how it managed not to appear in the post.
Comment

Announcement

Duplicates across multiple variables

Comment

Comment

Comment

Comment

Comment

Comment