Summing duplicate observations within a dataset with 1000+ variables

Sergio Paolinetti

Join Date: Aug 2018
Posts: 5

Summing duplicate observations within a dataset with 1000+ variables

15 Aug 2018, 09:14

Hello everyone! I've been trying to sum duplicate observations and I have read some solutions in past posts, but most of them show how to sum observations based on a single variable.

My dataset looks like this

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str237 Banco str39 CredActComRest07_06 str36 VII_IndRentabilidad11_06
""                                          ""        ""                 
"             Bancos de desarrollo"         "46415"   "20.13591217982227"
"     Banca Comercial"                      "1696925" "26.2791427741513" 
"     Banca Hipotecaria"                    "0"       "2.170354332787665"
"     Banca Universal"                      "8947708" "32.66473880618365"
"     Banca de Inversión"                  "14442"   "5.154437519901742"
"     Empresas de Arrendamiento Financiero" "0"       "10.20247933884298"
"     Entidades de Ahorro y Préstamo"      "398240"  "91.81085437936594"
"     Fondos del Mercado Monetario"         "0"       ".3621510085484482"
"100% BANCO"                                ""        "1.981099740488665"
"100% BANCO"                                ""        ""                 
"100% BANCO"                                ""        ""                 
end

I couldn't attach all the variables but here is an example. I would like to sum the duplicate observation "100% BANCO" with a code that takes into account all the variables. I don't know if that's possible but it would be very helpful.

Thanks,
Sergio

Tags: None

Wessel de Kroo

Join Date: Feb 2017

Posts: 92
#2

15 Aug 2018, 09:19

https://www.statalist.org/forums/for...m-of-variables

replace for your relevant vars
Comment
Sergio Paolinetti

Join Date: Aug 2018

Posts: 5
#3

15 Aug 2018, 09:29

Thanks for the link Wessel, but I think it doesn't quite help me. Maybe I'm just not getting it right. But creating a conditional for each observation within so many variables would take me forever to accomplish, unless (as I stated before) I didn't understand exactly how that code could help me.

Nonetheless, thank you for showing interest
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

15 Aug 2018, 09:50

I don't understand what you mean by "sum" here. You have all string variables and you have missing values for 100% BANCO. What would new_variable look like for 100% BANCO? What would it look like for non-duplicate observations?

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Sergio Paolinetti

Join Date: Aug 2018

Posts: 5
#5

15 Aug 2018, 10:00

Originally posted by Carole J. Wilson View Post

I don't understand what you mean by "sum" here. You have all string variables and you have missing values for 100% BANCO. What would new_variable look like for 100% BANCO? What would it look like for non-duplicate observations?

There are some missing values but these observations have some values for several variables across the data set. The new_variable would look like 100% BANCO, I want to combine those three observations across the whole data set. Something like this

Banco X Y Z ...
100% BANCO 0 0 3
100% BANCO 1 0 0
100% BANCO 0 0 4

and transform it into this (without changing the other observations different from 100% BANCO)

Banco X Y Z ...
100% BANCO 1 0 7

Like I said before, I don't know if it is possible

Thank you in advance
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2416

15 Aug 2018, 10:04

I shared Carole's puzzlement about the intent here. In what I'm hoping is a lucky guess, I'm going to proceed thinking that Sergio has several continuous variables (say x, y z) and he wants to sum their values within groups defined by Banco and some other collection of duplicating variables also not indicated here. He then wants to usse those sums as the values of x, y, z, with just one observation per value of the grouping variable(s) On that presumption, -collapse- will do the trick.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str237 Banco str39 CredActComRest07_06 str36 VII_IndRentabilidad11_06
""                                          ""        ""                
"             Bancos de desarrollo"         "46415"   "20.13591217982227"
"     Banca Comercial"                      "1696925" "26.2791427741513"
"     Banca Hipotecaria"                    "0"       "2.170354332787665"
"     Banca Universal"                      "8947708" "32.66473880618365"
"     Banca de Inversión"                  "14442"   "5.154437519901742"
"     Empresas de Arrendamiento Financiero" "0"       "10.20247933884298"
"     Entidades de Ahorro y Préstamo"      "398240"  "91.81085437936594"
"     Fondos del Mercado Monetario"         "0"       ".3621510085484482"
"100% BANCO"                                ""        "1.981099740488665"
"100% BANCO"                                ""        ""                
"100% BANCO"                                ""        ""                
end
// create some example variables
gen x = runiform()
gen y = runiform()
gen z = runiform()
//
collapse (sum) x y z, by(Banco)  // Besides Banco, put in whatever other duplicating variables are of interest

Last edited by Mike Lacy; 15 Aug 2018, 10:09.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

15 Aug 2018, 10:29

Let me add several thoughts to Mike's solution.

First, however you imported your data, you should have avoided the second row, which was all blank, and caused all your numeric variables to be imported as string variables. You must resolve this problem before you can accomplish anything useful with this data. As it stands now, you need to destring 1000+ numeric variables, not an appealing prospect. This is why Mike generated numeric variables x, y, and z to demonstrate his collapse command - you cannot sum strings.

Second, it would be good practice to begin with

Code:

replace Banco = trim(Banco)

to get rid of leading and trailing blanks, although the collapse Mike recommends will take care of that problem.

Third, since you have, per the title of this thread, 1000+ variables, you can in theory avoid listing 1000+ variables by using varlist notation (see the output of help varlist for details)

Code:

collapse (sum) x-z, by(Banco)

where x is the first variable to sum and z is the last variable to sum, in the order they appear in the data.

This may not work because your variable names are so long. There are two possibilities. One is to use the rename command to systematically rename them (see the output of help rename group for details), for example, shortening CredActComRest to CACR

Code:

rename (CredActComRest*) (CACR*)

The alternative to renaming variables is to reshape them from a wide layout to a long layout, which is probably what you will need for you eventual analysis. I encourage you to read and consider the output of help reshape to accomplish this. You really don't want to be working with 1000+ variables, especially when they seem to be time series.

Last edited by William Lisowski; 15 Aug 2018, 10:42.
1 like
Comment

Announcement

Summing duplicate observations within a dataset with 1000+ variables

Comment

Comment

Comment

Comment

Comment

Comment