Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Summing duplicate observations within a dataset with 1000+ variables

    Hello everyone! I've been trying to sum duplicate observations and I have read some solutions in past posts, but most of them show how to sum observations based on a single variable.

    My dataset looks like this

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str237 Banco str39 CredActComRest07_06 str36 VII_IndRentabilidad11_06
    ""                                          ""        ""                 
    "             Bancos de desarrollo"         "46415"   "20.13591217982227"
    "     Banca Comercial"                      "1696925" "26.2791427741513" 
    "     Banca Hipotecaria"                    "0"       "2.170354332787665"
    "     Banca Universal"                      "8947708" "32.66473880618365"
    "     Banca de Inversión"                  "14442"   "5.154437519901742"
    "     Empresas de Arrendamiento Financiero" "0"       "10.20247933884298"
    "     Entidades de Ahorro y Préstamo"      "398240"  "91.81085437936594"
    "     Fondos del Mercado Monetario"         "0"       ".3621510085484482"
    "100% BANCO"                                ""        "1.981099740488665"
    "100% BANCO"                                ""        ""                 
    "100% BANCO"                                ""        ""                 
    end
    I couldn't attach all the variables but here is an example. I would like to sum the duplicate observation "100% BANCO" with a code that takes into account all the variables. I don't know if that's possible but it would be very helpful.

    Thanks,
    Sergio

  • #2
    https://www.statalist.org/forums/for...m-of-variables

    replace for your relevant vars

    Comment


    • #3
      Thanks for the link Wessel, but I think it doesn't quite help me. Maybe I'm just not getting it right. But creating a conditional for each observation within so many variables would take me forever to accomplish, unless (as I stated before) I didn't understand exactly how that code could help me.

      Nonetheless, thank you for showing interest

      Comment


      • #4
        I don't understand what you mean by "sum" here. You have all string variables and you have missing values for 100% BANCO. What would new_variable look like for 100% BANCO? What would it look like for non-duplicate observations?
        Stata/MP 14.1 (64-bit x86-64)
        Revision 19 May 2016
        Win 8.1

        Comment


        • #5
          Originally posted by Carole J. Wilson View Post
          I don't understand what you mean by "sum" here. You have all string variables and you have missing values for 100% BANCO. What would new_variable look like for 100% BANCO? What would it look like for non-duplicate observations?
          There are some missing values but these observations have some values for several variables across the data set. The new_variable would look like 100% BANCO, I want to combine those three observations across the whole data set. Something like this

          Banco X Y Z ...
          100% BANCO 0 0 3
          100% BANCO 1 0 0
          100% BANCO 0 0 4

          and transform it into this (without changing the other observations different from 100% BANCO)

          Banco X Y Z ...
          100% BANCO 1 0 7

          Like I said before, I don't know if it is possible

          Thank you in advance

          Comment


          • #6
            I shared Carole's puzzlement about the intent here. In what I'm hoping is a lucky guess, I'm going to proceed thinking that Sergio has several continuous variables (say x, y z) and he wants to sum their values within groups defined by Banco and some other collection of duplicating variables also not indicated here. He then wants to usse those sums as the values of x, y, z, with just one observation per value of the grouping variable(s) On that presumption, -collapse- will do the trick.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str237 Banco str39 CredActComRest07_06 str36 VII_IndRentabilidad11_06
            ""                                          ""        ""                
            "             Bancos de desarrollo"         "46415"   "20.13591217982227"
            "     Banca Comercial"                      "1696925" "26.2791427741513"
            "     Banca Hipotecaria"                    "0"       "2.170354332787665"
            "     Banca Universal"                      "8947708" "32.66473880618365"
            "     Banca de Inversión"                  "14442"   "5.154437519901742"
            "     Empresas de Arrendamiento Financiero" "0"       "10.20247933884298"
            "     Entidades de Ahorro y Préstamo"      "398240"  "91.81085437936594"
            "     Fondos del Mercado Monetario"         "0"       ".3621510085484482"
            "100% BANCO"                                ""        "1.981099740488665"
            "100% BANCO"                                ""        ""                
            "100% BANCO"                                ""        ""                
            end
            // create some example variables
            gen x = runiform()
            gen y = runiform()
            gen z = runiform()
            //
            collapse (sum) x y z, by(Banco)  // Besides Banco, put in whatever other duplicating variables are of interest
            Last edited by Mike Lacy; 15 Aug 2018, 10:09.

            Comment


            • #7
              Let me add several thoughts to Mike's solution.

              First, however you imported your data, you should have avoided the second row, which was all blank, and caused all your numeric variables to be imported as string variables. You must resolve this problem before you can accomplish anything useful with this data. As it stands now, you need to destring 1000+ numeric variables, not an appealing prospect. This is why Mike generated numeric variables x, y, and z to demonstrate his collapse command - you cannot sum strings.

              Second, it would be good practice to begin with
              Code:
              replace Banco = trim(Banco)
              to get rid of leading and trailing blanks, although the collapse Mike recommends will take care of that problem.

              Third, since you have, per the title of this thread, 1000+ variables, you can in theory avoid listing 1000+ variables by using varlist notation (see the output of help varlist for details)
              Code:
              collapse (sum) x-z, by(Banco)
              where x is the first variable to sum and z is the last variable to sum, in the order they appear in the data.

              This may not work because your variable names are so long. There are two possibilities. One is to use the rename command to systematically rename them (see the output of help rename group for details), for example, shortening CredActComRest to CACR
              Code:
              rename (CredActComRest*) (CACR*)
              The alternative to renaming variables is to reshape them from a wide layout to a long layout, which is probably what you will need for you eventual analysis. I encourage you to read and consider the output of help reshape to accomplish this. You really don't want to be working with 1000+ variables, especially when they seem to be time series.
              Last edited by William Lisowski; 15 Aug 2018, 10:42.

              Comment

              Working...
              X