Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question about -merge- and dropping duplicates and variables to focus only on IDs itiretally

    Hi everyone,

    I have big datasets (on average, 3,500,000 observations for each monthly file from January 2021 to July 2023) and I need to make them smaller to be able to work with.

    I want to keep only the IDs, drop duplicates, and save them as monthly IDs, then it would become a quite small dataset that you can put together.
    The variables of my monthly files are always the same, as well as their name ("export_telemedida_yyyymm", from export_telemedida_202101 to export_telemedida_202307).

    I want to produce this jointly with this nice post written by Daniel Schaefer:

    Here is the post: https://www.statalist.org/forums/for...70#post1732070

    Here is a dataex:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long v1 str96 id long fecha_consumo float(v4 v5)
    0 "C22D34923F76E418EBAF2DF336E314232FD494E42A38CB85B182821164E849971C7EEEDB9C49717C1E20F82A7127E51E" 20210122 .103 .103
    end
    Could anyone help me please?

    Thank you in advance.

    Michael
    Last edited by Michael Duarte Goncalves; 31 Oct 2023, 05:28.

  • #2
    I add to post #1 a problem that I may have.My id variable is a string. It should be written the same way across monthly files. But, it may contain spaces entered incorrectly. Is there a way to check that?
    1. It it useful to use for example egen id2= group(id) (encode is not suitable, too many values) in this case to omit the string's spaces problem?
    2. Is it possible to use a loop that allows me to see this in one go?
    Thanks again.

    Michael
    Last edited by Michael Duarte Goncalves; 31 Oct 2023, 05:34.

    Comment


    • #3
      I don't understand #1 -- what does itiretally mean? It may be a typo, but I can't decode it.

      #2 is easier, and note that the egen function group() will not ignore differences in spacing which to Stata are differences in string values, regardless of how trivial they seem to a person reading the data.

      The purpose of group() is not to lump together values that are similar in meaning, but just to map observations identical in value or values to the same integer.

      To adjust for differences in spacing, you can make progress by forcing strings through trim(itrim()) but that won't solve all problems.

      Comment


      • #4
        Hi Nick,

        Thank you for your feedback. Sorry for the typo: I meant "in an iterative way" in the post #1.
        To avoid to do that "manually" for every single file.

        Ok, so the way I understood the function of egen function group() is not the right way...

        Sorry I don't explain myself well. Basically, I want the following:
        1. I have a database which must contain the households subscribed to an electricity contract
        2. I then have a database with monthly files showing electricity consumption by contract, from January 2021 to January 2023. I'd like to check (without even needing to use merge) whether the ids between its various files coincide.
        3. Then, I want to merge my master dataset (which is one of my monthly files) with the households dataset: merge m:1 id using “export_contratos.dta”
        4. I finally only want to keep the ids, month and year, and drop any duplicated IDs. month and year should be in two different variables: one for month, another for year.
        5. date is stored as the variable's name "fecha_consumo", and is not in date format. Instead, it is in long %12.0g format.
        Here is an example from my master "export_telemedida_202102.dta"·:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str96 id long fecha_consumo
        "D15350E39FBFDAED44B3FB7FF1A0623BC1F8F41E32127A2B85D2D75E8F1165D79F2DC9ADCC236DD6247C2389391A90B8" 20210204
        "4D39096F9520A522D69B7EE0757F202D5198557E3A018F0A48426BD5C9AC529368DB300CEEE2E866C7F0F2E38A0A345F" 20210216
        "C75FC2ED051A76658F80931E52317DB5DF09B3FDA325B6249A75DFDE53004A71B04AD5CF03BF51FE5AF279B09CC42AC4" 20210204
        "C1899C0F8DB18902234A86F529C0626B210BF79926E11AF871EA492D6C5D7D13AC0F3E58CC6361FB9B9915A7AFEA4782" 20210204
        "E5469BFD2C1824630A3D8F3CC20C42143E890FA52190A43F24B7B69B29F1B650BC9AA04C67747430553826D26B25C79A" 20210204
        end

        Here is an example of my using file "export_contractos.dta":


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str96 id long sp_zipcode str5 tariff_ekon_id
        "988383C055292068FEC5A64831BB101AB2D83CB35707D584B81525FFC8FE936847870F57B9AC7240BE31F12644B87538" 13300 "20TD"
        "F48AD271B727894C6958C4F0C62023AB7B1B7BE91E0041B7F37D8F2819643F6DD2F4FEA0A71A491B2E098C474F9C6306" 41020 "20TD"
        "BB5116E205F0177339A289DCC28ECBE60576B914D82A35BFA237543C6DDF52665878E3A99B6EBCC7FFE88BF83708F8BA"  4117 "20TD"
        "AD4879F418843A5363B17B64708DDF87B81D2D4F0C86C7C5B2B557A8BA83C245770DFF26B37058BE6FA89F7CEC8AF2C4" 41388 "20TD"
        "0651254E8A17363FED59FAB8515603036111DE8C09156DF46A138218B16BDB2A91DA6CE52F34C4DF530EA0B4914262FC" 43569 "20TD"
        end

        Best,

        Michael
        Last edited by Michael Duarte Goncalves; 31 Oct 2023, 08:21.

        Comment


        • #5
          Hi Nick,

          Sorry I don't explain myself well. Basically, I want the following:
          1. I have a database which must contain the households subscribed to an electricity contract
          2. I then have a database with monthly files showing electricity consumption by contract, from January 2021 to January 2023. I'd like to check (without even needing to use merge) whether the ids between its various files coincide.
          3. Then, I want to merge my master dataset (which is one of my monthly files) with the households dataset: merge m:1 id using “export_contratos.dta”
          4. I finally only want to keep the ids, month and year, and drop any duplicated IDs. month and year should be in two different variables: one for month, another for year.
          5. date is stored as the variable's name "fecha_consumo", and is not in date format. Instead, it is in long %12.0g format.
          Here is an example from my master "export_telemedida_202102.dta"·:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str96 id long fecha_consumo
          "D15350E39FBFDAED44B3FB7FF1A0623BC1F8F41E32127A2B85D2D75E8F1165D79F2DC9ADCC236DD6247C2389391A90B8" 20210204
          "4D39096F9520A522D69B7EE0757F202D5198557E3A018F0A48426BD5C9AC529368DB300CEEE2E866C7F0F2E38A0A345F" 20210216
          "C75FC2ED051A76658F80931E52317DB5DF09B3FDA325B6249A75DFDE53004A71B04AD5CF03BF51FE5AF279B09CC42AC4" 20210204
          "C1899C0F8DB18902234A86F529C0626B210BF79926E11AF871EA492D6C5D7D13AC0F3E58CC6361FB9B9915A7AFEA4782" 20210204
          "E5469BFD2C1824630A3D8F3CC20C42143E890FA52190A43F24B7B69B29F1B650BC9AA04C67747430553826D26B25C79A" 20210204
          end

          Here is an example of my using file "export_contractos.dta":


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str96 id long sp_zipcode str5 tariff_ekon_id
          "988383C055292068FEC5A64831BB101AB2D83CB35707D584B81525FFC8FE936847870F57B9AC7240BE31F12644B87538" 13300 "20TD"
          "F48AD271B727894C6958C4F0C62023AB7B1B7BE91E0041B7F37D8F2819643F6DD2F4FEA0A71A491B2E098C474F9C6306" 41020 "20TD"
          "BB5116E205F0177339A289DCC28ECBE60576B914D82A35BFA237543C6DDF52665878E3A99B6EBCC7FFE88BF83708F8BA"  4117 "20TD"
          "AD4879F418843A5363B17B64708DDF87B81D2D4F0C86C7C5B2B557A8BA83C245770DFF26B37058BE6FA89F7CEC8AF2C4" 41388 "20TD"
          "0651254E8A17363FED59FAB8515603036111DE8C09156DF46A138218B16BDB2A91DA6CE52F34C4DF530EA0B4914262FC" 43569 "20TD"
          end
          But I am also afraid that some typos could existing in these long strings (some extra spaces at the beginning or at the end, for example).


          Best,

          Michael
          Last edited by Michael Duarte Goncalves; 31 Oct 2023, 08:24.

          Comment


          • #6
            The examples you give don't show any spaces at all, although naturally that says nothing about the possibility of spaces elsewhere in your datasets.

            I don't understand the typo question otherwise. If some people have to type these account numbers into a system, the scope for typos seems enormous, but is anyone expected to do that? Otherwise the whole point of long and complicated identifiers is usually to guarantee uniqueness and correctness when computed. In other words they should be produced automatically.

            Otherwise my contribution to your question is having written the duplicates command back in history.
            Last edited by Nick Cox; 31 Oct 2023, 09:13.

            Comment


            • #7
              Hi again Nick,

              Thank you very much for your feedback.
              I always doubt myself when I code in Stata and wonder if I've made mistakes...It's a good thing this platform exists, at least to minimise errors.
              So it would seem that the data could potentially contain a data encryption problem because when I do a merge, nothing is merged.

              Once again, as usual, I'd like to thank you for your comments and help, which are always welcome. I hope one day to reach your level of excellence, but the road is long and winding...

              My problem is now solved.
              Have a nice rest of the day.

              Michael
              Last edited by Michael Duarte Goncalves; 31 Oct 2023, 09:58.

              Comment

              Working...
              X