Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifier with huge codes

    Hallo everybody and thanks in advance for any kind of suggestion.
    I am working on a large dataset about investors in mutual funds (more than 10 millions obs); every person in my dataset is currently identified by an ID code (numeric) built merging asset manager code, customer code and year code. This means that the usual shape is like 370000002973312003 (firm 37, customer 297331, year 2003): i do not have the single parts of the code, but only the whole code.
    Now, the issue is that i have several observations for each person in every year, but i need to work with a typical panel setup. Usually i solve the issue using a tag command that gives me a unique identifier of each person-year combination; this time, however, i fail in using this shortcut... In particular the issue seems to be the way Stata uses to store this ID code; when i use concatenate or destring commands i obtain strange results. I was wondering if there is a way to re-code all the customers in a more convenient way through a single command.
    Thanks
    Simone

  • #2
    If I understood right, you may type something like:

    Code:
    set obs 1
    input str40 myvar
    370000002973312003
    gen year = substr(myvar, -4, 4)
    gen id = substr(myvar, -10, 6)
    gen firm = substr(myvar, 1,2)
    list
    Best regards,

    Marcos

    Comment


    • #3
      Thank you very much Marcos
      I tried substr option this morning after some trials with concatenate commands, but it hadly works. In effect, the original ID is double type (not a string): i see 370000002973312003 in edit page, but the stored value (when i go on the single cell) is 3.70000003e+13.
      I managed to obtain a FIRM-CUSTOMER ID without the year code (i.e. 37000000297331 in my previous example), but again the stored value (when i go on the single cell) is 3.70000003e+13.
      I guess that this one is the major issue... E.g. when i create a test variable xyz = ID - 1, (where ID is 37000000297331) in edit page I do not see 37000000297330, but 36999998210048.
      It feels a bit strange and frustrating...
      I remember an old post by Nic who said that long ID are an issue...


      Comment


      • #4
        It sounds like you may have data that is encoded.
        The best way to fix this is if you coudl post a data example with dataex, please the FAQ on how and why: https://www.statalist.org/forums/help#stata
        Or just type
        Code:
        dataex ID in 1/20

        Comment


        • #5
          Thanks Jorrit
          It is true... i sometimes forgot the basic rules of this forum :-)

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input double id
          37000000297331
          37000000492796
          37000000492796
          37000000494167
          37000000558933
          37000000558933
          37000000558933
          37000000760858
          37000000907153
          37000000907153
          37000000907153
          37000000963429
          37000000963429
          37000000963882
          37000001033722
          37000001041925
          37000001041925
          37000001050141
          37000001050141
          37000001069203
          end
          Code:
          -----------------------------------------------------------------------------------------------------------------------------------------
                        storage  display     value
          variable name   type   format      label      variable label
          -----------------------------------------------------------------------------------------------------------------------------------------
          id              double %18.0g            
          year            int    %8.0g
          This is the shape of my data...
          Thanks again

          Comment


          • #6
            Code:
            tostring id, gen(sid) format(%18.0g)
            Then follow Marcos Almeida 's advice in #2. Your example data does not follow the description in #1, e.g., the id ends with 4 digits identifying the year.

            Comment


            • #7
              Thanks Andrew
              the example followed my last message: i managed to obtain two separate parts of the code from the original source.
              I tried your and Marcos' hints and they worked fine; there is a final issue with customer code (it is still quite large, since i have more than 10 millions of persons).
              To create the unique identifier i need at least 2+8+2=14 digits, and Stata translate every number of this kind in exponential form... So I go back to the beginning of the problem (sigh).

              Thanks again for all your support!!!
              Simone

              Comment


              • #8
                You may wish to read this thread.
                Best regards,

                Marcos

                Comment


                • #9
                  The best way to handle this as described in Marcos's helpful link is to leave the id variable as a string. However, since you need to create a panel data set, you need a numeric identifier to -xtset- the data. 10 million is 8 digits, so there are efficiency gains in creating a new numeric id variable. -encode- will not work here as it has observation limits, but you can use the -group- function of egen instead.

                  Code:
                  tostring id, gen(sid) format(%18.0g)
                  egen double newid= group(sid)
                  However, I wonder how many hours it will take you to run panel regressions with 10 million observations!

                  Comment


                  • #10
                    Thanks a lot everybody... problem solved!
                    It is true, Andrew; my old laptop struggles also when i ask to make a summarize... and the original raw sample was 180 millions obs lol.
                    I will book a weekend to spend abroad while performing the estimations
                    Thanks again and have a lovely week

                    Comment

                    Working...
                    X