  • Identifier for Panel Data set and adding new data

    Hi all,

    I have a fairly large panel data set (70 quarters so far), and I have to add new quarters to this file as they come out. I created a panel id variable with the following code:
    Code:
    egen double cidnew = group(cid)
    where cid is a unique identifier in the data set. As far as I can tell, the only way to keep the panel id matched is to drop cidnew and rerun that code after appending each new quarter. That is very slow given the file size, and I am hoping to speed it up. I have considered combining replace with some if conditions to match the panel id to the new cid's each quarter, but the set of cid's is not the same in every quarter: some leave, some join, and a cid that is missing from the newest quarter can come back in a later one. So I am wondering if anyone has ideas on how I could just add to the panel id variable (cidnew) without dropping it and rerunning the same code. The reason I am using cidnew rather than cid as the panel id is that cid is an alphanumeric string, so it cannot be used when declaring the data set to be panel, and if I convert it to numeric by simply replacing the letters with numbers, it is no longer a unique identifier.

    Thanks in advance for any help.

  • #2
    This is a bit confusing, and a long story too.
    Are you saying that you have tried replacing cidnew for the appended data and 1) that is slow too, so you'd like a different method, or 2) you tried but couldn't figure out how?

    The only alternative I could think of is that you'd create a separate dataset that has info only on cid and cidnew. You could merge that with the new quarterly data, and then append the new quarter to the master database.
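
    A minimal sketch of that crosswalk idea, under some assumptions: the file names (master.dta, crosswalk.dta, newquarter.dta, newquarter_id.dta) are hypothetical, and the new quarter does not yet contain a cidnew variable. Any cid that is new this quarter comes out of the merge with cidnew missing and still needs an id.

    ```stata
    * One-time step: keep one row per cid with its existing cidnew
    * (file names here are hypothetical placeholders)
    use cid cidnew using master.dta, clear
    bysort cid: keep if _n == 1
    save crosswalk.dta, replace

    * Each quarter: attach the known ids to the new quarter only
    use newquarter.dta, clear
    merge m:1 cid using crosswalk.dta, keep(master match) nogenerate

    * cid's not found in the crosswalk now have cidnew == .
    save newquarter_id.dta, replace
    ```

    The point of the design is that the merge only touches the small quarterly file and the one-row-per-cid crosswalk, not the full master file.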


    • #3
      I tried replacing cidnew for the appended data and couldn't figure out a good way to do that. The merging idea would work, but for new cid's that pop up in a given quarter, I would have to assign a cidnew that is unique, which is where I would run into an issue. Also the current file is around 178 GB, which is the reason this isn't a quick process and I am looking for a more efficient way to do this.


      • #4
        Replacing the missing cidnew for companies already in the master dataset is something like:
        Code:
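        bysort cid (cidnew): replace cidnew = cidnew[1] if missing(cidnew)
        Sorting on cidnew within cid puts the non-missing value first, since missing values sort last in Stata, so cidnew[1] is the existing id whenever there is one. I add the 'if missing(cidnew)' here because it might make the operation faster. The resulting values would be the same without it.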

        You're still going to be stuck with all missing values for the cid that appear in the new data but not in the master dataset.

        If each of these newly added cid's has only a single observation in the new data, you could assign cidnew for them by sorting so that the missing values come last and then counting up from the largest existing id:

        Code:
        sort cidnew
        replace cidnew = cidnew[_n-1] + 1 if missing(cidnew)
        If it is more elaborate, please give further details.
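
        For the more elaborate case, where a new cid can have several observations, one possible sketch is to number only the new cid's and offset them by the current maximum. This assumes it is run on the appended data after cidnew has been filled in for every cid already in the master dataset; newgrp and maxid are hypothetical helper names.

        ```stata
        * Largest id already in use (assumes known cid's are filled in)
        summarize cidnew, meanonly
        local maxid = r(max)

        * Number the new cid's 1, 2, 3, ... (newgrp is a temporary helper)
        egen long newgrp = group(cid) if missing(cidnew)

        * Shift those numbers past the existing ids so they stay unique
        replace cidnew = `maxid' + newgrp if missing(cidnew)
        drop newgrp
        ```

        All observations of the same new cid get the same id, and egen group() only has to work on the observations that are still missing rather than the whole file.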
