Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating missing observations with the next available observation

    Hello,

    here is a small part of my dataset. The full dataset is about 50million observations:
    You can see that some observations are missing due to holidays, weekends etc..

    LPERMNO datadate ajexdi prccd
    10001 03jan1989 3 6.313
    10001 04jan1989 3 6.25
    10001 05jan1989 3 6.5
    10001 06jan1989 3 6.125
    10001 09jan1989 3 6.5
    10001 10jan1989 3 6.5
    10001 11jan1989 3 6.25
    10001 12jan1989 3 6.25
    10001 13jan1989 3 6.5
    10001 16jan1989 3 6.5
    10001 17jan1989 3 6.25
    10001 18jan1989 3 6.5
    10001 19jan1989 3 6.375
    10001 20jan1989 3 6.25
    10001 23jan1989 3 6.5
    10001 24jan1989 3 6.25


    Now i want this dataset to merge with another dataset by the unique identifier (LPERMNO) and by the date (datadate).
    Now in my second dataset, there is always one date per year per LPERMNO.
    The issue is that this date is sometimes on a holiday or weekend so i cant merge it fully with this dataset.


    In this dataset i would like to generate new observations for the missing ones and just take the values from the next available date per LPERMNO.
    So for example i would want the missing values 21jan1989 and 22jan1989 to be generated and to take on the values for ajexdi and prccd from 23jan1989.

    So the newly, fully complete dataset would look like this:

    LPERMNO datadate ajexdi prccd
    10001 03jan1989 3 6.313
    10001 04jan1989 3 6.25
    10001 05jan1989 3 6.5
    10001 06jan1989 3 6.125
    10001 07jan1989 3 6.5
    10001 08jan1989 3 6.5

    10001 09jan1989 3 6.5
    10001 10jan1989 3 6.5
    10001 11jan1989 3 6.25
    10001 12jan1989 3 6.25
    10001 13jan1989 3 6.5
    10001 14jan1989 3 6.5
    10001 15jan1989 3 6.5

    10001 16jan1989 3 6.5
    10001 17jan1989 3 6.25
    10001 18jan1989 3 6.5
    10001 19jan1989 3 6.375
    10001 20jan1989 3 6.25
    10001 21jan1989 3 6.5
    10001 22jan1989 3 6.5

    10001 23jan1989 3 6.5
    10001 24jan1989 3 6.25

    To ensure that the newly generated observations are actually quite close to the next one and not years away (As this is stock data and they sometimes start trading again after stopping for years) i would also like to limit the "next available date" to be maximimum 30 days away from the missing date

    How do i accomplish this? Ideally with a command that doesnt take hours as the dataset is quite huge but it doesn't matter if it isn't possible otherwise. Feel free to ask any clarifying questions or you might have a better solution to my problem.

    Thanks
    Nick







  • #2
    In the future, when showing data examples, please use the -dataex- command to do so, as I do below. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Because you did not use -dataex-, I cannot tell if your datadate variable is a string or if it is a true Stata internal format numeric date variable. So the code begins by assuming it's a string and converting it. The code will not work with string date variables.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int lpermno str9 datadate byte ajexdi float prccd
    10001 "03jan1989" 3 6.313
    10001 "04jan1989" 3  6.25
    10001 "05jan1989" 3   6.5
    10001 "06jan1989" 3 6.125
    10001 "09jan1989" 3   6.5
    10001 "10jan1989" 3   6.5
    10001 "11jan1989" 3  6.25
    10001 "12jan1989" 3  6.25
    10001 "13jan1989" 3   6.5
    10001 "16jan1989" 3   6.5
    10001 "17jan1989" 3  6.25
    10001 "18jan1989" 3   6.5
    10001 "19jan1989" 3 6.375
    10001 "20jan1989" 3  6.25
    10001 "23jan1989" 3   6.5
    10001 "24jan1989" 3  6.25
    10001 "31dec2020" 3  6.25
    end
    //  CONVERT DATADATE TO STATA INTERNAL FORMAT DAILY DATE VARIABLE
    gen date = daily(datadate, "DMY")
    assert missing(date) == missing(datadate)
    format date %td
    drop datadate
    rename date datadate
    
    //  FILL IN THE MISSING DATES AND BACKFILL
    //  PRCCD, AJEXDI WITH NEXT OBS CARRIED BACKWARDS
    gen expander = datadate - datadate[_n-1]
    replace expander = 1 if expander > 30 // BUT NOT IF GAP > 30 DAYS
    expand expander
    by lpermno (datadate), sort: replace datadate = datadate[_n-1]+ 1 if expander > 1
    Note: the last observation in the example data here was added by me to test that the code will not fill in when the gap is greater than 30 days--there were no instances of that in the example data you supplied.
    Last edited by Clyde Schechter; 05 Apr 2023, 21:04.

    Comment


    • #3
      Thanks a lot! will keep this in mind when asking my next question.


      Just to make sure:
      With this code it isn't possible that that a new observation is generated if the next available date is further than 30 days away, right? Because when the expander is one, it doesnt expand.
      The expander is also negative for the very first observation for LPERMNO. These should also be replaced with 1, right? Although it looks like it does not have any impact.
      Thanks again!

      Comment


      • #4
        With this code it isn't possible that that a new observation is generated if the next available date is further than 30 days away, right? Because when the expander is one, it doesnt expand.
        Correct. This is what you said you wanted, right?

        The expander is also negative for the very first observation for LPERMNO. These should also be replaced with 1, right? Although it looks like it does not have any impact.
        In the first observation for an lpermno, the value of datadate[_n-1] is missing, and therefore expander = . When expander = ., no expansion of that observation occurs, so while it is not true that expander is "negative" you are correct in observing that the very first observation of LPERMNO is never expanded. That would have to be the case because there would be no evidence of a gap in the data prior to the first observation. (Or, otherwise put, there would be no way to know how long a gap there is.) So again, this is what you want, right?

        Comment


        • #5
          Yes thats exactly what i wanted! thanks again. Just wanted to make sure

          Comment

          Working...
          X