Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data cleaning: getting rid of trailing constant cells

    Hi all!

    I have a dataset of monthly stock prices (489 variables, 466 observations). Please, see sample data below.

    The thing is that not all stocks are "alive" until the end of the dataset (corporate mergers, delistings, etc.). My data however keeps the price of these stocks (variables) constant after they have delisted, which messes with my analysis.

    Is there a way to make all these "trailing constants" into missing values in Stata?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 name int(wartsila11f utdpapermills)
    "1.1.1988"    .   .
    "1.2.1988"    .   .
    "1.3.1988"    .   .
    "1.4.1988"  231 142
    "1.5.1988"  226 142
    "1.6.1988"  229 163
    "1.7.1988"  219 170
    "1.8.1988"  222 182
    "1.9.1988"  195 168
    "1.10.1988" 190 160
    "1.11.1988" 187 169
    "1.12.1988" 197 169
    "1.1.1989"  211 165
    "1.2.1989"  239 173
    "1.3.1989"  231 173
    "1.4.1989"  320 178
    "1.5.1989"  326 179
    "1.6.1989"  350 170
    "1.7.1989"  316 171
    "1.8.1989"  340 171
    "1.9.1989"  348 167
    "1.10.1989" 370 153
    "1.11.1989" 348 149
    "1.12.1989" 345 138
    "1.1.1990"  380 142
    "1.2.1990"  365 220
    "1.3.1990"  455 220
    "1.4.1990"  435 180
    "1.5.1990"  426 145
    "1.6.1990"  440 167
    "1.7.1990"  430 132
    "1.8.1990"  420 146
    "1.9.1990"  405 138
    "1.10.1990" 375 140
    "1.11.1990" 365 130
    "1.12.1990" 382 126
    "1.1.1991"  387 136
    "1.2.1991"  376 136
    "1.3.1991"  376 136
    "1.4.1991"  376 136
    "1.5.1991"  376 136
    "1.6.1991"  376 136
    "1.7.1991"  376 136
    "1.8.1991"  376 136
    "1.9.1991"  376 136
    "1.10.1991" 376 136
    "1.11.1991" 376 136
    "1.12.1991" 376 136
    "1.1.1992"  376 136
    "1.2.1992"  376 136
    "1.3.1992"  376 136
    "1.4.1992"  376 136
    "1.5.1992"  376 136
    "1.6.1992"  376 136
    "1.7.1992"  376 136
    "1.8.1992"  376 136
    "1.9.1992"  376 136
    "1.10.1992" 376 136
    end

  • #2
    Code:
    gen date = daily(name, "DMY")
    assert missing(date) == missing(name)
    format date %td
    drop name
    
    foreach v of varlist wartsila11f utdpapermills {
        sort date
        gen int run = sum(`v' != `v'[_n-1])
        by run (date), sort: gen run_length = _N
        local n_runs = run[_N]
        by run (date): replace `v' = . if run == `n_runs' & _n > 1
        drop run*
    }
    Not that it matters substantively, but name is a strange choice of name for a date variable.

    Added: Of course, this just cuts off any tail that is a repetition. If the price is really the same for some period of time before delisting, then those repetitions of the same price, although they are real, will also be lost. I don't think there is any way to avoid this problem, unless there is a date of delisting available in the data.
    Last edited by Clyde Schechter; 25 Apr 2023, 14:10.

    Comment


    • #3
      Perfect! Thank you very much!

      Regarding your point in "Added" - yes that's a good point. It's something I'll just have to document and live with, since there's really no way to tell (though most likely the stock has delisted, if enough trailing days have the same price).

      Comment

      Working...
      X