
  • Using Stata to identify data errors

    I have a large and messy dataset of the form
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 Product byte(Year FirmA FirmB FirmC)
    "P"  1  .  0 19
    "P"  2  . 26 12
    "P"  3  8 14 12
    "P"  4  . 12  0
    "P"  5  0 12  .
    "P"  6 12 33  .
    "P"  7 18 22  .
    "P"  8 12 13  .
    "P"  9 16 17  .
    "P" 10 12 13  .
    "P" 11  9 18  .
    "P" 12 10 15  .
    "P" 13 11 11  .
    "P" 14 12 27  .
    "P" 15 13 40  .
    "P" 16 11 31  .
    "P" 17 15 23  .
    "P" 18 17 30  .
    "P" 19 18 20  .
    "P" 20  8 21  .
    "P" 21 13 23  .
    "P" 22 17 25  .
    "P" 23 21 23  .
    "P" 24 27  .  .
    end
    ...in which there are many more products and firms. I know from other information that Firm C discontinued production in year 4, so its subsequent nulls are legitimate. But I also know that the nulls and zeroes for Firms A and B are data errors. Are there any strategies to identify such anomalous values, for example by looking for nulls or zeroes that are adjacent to nonzero values? Perhaps by excluding data runs?

    TIA!

  • #2
    See this FAQ by Nick Cox and Vince Wiggins on how to identify runs in a time series.

    http://www.stata.com/support/faqs/da...-observations/


    Alternatively, install tsspell from SSC and read its help file:

    Code:
    ssc install tsspell
    help tsspell
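    You then just have to check conditions such as the following (here runA stands for whatever run variable you create for FirmA):

    Code:
    list if runA < 2 & FirmA == .
    list if runA < 2 & FirmA == 0
    and so on for the other firms.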
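
    For what it is worth, here is a minimal sketch of one possible rule in plain Stata, without tsspell: flag a null or zero whenever a positive value appears in a later year of the same product-firm series. This is only my reading of "adjacent to nonzero values", not a method taken from the FAQ, and it assumes the dataex example from #1 is in memory. The variable names firm, qty, valid, negyear, later_valid and suspect are my own.

    ```stata
    * Assumes the -dataex- example from #1 is in memory.
    * All new variable names below are illustrative, not from the original post.

    * One observation per product-firm-year, so every firm is treated alike
    reshape long Firm, i(Product Year) j(firm) string
    rename Firm qty

    * A "valid" value is positive and nonmissing
    * (missing sorts above every number, hence the explicit qty < .)
    gen byte valid = qty > 0 & qty < .

    * Running count of valid values, walking from the last year backwards
    gen negyear = -Year
    bysort Product firm (negyear): gen later_valid = sum(valid)

    * Suspect: null or zero now, but a valid value appears in a later year
    gen byte suspect = !valid & later_valid > 0

    sort Product firm Year
    list Product firm Year qty if suspect, sepby(firm)
    ```

    On the example data this should flag years 1, 2, 4 and 5 for Firm A and year 1 for Firm B, and leave Firm C alone, because nothing positive follows its zero in year 4. It will not flag the null for Firm B in year 24: a trailing null looks the same as a discontinuation, so those cases still need outside information.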



    • #3
      Thank you!
