
  • Identifying Missing Data & Interpolation of Panel Data

    Hello,

    I've created a dataset of 30 countries over 30 years for 29 different variables. For some of these countries, the 30-year time span has several missing data points, and I was wondering how I can interpolate these missing observations for each country. Also, is there a command to identify which countries have completely or nearly empty series for a specific variable? For example, if I had a dataset:
    country    year   selfemployed%   FDI(inflow %)
    Bermuda    2001   3%              25.16%
    Bermuda    2002   .               23.40%
    Bermuda    2003   .               6%
    Bahamas    2001   .               .
    Bahamas    2002   .               14%
    Bahamas    2003   .               20%
    Barbados   2001   14%             16%
    Barbados   2002   16.3%           5%
    Barbados   2003   16%             .
    Is there a command that summarizes which countries have missing data across all variables? I would like to be able to tell which variables I should delete, given that there are very few observations for multiple countries. I would also like to know how I would interpolate, for instance, the missing data point for FDI in Bermuda for 2003.

    Thank you!

  • #2
    The -ipolate- command will do interpolation. See -help ipolate- for details. But give some serious thought to whether interpolation is a reasonable way to deal with missing data for variables like these. Even if the implicit assumption that these variables grow (or shrink) linearly over time is reasonable, any type of single imputation necessarily understates the variance, which, in turn, leads to overstated precision (standard errors that are too small). The treatment of missing data is complicated. There are no good solutions; the trick is to find the least bad solution for your situation. https://statisticalhorizons.com/wp-c...aterials-1.pdf will acquaint you with some of the approaches that can be used and their pros and cons.
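
    As a purely illustrative sketch, assuming your FDI variable is numeric and stored under a hypothetical name such as fdi_inflow, within-country linear interpolation would look something like this:

    Code:
    * interpolate fdi_inflow over year separately within each country (hypothetical names)
    bysort country: ipolate fdi_inflow year, gen(fdi_inflow_ip)
    Adding the epolate option would also extrapolate beyond the first and last non-missing values, which is usually even harder to justify.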

    As for identifying countries with all values missing for a variable:

    Code:
    * missing sorts last, so if the first value within a country is missing, all of its values are missing
    by country (variable_of_interest), sort: gen byte all_missing = missing(variable_of_interest[1])
    Replace variable_of_interest with the name of the actual variable you want to check. Also note that this code works only if variable_of_interest is a numeric variable. The code creates the variable all_missing, which will be 1 in every observation for each country that has completely missing data for variable_of_interest, and 0 in every observation for each country that has any non-missing data for it. You can get a listing of the countries with all missing data by following that with -tab country if all_missing-.

    As for "nearly empty" or "very few observations" I'm afraid you'll have to make that more precise.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
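
    For instance (the variable names here are only guesses at what yours are called), the steps are just:

    Code:
    ssc install dataex
    dataex country year selfemployed fdi_inflow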



    • #3
      I first learned interpolation as a way of going beyond printed tables of logarithmic or other functions. There the underlying function varies not just smoothly but according to a mathematical rule. Also, printed tables were, or should have been, detailed enough that linear interpolation should have worked well over very short intervals. (If not, there are several fancier methods, although by some weird division of labour they almost never appear in statistics texts or courses.) This small skill of interpolating (mentally!) has long since become obsolete, like being able to use a slide rule or to punch (and correct) 80-column cards or paper tape for computer input. Nostalgia's not what it used to be.

      When applied to data, the uses of interpolation extended to

      * Occasionally, interpolating from irregularly spaced measurements to a regular grid when some method demands the latter. People usually didn't try to estimate more than about as many data points as are in the data.

      * Occasionally, filling in small gaps when the underlying variable can be thought to change smoothly, ideally for physical reasons, although clearly there can be exceptions with social or medical data. If someone was 42 in 2012 and 46 in 2016 but their age was missing in between, we are on safe ground and can and should interpolate. Interpolation might not be too bad an idea for, say, daily temperatures, but it would often be a bad idea even for rainfall. (In fact, with rainfall, missings would often really mean zero.)

      As Clyde points out, interpolation is often dubious otherwise, and I imagine he would agree that extrapolation is often even worse.

      In #1, FDI in Bermuda for 2003 is present and, for all I know, correct. It's perhaps sufficient to note that linear extrapolation would give -6% for FDI in Barbados in 2003. I take such a variable to be so volatile that the practical choices are either ignoring very gappy series (inevitably incurring some sampling bias thereby) or just using what you have.
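
      (That extrapolation just continues the 2001-to-2002 change for Barbados: 5% + (5% - 16%) = -6%.)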

      A salutary test for interpolation is to take what there is, temporarily change yet more of the data to missings, and then see how good interpolation is at restoring what you know. Even more obvious is to plot the data to be confident that interpolation really is working well -- or not to do it otherwise.
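
      A minimal sketch of that test, again assuming a numeric variable under the hypothetical name fdi_inflow and using -ipolate-:

      Code:
      * hide a random 20% of the known values, interpolate, and compare with the truth
      set seed 12345
      gen fdi_check = fdi_inflow
      replace fdi_check = . if !missing(fdi_inflow) & runiform() < 0.2
      bysort country: ipolate fdi_check year, gen(fdi_check_ip)
      gen error = fdi_check_ip - fdi_inflow if missing(fdi_check) & !missing(fdi_inflow)
      summarize error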

