Missing Values-- A Lot of Them

Monica White

Join Date: Jan 2017

Posts: 98
#1


21 Feb 2017, 17:53

Variable         |   Missing      Total   Percent Missing
-----------------+---------------------------------------
exchange         |   243,242    518,480             46.91
colonial         |       838    518,480              0.16
rivalry_code     |   113,166    518,480             21.83
alliance         |    38,283    518,480              7.38
diplomacy        |   445,518    518,480             85.93
no. of borders   |       718    518,480              0.14
contiguity       |   262,538    518,480             50.64
religion         |   408,850    518,480             78.86
conflict         |     3,220    518,480              0.62
signatory        |     1,940    518,480              0.37
election         |    89,573    518,480             17.28
polity           |    93,543    518,480             18.04
GDP              |    42,506    518,480              8.20
dyad_conflict    |       696    518,480              0.13
asylum_rate      |   483,615    518,480             93.28 !!!
-----------------+---------------------------------------

The unit of analysis is directed-dyad year (Country A- Country B Year, Country B-Country A Year).

This includes values for dyads for all countries and years from 2000 to 2013.

However, some of the independent variables cut off early: exchange in 2009 and rivalry in 2010. Contiguity (that is, "borders the country in the dyad") ends in 2006, so later years are missing (however, in this time period most borders have not changed, with the exception of a few nations, so there may be a way around it).

The diplomacy and religion variables only have values for every half decade, and I'm not sure if interpolating will work, except maybe for the religion variable (diplomacy is a dummy if a diplomat from a given country visited a country, and religion includes percentages, but the dataset also has population numbers). There are not too many variables that have values until 2013 except for the DV.

As if that were not bad enough, the asylum rate is the DV.

The DV concerns the granting of asylum to migrants from a sending state; asylum is granted or denied in the host state. This explains the large percentage of missing values, since individuals from every single country do not apply for asylum at first instance in a given host state in a given year. However, there is still an astronomically large number of missing values. Because it is a rate, I cannot fill in the missing values with zeros (something that other researchers have done with other migration variables when the unit of analysis is the directed-dyad year).

I have seen researchers use asylum data, but only within particular regions or for a few cases. Even more unsettling, this is for my thesis. Do I have to scrap this dependent variable entirely? I could *maybe* fill in the gaps if I add asylum applications that were appealed, but I would much rather keep it "clean" and include only first-instance applications. However, right now it looks like it has to be done away with...
Monica White

Join Date: Jan 2017

Posts: 98
#2

21 Feb 2017, 18:45

Also, my understanding is that this may be a terrible case of missing not at random (MNAR).
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

21 Feb 2017, 18:54

It seems to me that the observations in which your dependent variable is missing tell you nothing about the responsiveness of the host state to applicants from the sending state. To me, you have 34,865 observations (518,480 minus the 483,615 with a missing asylum rate) that are potentially of use for your modeling.
Monica White

Join Date: Jan 2017

Posts: 98
#4

22 Feb 2017, 09:02

Originally posted by William Lisowski:

It seems to me that the observations in which your dependent variable is missing tell you nothing about the responsiveness of the host state to applicants from the sending state. To me, you have 34,865 observations that are potentially of use for your modeling.

Hi Will,

Correct, but once I run a regression I get

Code:

Number of obs = 4041

and if I lag the DV, there are insufficient observations to run the regression.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

22 Feb 2017, 09:40

My implication was that you should create a dataset with the subset of your data for which your dependent variable is nonmissing and then explore the pattern of missing values within that subset, as you did for the whole dataset in post #1. Clearly you will still have problems with missing values in that subset that you will need to understand, and the analysis in post #1 tells you nothing about missing values in the subset of meaningful observations where a rate could be computed because someone from country A sought asylum in country B.
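A minimal sketch of that first step, run here on a few made-up rows rather than the real file. The variable names asylum_rate, gdp, and polity are assumed from the table in post #1, and the values are invented purely for illustration.

```stata
* Sketch only: toy data standing in for the real dyad-year file.
* Variable names are assumed from post #1; the values are invented.
clear
input float(asylum_rate gdp polity)
0.25 1200 .
.    1500 7
0.10 .    8
.    .    .
0.40 900  6
end

* Keep only the observations where the dependent variable is observed
keep if !missing(asylum_rate)

* Tabulate how much is still missing among the covariates in that subset
misstable summarize gdp polity

* Count complete cases, i.e. the rows a regression would actually use
egen byte nmiss = rowmiss(asylum_rate gdp polity)
count if nmiss == 0
```

The complete-case count is the number to compare against the "Number of obs" that a regression reports, since regress silently drops any row with a missing value in any model variable.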
Monica White

Join Date: Jan 2017

Posts: 98
#6

22 Feb 2017, 10:02

Originally posted by William Lisowski:

My implication was that you should create a dataset with the subset of your data for which your dependent variable is nonmissing and then explore the pattern of missing values within that subset, as you did for the whole dataset in post #1. Clearly you will still have problems with missing values in that subset that you will need to understand, and the analysis in post #1 tells you nothing about missing values in the subset of meaningful observations where a rate could be computed because someone from country A sought asylum in country B.

Thanks for your reply, Will. But I'm a bit confused. I may just be overthinking things, though.

You said, "...create a dataset with the subset of the data for which your DV is nonmissing..." So by this you mean: keep in the dataset only the dyad-year observations that have a value for the DV, and drop any observations where the DV is missing? And then do the following:

Variable         |   Missing      Total   Percent Missing
-----------------+---------------------------------------
exchange         |    11,803     34,865             33.85
colonial         |        91     34,865              0.26
rivalry          |     8,431     34,865             24.18
alliance         |     3,009     34,865              8.63
diplomacy        |    30,451     34,865             87.34
no of borders    |        82     34,865              0.24
contiguity       |    18,475     34,865             52.99
religion         |    27,814     34,865             79.78
conflict         |       159     34,865              0.46
signatory        |        82     34,865              0.24
election         |     3,196     34,865              9.17
polity           |       927     34,865              2.66
gdp              |     1,408     34,865              4.04
dyad conflict    |        82     34,865              0.24
asylum_rate      |         0     34,865              0.00
-----------------+---------------------------------------

Did I at least understand that part?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

23 Feb 2017, 15:15

First, apologies for my delay in responding.

Yes, you understood that part of my advice. What I was thinking was that, for the modeling approach you apparently propose, observations where the asylum rate is missing are those where the denominator would be zero - and you learn nothing about the propensity to grant or deny asylum in country B to migrants from country A if nobody migrated from country A to country B.

OK, though, your comment about "not missing at random" suggests that perhaps if country B is particularly inhospitable to possible migrants from country A, nobody in A will try and the asylum rate will be missing. One way around this is to construct a "two-part" or "hurdle" model (see the output of search hurdle in Stata). The idea is that you build one model of whether there were any migrants, and a second model of the asylum rate when there were migrants. In other words, there's an initial "hurdle" that has to be crossed (in this case to migrate from A to B) before an application for asylum can be submitted.

Hurdle models are fairly complicated, and I don't know whether the Stata commands work well with panel data (repeated measures of the same directed dyads; for that matter, there has to be some correlation, perhaps negative, between the results for migrants from A to B and from B to A). I think a good first step would be to proceed with a model of the asylum rate when it is not missing, because that will help you think through your data issues.
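To make the two-part idea concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: the names dyad, x, any_app, n_dec, and rate are hypothetical, the two parts are simply estimated separately, and fracreg (available in Stata 14 and later) is only one of several ways the second part could be fit. It does not address the panel structure beyond clustering on the dyad.

```stata
* Sketch only: simulated data; all variable names are hypothetical.
clear
set seed 12345
set obs 2000
generate dyad = ceil(_n/10)              // 200 dyads, 10 "years" each
generate x = rnormal()

* Part 1 outcome: did anyone from A apply in B at all?
generate byte any_app = runiform() < invlogit(-0.5 + 0.8*x)

* Part 2 outcome: recognition rate, defined only when there were applicants
generate n_dec = rpoisson(20) + 1 if any_app
generate rate = rbinomial(n_dec, invlogit(-1 + 0.5*x)) / n_dec if any_app

* Part 1: the hurdle (any applicants versus none)
logit any_app x, vce(cluster dyad)

* Part 2: the rate among dyad-years that cleared the hurdle
fracreg logit rate x if any_app, vce(cluster dyad)
```

Fitting only the second command corresponds to the "model of the asylum rate when it is not missing" suggested as a first step.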

However, some of the independent variables cut off in 2009 (exchange), 2010 (rivalry). Contiguity aka "borders the country in the dyad" ends in 2006, and thus are missing (However, in this time period most of the borders have not changed with the exception of a few nations, so there may be a way around it).

The diplomacy and religion variables only have values for every half decade, and I'm not sure if interpolating will work, except maybe for the religion variable (diplomacy is a dummy if a diplomat from a given country visited a country, and religion includes percentages, but the dataset also has population numbers). There are not too many variables that have values until 2013 except for the DV.

I agree that interpolating the religion variable (or variables; you say it includes percentages, but not of what) might be reasonable. I think you are going to have to give up on the diplomacy variable. I think carrying the contiguity data forward from its last available year and then replacing values as needed due to border changes will rescue that. If exchange and rivalry are important to your analysis, then you'll have to limit yourself to 2000-2009.
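As a small illustration of those two repairs, the sketch below carries a contiguity indicator forward within a dyad and linearly interpolates a religion percentage between two observed years. The variable names contig and relig_pct and the single toy dyad are assumptions, not the actual data.

```stata
* Sketch only: one toy dyad; contig and relig_pct are hypothetical names.
clear
input int(dyad year) byte contig float relig_pct
1 2000 1 40
1 2001 1 .
1 2002 1 .
1 2003 1 .
1 2004 1 .
1 2005 1 50
1 2006 1 .
1 2007 . .
1 2008 . .
end

* Carry the last observed contiguity value forward within each dyad
bysort dyad (year): replace contig = contig[_n-1] if missing(contig)

* Linearly interpolate the religion percentage between observed years
by dyad: ipolate relig_pct year, generate(relig_ip)

list year contig relig_pct relig_ip, noobs
```

Without the epolate option, ipolate leaves the years after the last observed value missing, so nothing is extrapolated past 2005 here. Known border changes would still have to be patched by hand with replace after the carry-forward.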

So ... your first step is the painful process of doing what you need to do to construct an analysis dataset taking care of the missing data one way (fixing it) or another (deleting those observations). Welcome to the world of real data, which is never like the nicely populated datasets you get in a "Quantitative Methods in [insert discipline here] 101" class. I cannot overstress the importance of this effort. In solving these problems, you are likely to learn more about your data just through the process of working with it.

Others here may have more to say, and certainly will correct me on anything I got wrong. Wow. Challenging stuff you're trying.

By the way, it would have been a good idea to have used CODE delimiters when posting the output you displayed in posts #1 and #6 rather than the ad hoc display you resorted to. You've done that in the past, and for that matter, in post #4 here. You'll find that the more you help others understand your problem, the more likely others are to want to help you solve your problem. And a clear presentation of data, code, or results is a particularly valued way of helping others.
Monica White

Join Date: Jan 2017

Posts: 98
#8

24 Feb 2017, 12:09

Hello William,

No need to apologize. Thank you for the detailed reply. It was very helpful, and it looks like I will have a busy weekend (and probably some interpolating questions to follow later!).

Regards,
MW
Monica White

Join Date: Jan 2017

Posts: 98
#9

26 Feb 2017, 18:26

Originally posted by William Lisowski:

First, apologies for my delay in responding.

Yes, you understood that part of my advice. What I was thinking was that, for the modeling approach you apparently propose, observations where the asylum rate is missing are those where the denominator would be zero - and you learn nothing about the propensity to grant or deny asylum in country B to migrants from country A if nobody migrated from country A to country B.

OK, though, your comment about "not missing at random" suggests that perhaps if country B is particularly inhospitable to possible migrants from country A, nobody in A will try and the asylum rate will be missing. One way around this is to construct a "two-part" or "hurdle" model (see the output of search hurdle in Stata). The idea is that you build one model of whether there were any migrants, and a second model of the asylum rate when there were migrants. In other words, there's an initial "hurdle" that has to be crossed (in this case to migrate from A to B) before an application for asylum can be submitted.

Hi Will,

As I was going over this post this evening, I realized I forgot to mention that I believe I was wrong in my assumption about the missing asylum numbers. It could very well be that for some countries, no one from country A applied for asylum in country B. However, I did notice that there were some missing values even for industrial countries (in other words, missing for a year at random, or at least seemingly at random), and thus the data might actually be missing at random (MAR). Instead of using a hurdle model, would this be a case for using weights? Specifically, inverse probability weights? I honestly just came across them in the literature on Saturday, so I really have no idea. What I remember is that you fit a logit model of missingness versus nonmissingness for each subject at each point in time, and that is somehow used to generate weights equal to the inverse of the probability of nonmissingness. Unfortunately, while I can repeat what I read, it's all Greek to me.
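For what it is worth, the recipe described above can be sketched in a few lines on simulated data. This is only a generic illustration of inverse probability weighting under a MAR assumption, not a recommendation for this dataset: the names y, gdp, observed, p_obs, and ipw are hypothetical, and versions designed for panel data are more involved.

```stata
* Sketch only: simulated data; all variable names are hypothetical.
clear
set seed 2017
set obs 1000
generate gdp = rnormal()
generate y = 1 + 0.5*gdp + rnormal()

* The outcome is more likely to be observed when gdp is high (MAR given gdp)
generate byte observed = runiform() < invlogit(0.5 + gdp)
replace y = . if !observed

* Step 1: model the probability that the outcome is observed
logit observed gdp
predict double p_obs, pr

* Step 2: weight each observed case by the inverse of that probability
generate double ipw = 1/p_obs if observed

* Step 3: weighted analysis using only the observed cases
regress y gdp [pweight = ipw]
```

The weights make the observed cases stand in for similar cases whose outcome was not observed, which only helps if the logit in step 1 captures what drives the missingness.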

By the way, thanks for the advice again. Managed to save some of the variables. But this DV is quite the pain.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

27 Feb 2017, 05:36

As I consider this problem further, I have the following question.

Does your data only have the asylum rate - the ratio of granted to (granted+denied)? Do you have (or can you obtain) the granted and denied counts (or granted and applied, where applied = granted+denied) separately?

Clearly I am hoping the answer is the latter.

Monica White

Join Date: Jan 2017
Posts: 98

#11

27 Feb 2017, 06:46

Originally posted by William Lisowski:

As I consider this problem further, I have the following question.

Does your data only have the asylum rate - the ratio of granted to (granted+denied)? Do you have (or can you obtain) the granted and denied counts (or granted and applied, where applied = granted+denied) separately?

Clearly I am hoping the answer is the latter.

The data that I obtained doesn't have the rate per se, because I was the one that calculated it.

The actual data has the numbers that applied during the year, were denied, and were pending. But it seems that some of those numbers are missing at random.

Here's an example snippet of the raw data:

Code:

* Raw counts for one dyad (asylum country, origin country), one row per year
clear
input str37 asylum str32 origin float year int totalpersonspendingstartyear long appliedduringyear int(recognized decisions_other rejected otherwiseclosed totaldecisions) long totalpersonspendingendyear
"Afghanistan" "Iran"       2000  0   3  1 0  1  1   3  0
"Afghanistan" "Iran"       2001  0 110 21 0 68 21 110  0
"Afghanistan" "Iran"       2002  0  25  0 0 13  0  13 12
"Afghanistan" "Iran"       2003 22  32  4 0  1  8  13 41
"Afghanistan" "Iran"       2004 41  59 11 0 25 36  72 28
"Afghanistan" "Iran"       2005 25   9  3 0  2 24  29  5
"Afghanistan" "Iran"       2006  5   9  2 0  0  7   9  5
"Afghanistan" "Iran"       2007  5   7  6 0  0  4  10  2
"Afghanistan" "Iran"       2008  2   5  0 0  0  3   3  4
"Afghanistan" "Iran"       2009  4  12  0 0  3  4   7  9
"Afghanistan" "Iran"       2010  9  17  0 0  2  3   5 21
"Afghanistan" "Iran"       2011  .   .  . .  .  .   .  .
"Afghanistan" "Iran"       2012  .   .  . .  .  .   .  .
"Afghanistan" "Iran"       2013 12   7  1 0  2  0   3 16
end


William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

27 Feb 2017, 09:32

In creating an asylum_rate you have implicitly chosen how the analysis should be done, and that may not have been the most appropriate choice. For example, while both 3 out of 4 and 75 out of 100 yield a rate of 0.75, they convey a very different level of precision. And because you can't define a rate for 0 out of 0, you wind up dealing with a missing value rather than the underlying fact that the denominator of the rate (applications? applications+pending?) is zero in that dyad for that year, which is actual information. I recommend thinking about your problem in terms of the basic data that you just presented. Of course what would be ideal is individual-level data, with the country dyad, year of application, year of disposition, and final disposition (recognized, rejected, other decision, other closure) but I assume your data source consists of summarized data at the dyad-year level. If indeed you have individual-level data, stop reading now and return to the more detailed data.
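The precision point can be checked with a few lines of arithmetic. The sketch below uses three made-up rows (3 of 4, 75 of 100, and 0 of 0) and, as an assumption, treats totaldecisions as the denominator; the thread leaves open which denominator was actually used.

```stata
* Sketch only: three invented rows; the choice of denominator is an assumption.
clear
input int(recognized totaldecisions)
3  4
75 100
0  0
end

* The rate is identical in the first two rows and undefined (missing) in the third
generate double rate = recognized/totaldecisions

* Binomial standard error of the rate: sqrt(p*(1-p)/n)
generate double se = sqrt(rate*(1 - rate)/totaldecisions)

list, noobs

* One possible way to model the counts directly instead of the rate (not run
* here; x is a hypothetical covariate):
*   glm recognized x, family(binomial totaldecisions)
```

The standard error is about 0.217 for 3 of 4 and about 0.043 for 75 of 100, a fivefold difference that the rate alone hides. The third row keeps the information that there were zero decisions, which a precomputed rate reduces to a missing value.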

At this point, we're far far from my level of expertise, given whatever I have that passes for expertise. And we're far from the subject on this topic, which is "Missing Values-- A Lot of Them". Statalist members who are interested in the sort of modeling you will be doing have probably not followed this discussion about missing values. I suggest you post a new topic with a subject much more descriptive of the current problem - something like "modeling counts in panel data" - with an explanation that the units in your panel are defined by the pair (origin, asylum) and you want to model the probability that the asylum country recognizes a claim originated by someone from the origin country. (That could all be stated much better.) In discussing the problem, it will be important to mention your concerns about missing values, and the possibility of correlation between the number of applicants and the likelihood of acceptance.

In creating a new topic, you'll want to include sample data to make it concrete to the reader.

What you posted above is a good example of your data for a relatively "clean" dyad, which generally has a nonzero number of applicants during each year. It would be helpful to include a second dyad that has a number of years for which the applicants (or whatever you used as a denominator in generating your rates) is zero. You can generate the snippet of data including both dyads using a single dataex command, in what follows supposing the second dyad is Fenwick and Florin:

Code:

dataex asylum origin year totalpersonspendingstartyear appliedduringyear ///
    recognized decisions_other rejected otherwiseclosed ///
    totaldecisions totalpersonspendingendyear ///
    if (asylum=="Afghanistan" & origin=="Iran") | (asylum=="Fenwick" & origin=="Florin")

Note that as is always the case in Stata, the line continuation characters (///) will only work when the four-line command is run from a do-file or from the do-file editor window. If you enter a long command into the command window, do not attempt to break it into separate lines. By preparing your dataex this way, you will be able to include everything between the [code] and [/code] in your post.

Wow. I'm sure there's more to say, but I've run out of words at the moment. But I think you should be able to come up with a reasonable post on a new topic that might get the attention of the experts on this list.