Replace missing value in one variable with value in another variable

Richard Williams

Join Date: Apr 2014

Posts: 4982
#16

01 Aug 2014, 09:26

Roberto, how did you do that? Did you retype the data, or were you able to somehow cut and paste from Owen's listing? It is great when people list their data, but it is even better if you can easily get the sample data into Stata.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#17

01 Aug 2014, 10:17

Originally posted by Richard Williams

Roberto, how did you do that? Did you retype the data, or were you able to somehow cut and paste from Owen's listing?

Unfortunately, the old-fashioned way: copy/paste/edit. With a good text editor and a bit of skill, a case like the previous one should be easy enough. I'm thinking of an editor like Vim or Emacs, for example.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
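
As a minimal sketch of item 3, a small data fragment can be typed with input so that others can paste it straight into Stata. The variable names (id, v1-v3) and all values here are hypothetical and invented purely for illustration.

```stata
* Hypothetical data fragment; names and values are made up for illustration
clear
input id v1 v2 v3
1 .50 .45 .
2 .40 .   .60
3 .55 .50 .
end

* Show the fragment in a compact form that is easy to copy into a post
list, clean noobs
```

A period stands for a numeric missing value, so the empty cells survive the round trip.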
Owen Keating

Join Date: Jul 2014

Posts: 15
#18

01 Aug 2014, 15:20

Thanks Clyde, I tried the code with modified variable names but have still had no luck. I think that is because the code tells Stata to replace any missing values in the first set of variables with data from the last three variables. This ensures that the data are complete for the first four variables, but because there are still empty cells in the final three variables, Stata is still not keen on letting me run a regression.

As Roberto pointed out, I need to do a lot more clarifying so I will try to better explain what I am doing:

I am trying to create a benchmark statistical model for forecasting the results of football matches. The purpose of this model is to let me test bookmakers' expertise by checking whether they process information more or less efficiently than the benchmark statistical model.
The model follows in the footsteps of Professor John Goddard, who used win ratios over the previous 24 months and a number of other covariates, all of which were found to be statistically significant. Prof. Goddard uses the following terminology:
P_{i,y,s}^{d}: the win ratio for the games played by home team i, within:

y=0 i.e. 0-12 months from the current match
y=1 i.e. 12-24 months from the current match

s=0 i.e. matches from the current season
s=1 i.e. matches from one season ago
s=2 i.e. matches from two seasons ago.

d=0: i.e. for matches played within the current division
d=-1: i.e. for matches played within one division lower
d=-2: i.e. for matches played within two divisions lower
d=1: i.e. for matches played within one division higher
d=2: i.e. for matches played within two divisions higher

Prof. Goddard used a huge amount of data across 15 seasons of the English football leagues. I understand why it is important to account for the difference between divisions, but it is a lot of work and I thought it would be too challenging to get the statistics right. So I decided to work with fewer seasons and tried to make my life a little easier by focusing on teams that were in the Premier League during all three seasons. I got the data for all the remaining variables and finally got to computing the win ratios, but then I discovered that two of my teams had been relegated for one year, during the 2008/09 season (my focus is on matches from the 2010/11, 2011/12 and 2012/13 seasons).

When I compute these ratios for my observations from season 2010/11, I need to go back as far as 2008/09 to compute the win ratio for the 12-24 months before the current match. Since two of my teams were in a lower division during that period, their win ratios for that time frame have to go under a different variable, namely one that records that the ratio is based on performance in a lower league. Since Stata does not allow variable names written the way they appear above, I use the following naming scheme:

PH00d0 - where P refers to the win ratio, H refers to team i … the home team, 00 refers to y=0, s=0, and d0 refers to the ratios being for matches played in the current division.
The rest of the variables follow suit.
Then I have PH01d1 - where again P refers to the win ratio, H refers to team i … the home team, 01 refers to y=0, s=1, and d1 is supposed to be d=-1 but I cannot include the negative sign so I am just going ahead with d1 for now… it refers to the ratios being for matches played in one division lower.
Unlike in Prof. Goddard's study, I do not have any observations where the matches were played two divisions lower, or one (or two) divisions higher. I am only dealing with d=0 and d=-1.

Looking back at the data in Roberto's post with the table (sorry for the trouble, Roberto): I had renamed the variables before posting. Names like PH00d0 are very clear to me because I have been looking at this data for long enough, and I thought renaming them might make things clearer for you guys, but of course that didn't work. So I will revert to the actual names that I am working with. Anyhow, in the data posted, each observation has 4 complete variables and 3 incomplete ones. The incomplete variables are not actually missing: they are empty because no matches were played under those particular conditions. Take observation 17: the 3rd and 4th variables are empty because, one season before the current match, that team was playing in a lower division, so the win ratios for that time frame have to be placed in columns 5 and 6, whose names end in d1 to show that.

Now, I am wondering whether it would be accurate to place a zero in the empty cells for observation 17; after all, no matches were played under the circumstances of variables 3, 4 and 7. But I am worried that I may be oversimplifying the matter with this approach.

At the end of the day, my goal is to estimate a coefficient for all of the 7 variables. I tried to post what the results table is supposed to look like earlier but I don't think that worked very well. It might be easier to refer to the following paper - Odds-setters as forecasters: The case of English Football (2005) by D. Forrest, J. Goddard and R. Simmons. [table 1 on page 556].

Thanks again and I hope this is a little bit clearer.
Richard Williams

Join Date: Apr 2014

Posts: 4982
#19

01 Aug 2014, 17:54

This is a tough problem! At least for me.

Anyway, you said every case has values for exactly 4 variables, right? So would it make sense just to make their 4 values correspond to x1-x4, i.e. cut this down to 4 variables with complete data rather than 7 variables where everybody has missing values? Perhaps include some kind of indicator of what time period the variable or variables is from?

By way of analogy -- in a questionnaire, people may be asked if they are married. If yes, they may get asked about their relationship with their spouse; if no, they may skip to a question about whether they are cohabiting. If yes, they may get asked a virtually identical question about their partner. Even though these may be questions 7 and 22 in the questionnaire, the researcher may wind up combining the two and treating them as a single question.

I don't know if that makes substantive sense or not. But if it did, it would certainly be easier to move four values into x1-x4 than it would be to figure out how to deal with v1-v7 when nobody has complete data on any of them.
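
A rough sketch of what moving the values into x1-x4 might look like, with an indicator for where each value came from. The names PH11d0, PH12d0, PH11d1 and PH12d1 are assumed by analogy with the naming scheme described in this thread, the indicator names lower_s1 and lower_s2 are hypothetical, and the numbers are invented.

```stata
* Hypothetical fragment: each row has exactly four non-missing win ratios
clear
input id PH00d0 PH01d0 PH11d0 PH12d0 PH01d1 PH11d1 PH12d1
1 .50 .45 .40 .35 .   .   .
2 .40 .   .   .30 .60 .55 .
end

* Take the current-division value when it exists, else the lower-division one
generate x1 = PH00d0
generate x2 = cond(missing(PH01d0), PH01d1, PH01d0)
generate x3 = cond(missing(PH11d0), PH11d1, PH11d0)
generate x4 = cond(missing(PH12d0), PH12d1, PH12d0)

* Indicators (hypothetical names): 1 if the value came from the lower division
generate byte lower_s1 = missing(PH01d0)
generate byte lower_s2 = missing(PH12d0)

list id x1-x4 lower_s1 lower_s2, clean noobs
```

Whether pooling the two divisions into one coefficient is substantively sensible is a separate question; the sketch only shows the mechanics.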

Owen Keating

Join Date: Jul 2014

Posts: 15
#20

02 Aug 2014, 04:34

Thanks Richard. I know it's a pretty tough thing to tackle; that's why I tried to avoid the problem completely, but I found myself in it anyway. I could eliminate the two teams that are causing this dilemma, but then my sample shrinks to around 300-something observations, and with the number of variables I have this really isn't a route I want to take.
As for combining the data from 7 variables into 4: I believe Clyde suggested that with the code he posted earlier. I agree that would be easier, but the issue is that my first four variables would be estimated with data that does not belong to them, so my estimates would be inaccurate.

I have been thinking about your suggestion to include some kind of indicator for the time period that the variable is from. I was first thinking of converting the 7 variables to dummy variables, where 4 of the 7 would have a 1 and the other 3 a 0, and then replacing each 1 with the actual win ratio computed for that variable. But would that be the same as simply replacing all empty cells with a 0? (I think both approaches will let me generate the 7 coefficients that I need; the second method is simpler, but I am worried about its accuracy.)

If I place a zero in an empty cell, e.g. for variable PH01d1, that would imply that the team did not win any games in the past 0-12 months, from games in the previous season, in one division lower. The issue I see is that this suggests the team did play one or more games in the lower division but did not win any. However, if the cell for that variable was empty, it was empty because the team did not play any games in that division at all.

So, my question is: if I used dummy variables first, and then added a command to replace each 1 with the actual win ratio, would that resolve this matter? I am inclined to think that it would, because a 1 would imply that the team played games in 4 of the 7 circumstances represented by the 7 variables, and replacing the 1s with the actual win ratios would then let me get the estimates for my coefficients. The 0s would imply that the team did not play in such circumstances, which is what I am trying to say at the moment via the empty cells. What do you think?
Owen Keating

Join Date: Jul 2014

Posts: 15
#21

02 Aug 2014, 05:03

It just occurred to me that using dummy variables and then telling Stata to replace the 1s with the actual win ratios would not work: I have different win ratios for different observations, and Stata would not be able to tell which one belongs to which observation because all of the non-empty cells would have 1 in them.

Back to square one. Unless I am wrong in thinking that replacing all empty cells with a zero is different from having the dummy variable, which would also place a 0 wherever there is an empty cell. Could anyone shed some light on this? Would putting zeros in place of all the empty cells influence the estimation of the coefficients for those variables and bias the results?

Last edited by Owen Keating; 02 Aug 2014, 05:41.
Richard Williams

Join Date: Apr 2014

Posts: 4982
#22

02 Aug 2014, 07:21

I would have my first four variables being estimated with data that does not belong to them

I have to admit I have sort of lost track of everything. Hopefully Clyde will jump back in!

But I don't understand this. People have values on 4 of the 7 variables, right? These values really really do belong to them, right? So how do these values suddenly not belong to them if we shift everything over into the first 4 spots?

If you really want to go with a 7 variable strategy, check out p. 5 of http://www3.nd.edu/~rwilliam/xsoc63993/l12.pdf. As I understand it, values are missing, not because somebody failed to record them, but because they simply don't exist. So, a possible strategy is to fill in zeros for them, and then have dummy variable indicators that showed you did that. e.g. you would have md1 to md7, each one coded 1 if you plugged in a 0 and 0 otherwise. As my handout explains, this can be a bad strategy when values are missing because they weren't recorded, but can be a good strategy if the values never existed in the first place. (e.g. father's income is in the model but the father is dead).
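
As a minimal sketch of that strategy (not taken from the handout): fill each non-existent value with 0 and record that you did so in a companion dummy. The four PH names not spelled out earlier in the thread are assumed from the naming scheme, the md_ prefix is my own choice rather than md1-md7, and the values are invented.

```stata
* Hypothetical fragment with "does not exist" cells stored as missing
clear
input id PH00d0 PH01d0 PH11d0 PH12d0 PH01d1 PH11d1 PH12d1
1 .50 .45 .40 .35 .   .   .
2 .40 .   .   .30 .60 .55 .
end

* For each variable with gaps: flag the gap first, then plug in a zero
foreach v of varlist PH01d0 PH11d0 PH12d0 PH01d1 PH11d1 PH12d1 {
    generate byte md_`v' = missing(`v')
    replace `v' = 0 if missing(`v')
}

list id PH01d0 md_PH01d0 PH01d1 md_PH01d1, clean noobs

* The model would then include both sets of variables, along the lines of
*   oprobit outcome PH* md_*
* where outcome is a hypothetical name for the dependent variable.
```

The order inside the loop matters: the dummy has to be created before the missing value is overwritten, otherwise the information about which zeros were plugged in is lost.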

Owen Keating

Join Date: Jul 2014

Posts: 15
#23

02 Aug 2014, 10:50

Thank you Richard! Indeed the values are not missing, they simply do not exist. Basically, for all of the observations we can compute the win ratio within 0-12 months, in the current season, i.e. variable PH00d0. However, when we encounter a team that had been moved down by one division in the previous season and compute its win ratio for matches played within 0-12 months, in the previous season, it is no longer OK to place that win ratio under the variable P_{H,0,1}^{0} (PH01d0), because the team was in a lower league in the previous season. (MS Word let me write the variable more appropriately; visually the MS Word version of my variable is far more appealing, so I will include it here.)
So you'd leave P_{H,0,1}^{0} (PH01d0) blank and would instead put your calculated win ratio under the variable P_{H,0,1}^{-1} (PH01d1). But now you find yourself with a variable that is blank, not because the data for it is missing but because the data you have does not fall within the specification of that variable. And this makes sense, because you cannot have a win ratio for both P_{H,0,1}^{0} (PH01d0) and P_{H,0,1}^{-1} (PH01d1): a team would have played either in the current division, i.e. …d0, or in a division lower, i.e. …d1, not in both.

I have looked at the document you proposed. It's brilliant! Exactly what I needed! In fact even better because it is basically directing me towards proof that this mechanism is an acceptable approach in cases where the values are not missing, they simply do not exist. Honestly, I cannot thank you enough!
I'll keep you posted on the outcomes of this approach, fingers crossed it all works out perfectly! =)
Richard Williams

Join Date: Apr 2014

Posts: 4982
#24

02 Aug 2014, 13:15

I am not so sure that this is the right way, but I will keep my fingers crossed too!

Maybe you said this earlier, but is there a separate dependent variable floating around somewhere? The approach I suggested only works for missing data on independent variables. If the 7 variables include your dependent variable, you still have the problem that the dependent variable will be missing for many cases.

Owen Keating

Join Date: Jul 2014

Posts: 15
#25

03 Aug 2014, 12:52

Hi Richard, it worked! As for the dependent variable - that's the football match outcome, so it is not one of the win ratio variables. The win ratio variables are all independent variables.

If you don't mind I have another question. I have made dummy variables for each of the win ratio variables that had empty cells. But when I run my regression, do I need to include these dummy variables?

Basically, with the dummies the regression would be along the following lines (for now I exclude the rest of the variables, otherwise it'll be very lengthy):
F = P_{i,0,0}^{0} + P_{i,0,1}^{0} + P_{i,1,1}^{0} + P_{i,1,2}^{0} + P_{i,0,1}^{-1} + P_{i,1,1}^{-1} + P_{i,1,2}^{-1}
+ DP_{i,0,1}^{0} + DP_{i,1,1}^{0} + DP_{i,1,2}^{0} + DP_{i,0,1}^{-1} + DP_{i,1,1}^{-1} + DP_{i,1,2}^{-1}

Now if you have an observation where the empty cells are for variables P_{i,0,1}^{0}, P_{i,1,1}^{0} and P_{i,1,2}^{-1},
then your regression would look like this:

F = P_{i,0,0}^{0} + 0 \cdot P_{i,0,1}^{0} + 0 \cdot P_{i,1,1}^{0} + P_{i,1,2}^{0} + P_{i,0,1}^{-1} + P_{i,1,1}^{-1} + 0 \cdot P_{i,1,2}^{-1}
+ 1 \cdot DP_{i,0,1}^{0} + 1 \cdot DP_{i,1,1}^{0} + 0 \cdot DP_{i,1,2}^{0} + 0 \cdot DP_{i,0,1}^{-1} + 0 \cdot DP_{i,1,1}^{-1} + 1 \cdot DP_{i,1,2}^{-1}

That way the dummy variables indicate precisely which variables had empty cells that were then replaced by zeros and which weren't. That's also probably helpful because it is possible for a win ratio to have actually been computed and for it to be zero. The dummy variable makes it possible to distinguish between such cases and ones where the researcher has decided to replace an empty cell, where a value simply doesn't exist, with a zero.

Also, that way Stata is able to estimate a coefficient for all the win ratio variables, just like the authors of the paper did.

I would imagine the researchers who first did this would also have had to do something along these lines to tackle the very same issue. But the odd thing is, if they did, they should have obtained estimates for the coefficients of the dummy variables as well, yet in their results table there is no sign of any dummy variables relating to the above win ratio variables. Do you find that odd too?

What if the above strategy was simply utilised to ensure that there were no empty cells for the observations so that a regression could be carried out … but when conducting the regression, the newly generated dummy variables were not actually included in the regression? Do you think that's possible or is that not a correct mechanism that can be employed when doing a regression?
Richard Williams

Join Date: Apr 2014

Posts: 4982
#26

03 Aug 2014, 12:59

Do you mean the missing data dummies you computed for the cases where data was missing and you filled in a zero? If so, yes, they should be included in the regression. Otherwise there would be no point in computing them. If you mean something else, let me know.

Owen Keating

Join Date: Jul 2014

Posts: 15
#27

03 Aug 2014, 14:02

That's exactly what I meant. I ran the regression with them included, but now the issue of perfect collinearity comes up. The culprit most likely lies within the newly generated dummies. I'll try a couple of things and see if I can fix that.

Thank you once again!
Richard Williams

Join Date: Apr 2014

Posts: 4982
#28

03 Aug 2014, 14:10

You may not need all 7. These things come in blocks, right? So, if you are missing on x1, you will be missing on x2 and x3 too, right? Then those three MD dummies would be perfectly correlated. Or something like that. So rather than have an MD variable for every observed variable, maybe you only need one for each block of variables, e.g. the x1-x3 block of missing, the x5-x7 block of missing, or whatever it is.
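
A rough sketch of the block idea, under the assumption that the previous-season win ratios always go missing together. The variable names are assumed from the naming scheme in this thread, md_lower is a hypothetical name for the single block-level dummy, and the values are invented.

```stata
* Hypothetical fragment: the two previous-season ratios move as a block
clear
input id PH01d0 PH11d0 PH01d1 PH11d1
1 .45 .40 .   .
2 .   .   .60 .55
3 .50 .35 .   .
end

* One missing-data dummy per variable, as in the per-variable approach
foreach v of varlist PH01d0 PH11d0 PH01d1 PH11d1 {
    generate byte md_`v' = missing(`v')
}

* Within a block the dummies are identical, so only one carries information
assert md_PH01d0 == md_PH11d0
assert md_PH01d1 == md_PH11d1

* In this fragment the d1 dummies are the mirror image of the d0 dummies,
* so together with the constant they are perfectly collinear as well
assert md_PH01d1 == 1 - md_PH01d0

* Keep a single dummy for the whole block and drop the redundant ones
generate byte md_lower = md_PH01d0
drop md_PH*
```

If any of the assert lines fails on the real data, the variables do not move as a block after all, and that would be worth knowing before dropping anything.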

Owen Keating

Join Date: Jul 2014

Posts: 15
#29

03 Aug 2014, 16:01

I'm back with yet another question. 10 of my 12 dummy variables are being omitted due to collinearity. I tried to test for multicollinearity with the - vif - command but had no luck; for some reason it's not possible to do this after an oprobit regression. I've thought about the dummy variable trap, but it doesn't quite apply here because the 12 dummy variables are not set up for 12 or 13 different categories: each dummy variable relates to a particular win ratio variable and indicates whether that variable had any empty cells or not. So I don't think I am dealing with a dummy variable trap. Still, I tried dropping one of these dummy variables and repeated the regression, but no luck: Stata still eliminates all the dummy variables except 2. Obtaining more data could be a possible solution, as that can lead to more precise estimates of the parameters, but I have too many intricate variables; adding more observations would mean dealing with all of these variables all over again, and time-wise I cannot afford to do that.
Any suggestions on what could be the problem?
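
One possible way to see which dummies are redundant without running the model at all is the programmer's command _rmcoll, which flags collinear variables in a list. This is only a sketch on invented data; md_a, md_b, md_c and x are hypothetical names.

```stata
* Hypothetical fragment: md_b duplicates md_a, and md_c equals 1 - md_a
clear
input md_a md_b md_c x
0 0 1 .2
1 1 0 .5
0 0 1 .7
1 1 0 .1
end

* _rmcoll checks the list (plus a constant) for perfect collinearity
_rmcoll md_a md_b md_c x

* Omitted variables are flagged in the returned list
display "`r(varlist)'"
```

Here md_b repeats md_a, and md_c is the constant minus md_a, so two of the three dummies carry no separate information. A plain tabulate of one dummy against another would show the same overlap.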
Owen Keating

Join Date: Jul 2014

Posts: 15
#30

03 Aug 2014, 16:09

Richard, I just noticed your post about the blocks; that's the first thing that should have occurred to me. Acting on it right away!