replace missing values in a given variable (e.g., X)with the first valid value in the same variable (X)

Mahdie Raj

Join Date: Feb 2016

Posts: 48
#1

replace missing values in a given variable (e.g., X)with the first valid value in the same variable (X)

01 May 2018, 12:22

Hi everyone,
I was wondering to know how can I replace missing values within a variable (e.g., X) with the first valid value within the same variable (i.e., X).
Thanks,
Nader
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#2

01 May 2018, 12:31

It differs slightly, depending on whether X is a string or a numeric variable.

Code:

gen byte miss_X = missing(X) gen long obs_no = _n by miss_X( obs_no), sort, gen replacement = X[1] if !miss_X sort replacement replace X = replacement[1] if miss_X // USE THIS FOR NUMERIC VARIABLES ONLY replace X = replacement[_N] if miss_X // USE THIS FOR STRING VARIABLES ONLY

In the future, so that those who want to help you don't have to guess about your data, or write code for multiple contingencies, be sure to show an example of your data when you ask for help with code. And always use the -dataex- command to show example data.

If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Mahdie Raj

Join Date: Feb 2016

Posts: 48
#3

01 May 2018, 12:49

You are absolutely right. Following is my data (X) and what I am looking for (X_new).
clear
input county X X_new
1 . 2
1 . 2
1 2 2
1 . 2
2 . 1
2 . 1
2 1 1
end
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35664
#4

01 May 2018, 13:06

The example helps but doesn't answer the question of what you would do given two or more different values for X for each county,
Comment
Mahdie Raj

Join Date: Feb 2016

Posts: 48
#5

01 May 2018, 13:11

I wanna replace missing values in a variable (i.e., X) with the valid value within the same variable (X). My real data structure is much complicated that what I posted in my earlier email. I tried the following code in the real data, but it did not work.

bysort county (age_group): replace med_`vvv' = med_`vvv'[_n+1]
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#6

01 May 2018, 13:12

I believe that in #1 Mahdie said he or she wants the first of the non-missing values of X. That's why I coded #2 as I did.

Added: Crossed with #5. My observation here (and the code in #2) seems inconsistent with the attempt at solving this that is shown in #5. But since Mathdie says that didn't work (whatever is meant by that!--please read the FAQ!!!!), perhaps I am still on track.

Last edited by Clyde Schechter; 01 May 2018, 13:14.
Comment
Mahdie Raj

Join Date: Feb 2016

Posts: 48
#7

01 May 2018, 13:16

Yes, in my first post, I combined two questions. My next problem is replacing the missing values with the first valid value. Either one should have the same solution. I appreciate your advice.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#8

01 May 2018, 13:37

I don't actually see two questions in #1. I still read it as a request to replace missing values of X by the first non-missing value of X. The new wrinkle is that apparently this needs to be done within each county. That's just a minor change:

Code:

clear input county X X_new 1 . 2 1 . 2 1 2 2 1 . 2 2 . 1 2 . 1 2 1 1 end gen byte miss_X = missing(X) gen long obs_no = _n by county miss_X( obs_no), sort: gen replacement = X[1] if !miss_X by county (replacement), sort: replace X = replacement[1] if miss_X // USE THIS FOR NUMERIC VARIABLES ONLY sort obs_no

The values of X at the end of this code will agree with the hand-worked calculations of X_new provided in #3.

I'm still baffled by the code in #5, as I don't understand what the variable age_group, which doesn't even appear in the example, and has not been otherwise mentioned, has to do with it.

Added: The example data has only a single non-missing value of X in each county, and that value appears in the final observation for the county. The code I give here does not rely on either of these being true. There can be any number of non-missing values of X, and they (it) can appear anywhere in the order: whichever one comes first will be used to fill in the missing values. This is what I understand is wanted. If it's not, then clarification is needed.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35664

01 May 2018, 13:53

#5 does not answer #4. You're assuming, or implying, that there is at most one candidate for replacement value. Have you checked?

Also, we believe you that the real problem is much more complicated but as Clyde also points out in different but equivalent form we can't discern what those complications are if you don't tell us.

Here is conservative code that replaces missing values before the first non-missing value with that non-missing value and leaves all others unchanged. As Clyde does I have to construct an observation number, absent any information about a time or other individual identifier.

Code:

clear
input county X X_new
1 . 2
1 . 2
1 2 2
1 . 2
2 . 1
2 . 1
2 1 1
end

gen long obs_no = _n 

bysort county : egen first = min(cond(X < ., obs_no, .)) 

by county: egen wanted = total(cond(obs_no == first, X, .)) if obs_no <= first 

list, sepby(county) 

     +----------------------------------------------+
     | county   X   X_new   obs_no   first   wanted |
     |----------------------------------------------|
  1. |      1   .       2        1       3        2 |
  2. |      1   .       2        2       3        2 |
  3. |      1   2       2        3       3        2 |
  4. |      1   .       2        4       3        . |
     |----------------------------------------------|
  5. |      2   .       1        5       7        1 |
  6. |      2   .       1        6       7        1 |
  7. |      2   1       1        7       7        1 |
     +----------------------------------------------+

Comment

Mahdie Raj

Join Date: Feb 2016

Posts: 48
#10

01 May 2018, 14:53

Thanks Nick and Clyde.
Nick, I was wondering to know why case #4 still has a missing value?
Unfortunately, your code did not work thoroughly in my real data. Please let me give another example which reflects my real data better than previous one. I did not post this following example in my earlier posts because I thought the solution should be easy. I do apologize for any inconveniences.

clear
input county year sex Age_Group Median Median_new
1 2010 0 1 . 2
1 2010 1 2 . 2
1 2010 2 3 2 2
1 2015 0 4 . 3
1 2015 1 5 . 3
1 2015 2 6 3 3
2 2010 0 1 . 1
2 2010 1 2 1 1
2 2010 2 3 . 1
2 2015 0 4 . 5
2 2015 1 5 . 5
2 2015 2 6 5 5

end

I am looking for Median_new.

Last edited by Mahdie Raj; 01 May 2018, 15:02.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#11

01 May 2018, 15:16

Code:

clear input county year sex Age_Group Median Median_new 1 2010 0 1 . 2 1 2010 1 2 . 2 1 2010 2 3 2 2 1 2015 0 4 . 3 1 2015 1 5 . 3 1 2015 2 6 3 3 2 2010 0 1 . 1 2 2010 1 2 1 1 2 2010 2 3 . 1 2 2015 0 4 . 5 2 2015 1 5 . 5 2 2015 2 6 5 5 end gen byte miss_Median = missing(Median) by county year miss_Median( year), sort: gen replacement = Median[1] if !miss_Median by county year (replacement), sort: replace Median = replacement[1] if miss_Median sort county year assert Median == Median_new

This works with your current example.

I did not post this following example in my earlier posts because I thought the solution should be easy.

And as a result, you, Nick, and I have all wasted a lot of time figuring out answers to the wrong questions. The fact of the matter is, if you don't know how to solve a problem yourself, then you don't actually know whether the solution will be simple or not. As you have seen, by posing an example that does not in fact reflect the problem you want to solve, nothing good is accomplished.

Always ask the real question and give data examples that represent your data well. Don't assume anything about the solution to a problem you haven't yourself solved.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35664
#12

01 May 2018, 16:56

I already answered the question in #10

I was wondering to know why case #4 still has a missing value?

with the explanation in #9

replaces missing values before the first non-missing value with that non-missing value and leaves all others unchanged.

I was underlining what Clyde has said repeatedly. You were not posing an unambiguous, sharp, clear question.

Further, Clyde underlined in #6 that "didn't work" by itself is not helpful as a report and reminded you to read the FAQ. You didn't do that, it seems, otherwise you would not have again made the same comment in #10.

https://www.statalist.org/forums/help#stata

Never say just that something "doesn't work" or "didn't work", but explain precisely in what sense you didn't get what you wanted.

Last edited by Nick Cox; 01 May 2018, 17:33.
1 like
Comment

Announcement