Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • replace missing values in a given variable (e.g., X)with the first valid value in the same variable (X)

    Hi everyone,
    I was wondering to know how can I replace missing values within a variable (e.g., X) with the first valid value within the same variable (i.e., X).
    Thanks,
    Nader

  • #2
    It differs slightly, depending on whether X is a string or a numeric variable.
    Code:
    gen byte miss_X = missing(X)
    gen long obs_no = _n
    by miss_X( obs_no), sort, gen replacement = X[1] if !miss_X
    sort replacement
    replace X = replacement[1] if miss_X // USE THIS FOR NUMERIC VARIABLES ONLY
    replace X = replacement[_N] if miss_X // USE THIS FOR STRING VARIABLES ONLY
    In the future, so that those who want to help you don't have to guess about your data, or write code for multiple contingencies, be sure to show an example of your data when you ask for help with code. And always use the -dataex- command to show example data.

    If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      You are absolutely right. Following is my data (X) and what I am looking for (X_new).
      clear
      input county X X_new
      1 . 2
      1 . 2
      1 2 2
      1 . 2
      2 . 1
      2 . 1
      2 1 1
      end

      Comment


      • #4
        The example helps but doesn't answer the question of what you would do given two or more different values for X for each county,

        Comment


        • #5
          I wanna replace missing values in a variable (i.e., X) with the valid value within the same variable (X). My real data structure is much complicated that what I posted in my earlier email. I tried the following code in the real data, but it did not work.

          bysort county (age_group): replace med_`vvv' = med_`vvv'[_n+1]

          Comment


          • #6
            I believe that in #1 Mahdie said he or she wants the first of the non-missing values of X. That's why I coded #2 as I did.

            Added: Crossed with #5. My observation here (and the code in #2) seems inconsistent with the attempt at solving this that is shown in #5. But since Mathdie says that didn't work (whatever is meant by that!--please read the FAQ!!!!), perhaps I am still on track.
            Last edited by Clyde Schechter; 01 May 2018, 13:14.

            Comment


            • #7
              Yes, in my first post, I combined two questions. My next problem is replacing the missing values with the first valid value. Either one should have the same solution. I appreciate your advice.

              Comment


              • #8
                I don't actually see two questions in #1. I still read it as a request to replace missing values of X by the first non-missing value of X. The new wrinkle is that apparently this needs to be done within each county. That's just a minor change:
                Code:
                clear
                input county X X_new
                1 . 2
                1 . 2
                1 2 2
                1 . 2
                2 . 1
                2 . 1
                2 1 1
                end
                
                gen byte miss_X = missing(X)
                gen long obs_no = _n
                by county miss_X( obs_no), sort: gen replacement = X[1] if !miss_X
                by county (replacement), sort: replace X = replacement[1] if miss_X // USE THIS FOR NUMERIC VARIABLES ONLY
                sort obs_no
                The values of X at the end of this code will agree with the hand-worked calculations of X_new provided in #3.

                I'm still baffled by the code in #5, as I don't understand what the variable age_group, which doesn't even appear in the example, and has not been otherwise mentioned, has to do with it.

                Added: The example data has only a single non-missing value of X in each county, and that value appears in the final observation for the county. The code I give here does not rely on either of these being true. There can be any number of non-missing values of X, and they (it) can appear anywhere in the order: whichever one comes first will be used to fill in the missing values. This is what I understand is wanted. If it's not, then clarification is needed.

                Comment


                • #9
                  #5 does not answer #4. You're assuming, or implying, that there is at most one candidate for replacement value. Have you checked?

                  Also, we believe you that the real problem is much more complicated but as Clyde also points out in different but equivalent form we can't discern what those complications are if you don't tell us.

                  Here is conservative code that replaces missing values before the first non-missing value with that non-missing value and leaves all others unchanged. As Clyde does I have to construct an observation number, absent any information about a time or other individual identifier.


                  Code:
                  clear
                  input county X X_new
                  1 . 2
                  1 . 2
                  1 2 2
                  1 . 2
                  2 . 1
                  2 . 1
                  2 1 1
                  end
                  
                  gen long obs_no = _n 
                  
                  bysort county : egen first = min(cond(X < ., obs_no, .)) 
                  
                  by county: egen wanted = total(cond(obs_no == first, X, .)) if obs_no <= first 
                  
                  list, sepby(county) 
                  
                       +----------------------------------------------+
                       | county   X   X_new   obs_no   first   wanted |
                       |----------------------------------------------|
                    1. |      1   .       2        1       3        2 |
                    2. |      1   .       2        2       3        2 |
                    3. |      1   2       2        3       3        2 |
                    4. |      1   .       2        4       3        . |
                       |----------------------------------------------|
                    5. |      2   .       1        5       7        1 |
                    6. |      2   .       1        6       7        1 |
                    7. |      2   1       1        7       7        1 |
                       +----------------------------------------------+

                  Comment


                  • #10
                    Thanks Nick and Clyde.
                    Nick, I was wondering to know why case #4 still has a missing value?
                    Unfortunately, your code did not work thoroughly in my real data. Please let me give another example which reflects my real data better than previous one. I did not post this following example in my earlier posts because I thought the solution should be easy. I do apologize for any inconveniences.

                    clear
                    input county year sex Age_Group Median Median_new
                    1 2010 0 1 . 2
                    1 2010 1 2 . 2
                    1 2010 2 3 2 2
                    1 2015 0 4 . 3
                    1 2015 1 5 . 3
                    1 2015 2 6 3 3
                    2 2010 0 1 . 1
                    2 2010 1 2 1 1
                    2 2010 2 3 . 1
                    2 2015 0 4 . 5
                    2 2015 1 5 . 5
                    2 2015 2 6 5 5

                    end


                    I am looking for Median_new.



                    Last edited by Mahdie Raj; 01 May 2018, 15:02.

                    Comment


                    • #11
                      Code:
                      clear
                      input county year sex Age_Group Median Median_new
                      1 2010 0 1 . 2
                      1 2010 1 2 . 2
                      1 2010 2 3 2 2
                      1 2015 0 4 . 3
                      1 2015 1 5 . 3
                      1 2015 2 6 3 3
                      2 2010 0 1 . 1
                      2 2010 1 2 1 1
                      2 2010 2 3 . 1
                      2 2015 0 4 . 5
                      2 2015 1 5 . 5
                      2 2015 2 6 5 5
                      
                      end
                      
                      gen byte miss_Median = missing(Median)
                      by county year miss_Median( year), sort: gen replacement = Median[1] if !miss_Median
                      by county year (replacement), sort: replace Median = replacement[1] if miss_Median 
                      sort county year
                      
                      assert Median == Median_new
                      This works with your current example.

                      I did not post this following example in my earlier posts because I thought the solution should be easy.
                      And as a result, you, Nick, and I have all wasted a lot of time figuring out answers to the wrong questions. The fact of the matter is, if you don't know how to solve a problem yourself, then you don't actually know whether the solution will be simple or not. As you have seen, by posing an example that does not in fact reflect the problem you want to solve, nothing good is accomplished.

                      Always ask the real question and give data examples that represent your data well. Don't assume anything about the solution to a problem you haven't yourself solved.

                      Comment


                      • #12
                        I already answered the question in #10

                        I was wondering to know why case #4 still has a missing value?
                        with the explanation in #9

                        replaces missing values before the first non-missing value with that non-missing value and leaves all others unchanged.
                        I was underlining what Clyde has said repeatedly. You were not posing an unambiguous, sharp, clear question.

                        Further, Clyde underlined in #6 that "didn't work" by itself is not helpful as a report and reminded you to read the FAQ. You didn't do that, it seems, otherwise you would not have again made the same comment in #10.

                        https://www.statalist.org/forums/help#stata

                        Never say just that something "doesn't work" or "didn't work", but explain precisely in what sense you didn't get what you wanted.
                        Last edited by Nick Cox; 01 May 2018, 17:33.

                        Comment

                        Working...
                        X