Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop variables for missing missing values

    Hi all,
    I hope this email finds you well during these hard times.

    I am quite new in this platform and don't know quite how to use it, but i have been reading and learning a lot from this platform, by reading your comments to students in solving stata problems. Therefore, i decided to write with the hope that you could help me in resolving once coding problem.

    I am trying to create a code which would drop the whole series of daily stock return observations (aaa bbb or ccc), if more than 50% of the daily returns are missing or are equal to 0 in any year (2010 or 2011 or 2012…). My data is consisted of daily stock returns for the period from january 2010 until may 2020.

    I tried one code that I found in the forum that suggested:

    glo p = 0.5
    * Loop over variables
    foreach var of varlist * {
    count if missing(`var')
    if (r(N)/_N) >= $p drop `var'
    }

    However this code deletes the variables that have more than 50% missing in the whole period (10 years). I would like to delete the whole series if they have missing values or return=0 in more than 50% of the cases in any One Year perid; no matter which year it is (2010, 2011 or 2012).

    I tried the code:

    foreach var of varlist * {
    bysort year: count if missing(`var')
    if r(N) >=128 drop `var'
    }

    However, this code even though counts the missing observations in each year for each company. It doesn’t drop the series if the missing values are more than 128 (50% of the trading days) in one of the years. My assumption is that counted variables are not stored, as in the first example r(N), but in some other way.

    I hope that you will find time to answer my question. Stay safe!

    Thank you in advance,
    Best regards,
    Artan

  • #2
    Hi Artan,
    It would have been helpful if you post a sample of the data using
    Code:
    datex
    Not withstanding that I am not able to see your data, I feel this could work
    Code:
    glo p = 0.5
    * Loop over variables
    foreach var of varlist * {
    bysort year: count if missing(`var')
    if (r(N)/_N) >= $p drop `var'
    }
    Mattia

    Comment


    • #3

      Hi Mattia, Thank you very much for your reply. Unfortunately the code you showed me doesn't work in my dateset. Attached i am attaching a portion of the data ( datasample.csv), since unfortunately i couldn't (not familiar) with the command datex. I hope this will help. Artan

      Comment


      • #4
        HI Mattia,

        In case you werent able to access the file in csv format. I am attaching a portion of the dataset as a picture.

        date1 year aaa bbb ccc
        04jan2010 2010 -3.036877 .5086856 .
        05jan2010 2010 2.5584 0 .
        06jan2010 2010 -.5061111 . .
        07jan2010 2010 -1.172597 0 .
        08jan2010 2010 -5.078258 2.022372 .
        11jan2010 2010 .665268 3.128467 .
        12jan2010 2010 -2.691656 -3.819891 .
        13jan2010 2010 -5.138979 -1.657924 .
        14jan2010 2010 .3242554 2.517216 .
        15jan2010 2010 -.1295305 -1.001068 .
        18jan2010 2010 .948019 -1.962829 .
        19jan2010 2010 3.170322 -1.440627 .
        20jan2010 2010 3.490402 .4755987 .
        21jan2010 2010 1.23675 -.4733475 -0.355


        Unfortunately, the code you suggested doesnt solve the issue. I think the problem is how it saves the counting through sequence r(N), which seems like it is storing only the last sequence of the last variable. Hope you will be able to help.

        Artan

        Comment


        • #5
          The by: prefix command is no use here as only the last returned value is accessible once it is done.

          This may help:

          Code:
          gen fraction_missing = .
          
          foreach var of varlist * {
              bysort year : replace fraction_missing = sum(missing(`var'))
              by year: replace fraction_missing = fraction_missing[_N] / _N
          
              su fraction_missing, meanonly
              if r(max) > 0.5 drop `var'
          }
          The condition that return is zero appears in #1 but is lost otherwise. Perhaps that implies


          Code:
          bysort year : replace fraction_missing = sum(missing(`var') | return == 0)
          Note that varlist * may be too wide, as it includes identifier and time variables, but presumably missing values are not an issue there.

          Comment


          • #6
            Hi Artan,

            Thanks for sharing the data.

            Are you trying to drop if the sum of the missings in all variables is more than 50%? If so, this should help.

            Code:
             egen varname = rmiss2(aaa bbb ccc)
             bysort year: egen tot_miss=total(varname)
              gen x = 1
            egen count= total(x), by(year)
            by year: drop if (tot_miss/(count*3))>0.5
            drop varname tot_miss x
            the *3 in the second last line depends on the number of firms you have. This makes the code a bit inelegant, but it should work.

            Mattia

            Comment


            • #7
              HI Nick and Mattia,

              Thank you very much for your help. You helped me immensely with your insight.

              I tried both your code's and they work perfectly. I have just one more question and sorry for bothering you so much. Is it possible to adjust this code in order to drop if the sum of missing and 0 returns of the variables in one year is bigger than 50%. I have some series with a lot of 0 returns, so i would like to include them to the count of missing?

              I tried some adjustments in your code Nick, but seems i am doing something wrong.

              gen fraction_missing = .
              gen fraction_zero = 0

              foreach var of varlist * {
              bysort year : replace fraction_missing = sum(missing(`var'))
              bysort year : replace fraction_zero = count if (`var'=0)
              by year: replace fraction_missing = fraction_missing[_N]+ fraction_zero[_N] / _N

              su fraction_missing, meanonly
              if r(max) > 0.5 drop `var'
              }

              Thank you very much once again to both of you for sharing your knowledge. I have learned a lot from your comments and the overall platform.

              Sincerely,
              Artan

              Comment


              • #8
                It is a surprise if #5 and #6 both work as #6 is directed at a quite different interpretation of your question. The first error in your code is that putting

                Code:
                count if `var' == 0
                as an assignment will be legal if (and only if) you have a numeric variable count but it is not a way to put the result of a separate count command into your variable, which I think is what you are hoping. .

                Even if that had been correct, the last replacecommand needs parentheses.

                I already signalled in #5 how to look for zero returns in your code. Here is the code again with that suggestion built in. I changed a variable name as what is being counted is more than whether values are missing.

                Code:
                gen fraction_bad = .
                
                foreach var of varlist * {
                    bysort year : replace fraction_bad = sum(missing(`var') | return == 0)
                    by year: replace fraction_bad = fraction_bad[_N] / _N
                
                    su fraction_bad, meanonly
                    if r(max) > 0.5 drop `var'
                }
                To answer a further question, clearly the fraction 0.5 could be any other fraction you want to use.

                More generally, on "seems I am doing something wrong": sure, but precisely what happened? Please read again FAQ Advice #12. What happened could be many different things, such as an error message (what was it?); you got something your didn't want (which was?); and so on.



                Comment


                • #9
                  Hi Nick,

                  I am sorry for the confusion i created and not analyzing more thoroughly the second part of your code. I tried your code with the added aspect of return==0. however i get this result.

                  . foreach var of varlist * {
                  2. bysort year : replace fraction_bad = sum(missing(`var') | return== 0)
                  3. by year: replace fraction_bad = fraction_bad[_N] / _N
                  4.
                  . su fraction_bad, meanonly
                  5. if r(max) > 0.5 drop `var'
                  6. }
                  return not found
                  r(111);

                  end of do-file

                  r(111);

                  I hadn't created a new variable return. Is this maybe the reason? The returns are the `var' themselves. I tried changing the return with `var'==0, but i got the same response. Hope you could help and thank you for the time you devoted in helping. I will follow your advice and read more thoroughly the FAQ Advice #12.

                  Comment


                  • #10
                    Indeed; absent a data example (see FAQ Advice #12) my guess was that you had a variable called return or you would edit the code to suit.

                    But if you change the code as you guessed, then the same error should be impossible.

                    ​​​​​​​

                    Code:
                    gen fraction_bad = .
                    
                    foreach var of varlist * {
                        bysort year : replace fraction_bad = sum(missing(`var') | `var' == 0)
                        by year: replace fraction_bad = fraction_bad[_N] / _N
                    
                        su fraction_bad, meanonly
                        if r(max) > 0.5 drop `var'
                    }

                    Comment


                    • #11
                      Hi Nick,

                      Yes i got a different error once i changed the return command as you also suggested.


                      . gen fraction_bad = .
                      (2,727 missing values generated)

                      .
                      . foreach var of varlist * {
                      2. bysort year : replace fraction_bad = sum(missing(`var') | `var' == 0)
                      3. by year: replace fraction_bad = fraction_bad[_N] / _N
                      4.
                      . su fraction_bad, meanonly
                      5. if r(max) > 0.5 drop `var'
                      6. }
                      type mismatch
                      r(109);

                      end of do-file


                      Comment


                      • #12
                        Still no data example. I guess one or more of your variables is string. Note again the warning previously given in #5

                        Note that varlist * may be too wide, as it includes identifier and time variables, but presumably missing values are not an issue there.
                        Your code instructs Stata to loop over all variables and if just one of them is string you can't compare its values with 0 (numeric) without that error message.

                        Code:
                        ds, has(type string)
                        will tell you which variables are string.

                        Comment


                        • #13
                          Perfect Nick. I resolved the issue. I guess i had missed one string variable.

                          Thank you very much once again!!!

                          Comment


                          • #14
                            Su - It is extremely bad form to post the same question twice in the same day. I wasted my time trying to help you when the question had already been answered. This listserv only survives because we are willing to donate time and you can help that by following the FAQ on asking questions.

                            Comment

                            Working...
                            X