Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop variables if no entry for monthly dates within a year

    Hello,

    I am new to Stata. Can you please help me with the following issue: for different companies, which I labeled as id, I have different variables describing some stock characteristics.
    I would like to keep the companies that have data for the stock characteristics for every month within a year.
    The issue is that the data does not appear as missing, is just simply absent.
    For example, if we consider the year 2017, I have the following entries for the dates: date id
    31jan2017 1
    28feb2017 1
    31mar2017 1
    30apr2017 1
    31may2017 1
    30jun2017 1
    31jul2017 1
    31aug2017 1
    30sep2017 1
    31oct2017 1
    30nov2017 1
    31dec2017 1

    and then data entries for other variables.
    How do I drop observations that have only 3 entries for the dates above, as opposed to 12, e.g. for id=2, I only have observations for 28feb2017, 30apr2017 and 30sep2017?

    I thought about doing a bysort by id and date, i.e. bysort id (date): drop if
    but I don't know how to write the condition, as the data is not missing ., but simply not present.

    Thank you for your help.



  • #2
    I am going to assume that your date variable is a numeric daily date, so that

    Code:
    bysort id : egen in2017 = total(year(date) == 2017)
    counts observations in 2017 and so

    Code:
    drop if in2017 == 3
    drops identifiers with just 3 such and

    Code:
    keep if in2017 == 12
    does ... you guess.

    If the latter is essentially your criterion, the previous command is not needed.

    If you only have data for 2017 and no other years (which is not explicit) then

    Code:
    bysort id : keep if _N == 12
    is sufficient.

    Comment


    • #3
      Hi Nick,

      Thank you so much for your answer.
      I have data for several years, and the missing data for some of the companies varies as well. In my example above, I considered just 3 possible data points.
      So is it enough to do: bysort id: keep if _N ==12, even if I have several years?
      My date has the following format: DDMMMYYYY.

      Thank you!

      Comment


      • #4
        Let me repeat the condition for the last command to work: "If you only have data for 2017 and no other years".

        DDMMMYYYY is not a Stata display format. If what you mean is that your date is string, then please tell us.

        Code:
        describe date
        will tell you.
        Last edited by Nick Cox; 04 Mar 2019, 12:51.

        Comment


        • #5
          You're right, sorry!
          Is there a way to do a bysort by id and year, and then specify the condition keep if _N ==12?

          Thank you!

          Comment


          • #6
            Code:
            bysort id year : keep if _N == 12
            will keep complete years for each identifier (assuming that you have a variable year).

            Comment


            • #7
              I get the following output:

              storage display value
              variable name type format label variable label
              ---------------------------------------------------------------------------------------------------------------------------------
              date long %d Monthly Date

              Comment


              • #8
                I generated a variable year as:

                g yr = year(date)

                One question before running the code you proposed: is there a special way to define _N?

                Thank you!

                Comment


                • #9
                  I am not sure that I understand that. _N is the number of observations you have, with the twist that under by: it is the number in each group. You don't define it so much as specify which group(s) you are referring to, the whole dataset or one or more parts.

                  Comment


                  • #10
                    Perfect, it worked, thank you so much!

                    Comment


                    • #11
                      I have a follow-up question: I am trying to compute returns given monthly prices in a year, for every firm.
                      For some reason, it returns that the data is not sorted.
                      If I do a tsset date id and then generate returns as gen ret=price/l.price-1 the missing observations generated are not the correct ones (i.e. the ones for the first price obs, for which there is no lag price).
                      If instead I do a tsset id date and then generate returns as gen ret=price/l.price-1, the ret variable has only missing observations.
                      Can you please help me with this issue?
                      I also tried using bysort id (date): gen ret=price/l.price-1, but I still get only missing values.
                      Thank you!

                      Edit: I fixed the issue. I had to organize first my dates into monthly dates.
                      Last edited by Astrid Koss; 05 Mar 2019, 14:35.

                      Comment

                      Working...
                      X