Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • keep data by different dates

    Dear statalists,

    I encounter a problem related to keeping part of the data by different dates. I have around 1 million of observations with firms and date, but I want to keep the observations with specific dates, for example one day before and one day after 25jan2011, 04feb2011, 16march2011, 24march2011. I've copied part of my data using dataex. I would be very grateful if anyone can help me out!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long date str6 firm
    18630 "004073"
    18631 "004073"
    18632 "004073"
    18633 "004073"
    18634 "004073"
    18637 "004073"
    18637 "004073"
    18638 "004073"
    18639 "004073"
    18639 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18640 "004073"
    18641 "004073"
    18645 "004073"
    18646 "004073"
    18647 "004073"
    18647 "004073"
    18647 "004073"
    18647 "004073"
    18647 "004073"
    18647 "004073"
    18647 "004073"
    18648 "004073"
    18651 "004073"
    18652 "004073"
    18652 "004073"
    18652 "004073"
    18652 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18653 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18654 "004073"
    18655 "004073"
    18655 "004073"
    18655 "004073"
    18658 "004073"
    18659 "004073"
    18659 "004073"
    18660 "004073"
    18660 "004073"
    18661 "004073"
    18662 "004073"
    18665 "004073"
    18666 "004073"
    18667 "004073"
    18668 "004073"
    end
    format %td date

  • #2
    This could be easier if you had an indicator in the dataset that identifies the focal dates. With a list as you provide, you can specify several -inrange()- conditions.

    Code:
    help inrange()
    Code:
    local dates 25jan2011 04feb2011 16march2011 24march2011
    local expression
    foreach date of local dates{
        local expression `expression' inrange(date,`=td(`date')-1',`=td(`date')+1')|
    }
    local expression= substr("`expression'", 1, length("`expression'")-1)
    display "keep if `expression'"
    keep if `expression'
    Res.:

    Code:
    . display "keep if `expression'"
    keep if inrange(date,18651,18653)| inrange(date,18661,18663)| inrange(date,18701,18703)| inrange(date,18709,18711)

    Comment


    • #3
      See https://www.stata.com/support/faqs/d...s-for-subsets/ for another way to do it.

      That is, create a dataset with a variable date with values such as below, and then merge and look for the intersection of datasets.


      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float date
      18651
      18652
      18653
      18661
      18662
      18663
      18701
      18702
      18703
      18709
      18710
      18711
      end

      Comment


      • #4
        Thank you so much for both Andrew and Nick! It helps a lot!

        Comment


        • #5
          I have a follow on question. After I only keep the observations in certain dates, how can I calculate the mean of the variable CSS or count the observation by each entity within the 3-day period of the focal dates?
          I tried bysort RP_ENTITY_ID date: egen MeanCSS=mean(CSS) but it only generate the value for the specific date, and what I want is a value for example one day before and one day after 25jan2011, 04feb2011, 16march2011, 24march2011.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long date str6 RP_ENTITY_ID double CSS
          18653 "004073" 55
          18653 "004073" 56
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 56
          18653 "004073" 55
          18653 "004073" 56
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 55
          18653 "004073" 52
          18653 "004073" 19
          18653 "004073" 17
          18653 "004073" 19
          18653 "004073" 56
          18653 "004073" 50
          18653 "004073" 56
          18653 "004073" 50
          18653 "004073" 52
          18750 "004073" 43
          18750 "004073" 43
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 46
          18751 "004073" 50
          18751 "004073" 19
          18751 "004073" 52
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 46
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 17
          18751 "004073" 50
          18751 "004073" 47
          18751 "004073" 38
          18751 "004073" 47
          18751 "004073" 47
          18751 "004073" 47
          18751 "004073" 19
          18751 "004073" 65
          18751 "004073" 55
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18751 "004073" 50
          18764 "004073" 47
          18764 "004073" 50
          18765 "004073" 50
          18772 "004073" 50
          18772 "004073" 53
          18772 "004073" 55
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 50
          18772 "004073" 52
          18772 "004073" 53
          18772 "004073" 50
          18772 "004073" 47
          18772 "004073" 47
          18772 "004073" 47
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 55
          18772 "004073" 19
          18772 "004073" 19
          18772 "004073" 19
          18772 "004073" 50
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 53
          18772 "004073" 39
          18826 "004073" 47
          18826 "004073" 47
          18834 "004073" 47
          18834 "004073" 47
          18834 "004073" 47
          18834 "004073" 47
          18834 "004073" 50
          18834 "004073" 47
          18834 "004073" 47
          18834 "004073" 47
          18834 "004073" 47
          end
          format %td date
          ------------------ copy up to and including the previous line ------------------

          Comment


          • #6
            See rangestat from SSC.

            Comment


            • #7
              Thank Nick! I really appreciate your help!

              I try to use the code rangestat (mean) CSS , interval(date 18651 18653) by( RP_ENTITY_ID ). A CSS_mean variable was generated but all missing values. And I can't add multiple intervals in the code, but I have different time window around different dates. Do you have any suggestions?

              Comment


              • #8
                Your rangestat call asked for averaging over [date + 18651, date + 18653] which would be dates several decades into the future that you won't have.

                That is, in the syntax (varname #1 #2) #1 #2 are offsets, not absolute values.

                Even if your code had tackled the interval [18651, 18653] what about the others, as you say?

                It is more likely that something like

                Code:
                rangestat (count) count = CSS (mean) mean = CSS , interval(date -2 0) by( RP_ENTITY_ID )
                will get you where you want. Focus on results for which count is 3.

                Help for rangestat says (emphasis added):

                interval(keyvar low high) is required. keyvar is a numeric variable. The lower and upper bound of the
                closed interval to use for each observation can be specified using a numeric variable, a #, or a system
                missing value. If a # is used, the bound for each observation is computed by adding # to keyvar. If
                low is specified using a system missing value, low is set to missing for all observations. rangestat
                applies the same rules as inrange() for missing bounds.
                Last edited by Nick Cox; 15 Dec 2023, 04:01.

                Comment


                • #9
                  Hi Nick, thank you for your kind explanations and also patience. Is it possible for me to use -inrange- so that I can select the dates within sepecific time window? After I ran the code you wrote, I realize that I want a value (mean or count) for each entitly for each period. For example, if the entity has the CSS in both 25 jan 2011, 26 jan 2011, and 27 jan 2011, I want the value generated is the same across the three dates. Is that feasible?

                  Comment


                  • #10
                    1. How many reference dates do you have? What is the minimum spacing between them? I find it hard to suggest ways forward without a better picture of your dataset, meaning a better data example.

                    2. That said, changing the option to

                    interval(date -2 2)

                    should return a count of 3 for dates within each triple (that doesn't overlap with any others) and 0 otherwise. If that doesn't help, then I am guessing too wildly, and please see point 1.
                    Last edited by Nick Cox; 17 Dec 2023, 04:49.

                    Comment


                    • #11
                      I have around 20-30 dates per year as reference dates. The minimum spacing is 3 or 4 days for a couple reference dates. For example, 08 july, 11 july, 15 july. Others have wider spacing such as 1 week.

                      Comment


                      • #12
                        So, the suggestion 2 in #10 won't work as some intervals will overlap.

                        Comment


                        • #13
                          Eventually I want to have one value per entity in each reference dates time window, so that I merge the data with another dataset that contain the reference dates.

                          Comment


                          • #14
                            I need to think more about the overlap. Probably I need to drop the dates with the overlap as there may be confounding effects.

                            Comment


                            • #15
                              I ran the suggestion 2 in #10 for the dates that doesn't have overlap, and it works well. I need to think more about the overlap.

                              One issue now I face is merge the data back to the dataset with the reference dates. For example, if the reference date is actually 02 feb, and the date here is 01 feb, but it should be merged together.Or the dates here are 02may 04may, and the reference date is 03 may.

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input long date str6 RP_ENTITY_ID double(CSS count mean)
                              18749 "07EC43" 55 13  54.23076923076923
                              18749 "07EC43" 55 13  54.23076923076923
                              18749 "07EC43" 50 13  54.23076923076923
                              18749 "07EC43" 50 13  54.23076923076923
                              18749 "07EC43" 55 13  54.23076923076923
                              18749 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18750 "07EC43" 55 13  54.23076923076923
                              18758 "07EC43" 51  5               53.8
                              18758 "07EC43" 51  5               53.8
                              18758 "07EC43" 51  5               53.8
                              18758 "07EC43" 51  5               53.8
                              18758 "07EC43" 65  5               53.8
                              18773 "07EC43" 52  3                 52
                              18773 "07EC43" 52  3                 52
                              18773 "07EC43" 52  3                 52
                              18805 "07EC43" 52  2                 52
                              18805 "07EC43" 52  2                 52
                              18828 "07EC43" 47  2                 47
                              18828 "07EC43" 47  2                 47
                              18834 "07EC43" 50  4                 50
                              18834 "07EC43" 50  4                 50
                              18834 "07EC43" 50  4                 50
                              18834 "07EC43" 50  4                 50
                              18850 "07EC43" 55  4               53.5
                              18850 "07EC43" 55  4               53.5
                              18850 "07EC43" 52  4               53.5
                              18850 "07EC43" 52  4               53.5
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18933 "07EC43" 55 12 54.166666666666664
                              18934 "07EC43" 53 12 54.166666666666664
                              18934 "07EC43" 53 12 54.166666666666664
                              18934 "07EC43" 53 12 54.166666666666664
                              18934 "07EC43" 53 12 54.166666666666664
                              18934 "07EC43" 53 12 54.166666666666664
                              18651 "090427" 52  2                 52
                              18651 "090427" 52  2                 52
                              18661 "090427" 53  1                 53
                              18701 "090427" 52  2                 52
                              18701 "090427" 52  2                 52
                              18709 "090427" 55  5               51.8
                              18709 "090427" 50  5               51.8
                              18709 "090427" 50  5               51.8
                              18709 "090427" 52  5               51.8
                              18709 "090427" 52  5               51.8
                              18736 "090427" 52  4              52.75
                              18736 "090427" 52  4              52.75
                              18736 "090427" 57  4              52.75
                              18738 "090427" 50  4              52.75
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 53 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 53 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18749 "090427" 50 21 50.666666666666664
                              18751 "090427" 50 21 50.666666666666664
                              18751 "090427" 50 21 50.666666666666664
                              18751 "090427" 52 21 50.666666666666664
                              18751 "090427" 50 21 50.666666666666664
                              18751 "090427" 52 21 50.666666666666664
                              18751 "090427" 50 21 50.666666666666664
                              18751 "090427" 52 21 50.666666666666664
                              18751 "090427" 50 21 50.666666666666664
                              18751 "090427" 52 21 50.666666666666664
                              18756 "090427" 52  8             50.875
                              18756 "090427" 52  8             50.875
                              18756 "090427" 50  8             50.875
                              18756 "090427" 50  8             50.875
                              18756 "090427" 50  8             50.875
                              18757 "090427" 53  8             50.875
                              18758 "090427" 50  8             50.875
                              18758 "090427" 50  8             50.875
                              18764 "090427" 50  5               52.8
                              18764 "090427" 50  5               52.8
                              18765 "090427" 57  5               52.8
                              18765 "090427" 57  5               52.8
                              18766 "090427" 50  5               52.8
                              18771 "090427" 50  2                 50
                              18772 "090427" 50  2                 50
                              end
                              format %td date

                              Comment

                              Working...
                              X