Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying first date of complete data set by subgroup

    Hi. I have data for covid numbers across states and districts (akin to counties) in India for a period of 13 months. However, the date at which data started being collected differs across districts. How do I find the first date that data was collected across all districts? Basically, I want to find the first date at which a complete data set was recorded for all districts? Thanks a lot!

    Code:
    input str40 state str25 district str10 date long confirmed
    
    "Andhra Pradesh"              "Visakhapatnam"  "2020-04-26"   22
    "Andhra Pradesh"              "West Godavari"  "2020-04-26"   51
    "Andhra Pradesh"              "Y.S.R. Kadapa"  "2020-04-26"   58
    "Arunachal Pradesh"           "Lohit"          "2020-04-26"    1
    "Assam"                       "Unknown"        "2020-04-26"   36
    "Bihar"                       "Arwal"          "2020-04-26"    4
    "Bihar"                       "Aurangabad"     "2020-04-26"    2
    "Bihar"                       "Banka"          "2020-04-26"    2
    "Bihar"                       "Begusarai"      "2020-04-26"    9
    "Bihar"                       "Bhagalpur"      "2020-04-26"    5
    "Bihar"                       "Bhojpur"        "2020-04-26"    2
    "Bihar"                       "Buxar"          "2020-04-26"   25
    "Bihar"                       "East Champaran" "2020-04-26"    5
    "Bihar"                       "Gaya"           "2020-04-26"    6
    "Bihar"                       "Gopalganj"      "2020-04-26"   12
    "Bihar"                       "Jehanabad"      "2020-04-26"    1
    "Bihar"                       "Kaimur"         "2020-04-26"   14
    "Bihar"                       "Lakhisarai"     "2020-04-26"    1
    "Bihar"                       "Madhepura"      "2020-04-26"    1
    "Bihar"                       "Munger"         "2020-04-26"   68
    "Bihar"                       "Nalanda"        "2020-04-26"   34
    "Bihar"                       "Nawada"         "2020-04-26"    3
    "Bihar"                       "Patna"          "2020-04-26"   33
    "Bihar"                       "Rohtas"         "2020-04-26"   15
    "Bihar"                       "Saran"          "2020-04-26"    3
    "Bihar"                       "Siwan"          "2020-04-26"   30
    "Bihar"                       "Vaishali"       "2020-04-26"    2
    "Chandigarh"                  "Chandigarh"     "2020-04-26"   36
    "Chhattisgarh"                "Bilaspur"       "2020-04-26"    1
    "Chhattisgarh"                "Durg"           "2020-04-26"    1
    "Chhattisgarh"                "Korba"          "2020-04-26"   28
    "Chhattisgarh"                "Raipur"         "2020-04-26"    6
    "Chhattisgarh"                "Rajnandgaon"    "2020-04-26"    1
    "Delhi"                       "Delhi"          "2020-04-26" 2918
    "Goa"                         "Unknown"        "2020-04-26"    7
    "Gujarat"                     "Ahmedabad"      "2020-04-26" 2181
    "Gujarat"                     "Anand"          "2020-04-26"   49
    "Gujarat"                     "Aravalli"       "2020-04-26"   18
    "Gujarat"                     "Banaskantha"    "2020-04-26"   28
    "Gujarat"                     "Bharuch"        "2020-04-26"   29
    "Gujarat"                     "Bhavnagar"      "2020-04-26"   40
    "Gujarat"                     "Botad"          "2020-04-26"   12

  • #2
    Scott, if I understand correctly, you're looking for the maximum of the starting dates of all districts. The code below stores the maximum in the variable "first_date".

    Code:
    gen date_num = date(date, "YMD")
    format date_num %td
    
    egen first_date = max(date_num)
    format first_date %td

    Comment


    • #3
      Hi Fei. Thanks for this quick response. The code above gives me the maximum date of all districts, i.e. the last date on which the data was collected. I am looking for the earliest staring date across all districts. Let's say the data set spans from 12 April 2020 - May 23 2021. District A started collecting data on 13 Apr 2020, B on 16 Apr 2020, C on 24 April 2020, and D on 6 June 2020. The date that I'm looking for then is 6 June 2020. The code above gives me May 23 2021 instead. Would you happen to know how I can find the earliest date at which the data is complete?

      Comment


      • #4
        Scott, what is your data structure: one observation per district or multiple observations per district? What does the variable "date" mean in your data example: Is it the starting date of each district, or simply a date among many dates within a district?

        Comment


        • #5
          A district has multiple observations across dates. Each date is present only once. The "date" variable is simply a date among many dates within a district, it is the date that corresponds to the number of confirmed covid cases in that row.

          I've pasted below a data sample for just one district. A state has several districts and the dataset has 28 states.


          Code:
          clear
          input str40 state str25 district str10 date long confirmed
          "Madhya Pradesh" "Agar Malwa" "2020-04-26" 11
          "Madhya Pradesh" "Agar Malwa" "2020-04-27" 11
          "Madhya Pradesh" "Agar Malwa" "2020-04-28" 11
          "Madhya Pradesh" "Agar Malwa" "2020-04-29" 12
          "Madhya Pradesh" "Agar Malwa" "2020-04-30" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-01" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-02" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-03" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-04" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-05" 12
          "Madhya Pradesh" "Agar Malwa" "2020-05-06" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-07" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-08" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-09" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-10" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-11" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-12" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-13" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-14" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-15" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-16" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-17" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-18" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-19" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-20" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-21" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-22" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-23" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-24" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-25" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-26" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-27" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-28" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-29" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-30" 13
          "Madhya Pradesh" "Agar Malwa" "2020-05-31" 13
          "Madhya Pradesh" "Agar Malwa" "2020-06-01" 13
          "Madhya Pradesh" "Agar Malwa" "2020-06-02" 13
          "Madhya Pradesh" "Agar Malwa" "2020-06-03" 13
          "Madhya Pradesh" "Agar Malwa" "2020-06-04" 13
          "Madhya Pradesh" "Agar Malwa" "2020-06-05" 14

          Comment


          • #6
            Code:
            gen date_num = date(date, "YMD")
            format date_num %td
            
            bys state district: egen date_min = min(date_num)
            egen first_date = max(date_min)
            format first_date %td
            
            drop date_num date_min

            Comment


            • #7
              See also https://www.stata.com/support/faqs/d...ing-last-date/

              Comment


              • #8
                Thanks, Fei and Nick. I used the following code from the link to give me the result I was looking for:


                Code:
                  
                gen date_num = date(date, "YMD")
                format date_num %td  
                
                by state district (date_num), sort: gen firstdate = date_num[1]
                format firstdate %td

                Comment


                • #9
                  What you have done seems correct for retrieving the first date within a group.

                  Comment

                  Working...
                  X