Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting inconsistent dates stored as strings to Stata dates

    Hi Statalist,
    Sorry for asking questions about date and time. I really did look through the
    help datetime.

    I am working with an inconsistent string variable from old parish registers, that have text, brackets, years, months and days mixed in different orders, while I need to extract the year. What gave me hope was that the code Y in mask for date() only converts 4-digit numbers Since this is the only reliable consistency I have found. So I managed to convert a fraction of the observations using:
    Code:
    gen bapdate = date(daapsdato, "Y#")
    It gave me changes where the year was the first four digits, but I have been struggling to find a function to ignore the text and characters before Y. # seems to have limited capacity when placed in front of #Y.

    The topyear function gave me hope for sorting away impossible dates, but it seems to only function for 2-digit years? I am hoping to limit this converstion to those 4-digit numbers that start with 16, 17, 18 and 19, as my data is primarily from those centuries. Sorry for the inconvenience. Sincerely a historian from Norway
    Attached Files

  • #2
    Hi Marko,

    First, could you please provide a data example generated with the -dataex- command? It should be helpful here, since we want to be sure we have examples of all of the special cases, and because it outputs some data in a command that we can insert into Stata and work with on our end.

    There are a few general ways to deal with this. If every string-date pattern is interpretable by the date() function, you should be able to interpret each pattern as a new variable, then combine variables into a single date encoded variable. If certain strings cannot be interpreted as dates by the date() function, then this is a little more complicated, but you can use the string manipulation functions to extract the relevant portion of the string and convert that to a Stata date.

    Finally, if the dataset is small enough or if there are few enough odd string patterns, you can open the data in the data editor and manually change the value of the strings. This might sound like heresy because this can be error prone and difficult to reproduce, particularly when compared to a procedure in a do file. However, it can be relatively easy to do this manually for highly idiosyncratic data compared to a programmatic solution. Just be sure to back up the original data and clearly document the changes you've made.

    Edit: Again, a -dataex- example would be necessary if you're looking for more specific advice, or even a possible implementation.
    Last edited by Daniel Schaefer; 28 Sep 2022, 09:27.

    Comment


    • #3
      Thank you for your reply and advice Daniel.
      Here is a sample of the variable in question. These are specified baptism dates.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str142 daapsdato
      "07-23"                                                         
      "08-31"                                                         
      "07-19"                                                         
      "08-08"                                                         
      "D. Sexagesima [1695-01-27] E:Søndag før fastelaven"          
      "D. Invocavit [1695-02-10] E:1. søndag i fasten"               
      "D. Heminisc [1695-02-17] E:2. søndag i fasten"                
      "D. Oculi [1695-02-24] E:3. søndag i fasten"                   
      "D. Lartare [1695-03-03] E:4. søndag i fasten"                 
      "D. Lartare [1695-03-03] E:4. søndag i fasten"                 
      "D. Lartare [1695-03-03] E:4. søndag i fasten"                 
      "D. Judica [1695-03-10] E:5. søndag i fasten"                  
      "D. Palmarum [1695-03-17] E:Palmesøndag"                       
      "F. 2-de Paschatos [1695-03-25] E:2. Påskedag"                 
      "D. Quasimod [1695-03-31] E:1. søndag etter påske"            
      "D. Misericord [1695-04-07] E:2. søndag etter påske"          
      "D. Cantate [1695-04-21] E:4. søndag etter påske"             
      "F. 1. Pentecoste [1695-05-12] E:1. Pinsedag"                   
      "D. 2. P. Trinit[[06-02]"                                       
      "D. 3. P. Trinit[[06-09]"                                       
      "D. 6. P. Trinit[[06-30]"                                       
      "D. 6. P. Trinit[[06-30]"                                       
      "D. 7. P. Trinit[[07-07]"                                       
      "D. 11. P. Trinit[[08-04]"                                      
      "D. 12. P. Trinit[[08-11]"                                      
      "D. 13. P. Trinit[[08-18]"                                      
      "D. 13. P. Trinit[[08-18]"                                      
      "D. 13. P. Trinit[[08-18]"                                      
      "D. 14. P. Trinit[[08-25]"                                      
      "D. 16. P. Trinit[[09-08]"                                      
      "D. 17. P. Trinit[[09-15]"                                      
      "18. P. Trinit [1695-09-22"                                     
      "Festo Michael [1695-09-29] E:Mikkelsmesse"                     
      "D. 20. P. Trinit[[10-06]"                                      
      "D. 21. P. Trinit[[10-13]"                                      
      "D. 21. P. Trinit[[10-13]"                                      
      "D. 23. P. Trinit[[10-27]"                                      
      "D. 22. P. Trinit[[10-20]"                                      
      "D. 22. P. Trinit[[10-20]"                                      
      "D. 25. P. Trinit[[11-10]"                                      
      "D. 25. P. Trinit[[11-10]"                                      
      "D. 26. P. Trinit[[11-17]"                                      
      "D. 26. P. Trinit[[11-17]"                                      
      "D. 27. P. Trinit[[11-24]"                                      
      "1. advent [1695-12-01"                                         
      "2. advent [1695-12-08"                                         
      "F. 1. Nativ [1695-12-25] E:1. Juledag"                         
      "F. 1. Nativ [1695-12-25] E:1. Juledag"                         
      "F. 3. Nativ [1695-12-27] E:3. Juledag"                         
      "F. 3. Nativ [1695-12-27] E:3. Juledag"                         
      "F.Cise Christi [1696-01-01] E:Nyttårsdag"                     
      "F.Cise Christi [1696-01-01] E:Nyttårsdag"                     
      "D.Cise Epiph [1696-01-01] E:Nyttårsdag"                       
      "F.Epiph [1696-01-06] E:Hellig tre kongers dag"                 
      "D.1.Epiph [1696-01-12] E:1.s. e. H3Kongers dag"                
      "D.1.Epiph [1696-01-12] E:1.s. e. H3Kongers dag"                
      "D.2.Epiph [1696-01-19] E:2.s. e. H3Kongers dag"                
      "D.3.Epiph [1696-01-26] E:3.s. e. H3Kongers dag"                
      "D.3.Epiph [1696-01-26] E:3.s. e. H3Kongers dag"                
      "D.3.Epiph [1696-01-26] E:3.s. e. H3Kongers dag"                
      "[01-29"                                                        
      "F.Purif Maria [1696-02-02] E:Kyndelsmesse, Marias renselsesdag"
      "F.Purif Maria [1696-02-02] E:Kyndelsmesse, Marias renselsesdag"
      "[02-08"                                                        
      "D.Septuages[[02-09]"                                           
      "D.Sexages [1696-02-16] E:Søndag før fastelaven"              
      "D.Esto Michi [1696-02-23] E:Fastelavenssøndag"                
      "D.Invocevitas [1696-03-01] E:1. søndag i fasten"              
      "D.Invocevitas [1696-03-01] E:1. søndag i fasten"              
      "D. Reminisc [1696-03-01] E:2. søndag i fasten"                
      "D. Judica [1696-03-29] E:5. søndag i fasten"                  
      "D. Lætare [1696-03-22] E:4. søndag i fasten"                 
      "D. Misencord [1696-04-26] E:2. søndag etter påske"           
      "D. Jubilate [1696-05-03] E:3. søndag etter påske"            
      "[05-08"                                                        
      "D. Cantate [1696-05-10] E:4. søndag etter påske"             
      "D. Cantate [1696-05-10] E:4. søndag etter påske"             
      "D. Cantate [1696-05-10] E:4. søndag etter påske"             
      "D. Rogate [1696-05-17] E:5. søndag etter påske"              
      "F. 2. Pentecost [1696-06-01] E:2. Pinsedag"                    
      "D. 3. P Trinit[[06-28]"                                        
      "D. 3. P Trinit[[06-28]"                                        
      "D. 3. P Trinit[[06-28]"                                        
      "6. P. Trinit [1696-07-19"                                      
      "6. P. Trinit [1696-07-19"                                      
      "[07-16"                                                        
      "8. Trinit [1696-08-02"                                         
      "8. Trinit [1696-08-02"                                         
      "10. Trinit [1696-08-16"                                        
      "12. Trinit [1696-08-30"                                        
      "13. Trinit [1696-09-06"                                        
      "14. Trinit [1696-09-13"                                        
      "17. Trinit [1696-10-04"                                        
      "F. Omni. Sanct. [1696-11-01] E:Alle helgens dag"               
      "F. 3. Nativ [1696-12-27] E:3. Juledag"                         
      "F. 3. Nativ [1696-12-27] E:3. Juledag"                         
      "D. 1. P. Epiph [1697-01-10] E:1. s. e. Hellig 3-kongers d."    
      "D. Sexages. [1697-02-07] E:Søndag før fastelaven"            
      "D. Remin. [1697-02-28] E:2. søndag i fasten"                  
      "D. Remin. [1697-02-28] E:2. søndag i fasten"                  
      end
      I am afraid the data set is very large, I have been forced to extract a portion of it 825 000, of the 9.3 million I have received. Running commands with that many took half an hour.

      Comment


      • #4
        There are definitely a few interesting problems to solve here. I don't believe there is a one command solution for this kind of thing. Basically, we need to program a solution.

        First, I'm noticing that in the example data you provide, dates appear to take on exactly two forms: there are dash separated dates with a year-month-day format and dash separated dates without a year but with a month-day format. These dates are contained within an arbitrary (from my point of view) larger string. Assuming this pattern is consistent across the entire dataset, you can extract that information from each string using a regular expression. There are a few ways to do this, but I prefer to extract each pattern one at a time like so:

        Code:
        // extract dates of the form 1695-01-27
        gen datesubstring = regexs(0) if regexm(daapsdato, "[0-9]+[\-]+[0-9]+[\-]+[0-9]+")
        // extract dates of the form 07-23
        replace datesubstring = regexs(0) if regexm(daapsdato, "[0-9]+[\-]+[0-9]+") ///
                                          & datesubstring == ""
        Next, we need to add the year to the dates where the year is missing. This isn't trivial. I provide a lengthy explanation in the next post.

        If all goes well, you should be able to convert the string to a stata date like so:

        Code:
        gen stata_date = date(datesubstring, "YMD")
        format stata_date %td
        list stata_date

        Comment


        • #5
          Now comes the hard part. How can we assign the correct year to dates with the month-day pattern? Based on your example data, it looks like entries are already ordered by the date. We can take advantage of this property and fill in the missing information based on what we know about the previous entries. The key takeaway here is that in cases where the year is unknown, we assume it is the year of the most recent previous entry.

          Code:
          // Implementation 1
          // use the - character to split each date substring into its component parts.
          split datesubstring, parse("-")
          // fill in the most recently entered year in the year column.
          replace datesubstring1 = datesubstring1[_n-1] if datesubstring3 == ""
          // merge with the original string.
          replace datesubstring = datesubstring1 + "-" + datesubstring if datesubstring3 == ""
          Note that this code does not work for the first few entries of the example dataset because there is no "last known year." If this is the case in your current dataset, you can fill in the year for the first entry manually based on your own knowledge of the data, or if you don't know the year, drop those dates from the analysis.

          There may be other problems. For instance, is it possible that someone created an entry for January 1st using only the month and the day but without including a year? We might be able to handle this edge case if we can infer from an out-of-order month that the year has actually rolled over. Handling this case will require something a little bit more complicated. There might actually be a compact vectorized way to do this. I've been working on one, but I'm running out of time today. I tend to think in for loops and primitives so here is an initial implementation. If I can manage it I will try to repost a more compact solution like the one above tonight.

          Code:
          // Implementation 2
          split datesubstring, parse("-")
          // track the current month and previous month as we loop through the data.
          local previous_month = 0
          local current_month = 0
          // loop through each observation one by one.
          forv obs = 1/`=_N'{
              // In this case we already have the current year.
              if datesubstring3[`obs'] != ""{
                  // Update the current year.
                  local current_year = datesubstring1[`obs']
                  // Update the previous month to prepare for the next iteration.
                  local previous_month = datesubstring2[`obs']
              }
              // In this case we need to set the year.
              if datesubstring3[`obs'] == "" & ("`current_year'" != ""){
                  // update the current month.
                  local current_month = datesubstring1[`obs']
                  // has the year rolled over, but the entry doesn't reflect that?
                  if real("`current_month'") < real("`previous_month'"){
                      // if so, add 1 to the current year.
                      local current_year = real("`current_year'") + 1
                  }
                  // in any case, update the previous month.
                  local previous_month = "`current_month'"
                  // finally, append the current year to the string.
                  replace datesubstring = "`current_year'" + "-" + datesubstring[`obs'] in `obs'
              }
          }
          I also wrote a small dataset to test the new code:

          Code:
          clear
          input str142 daapsdato
          "D. Remin. [1697-02-28] E:2. søndag i fasten"  
          "01-28"
          "02-3"
          "4-05"
          "3-06"
          "1970-01-18"
          "02-28"
          "1971-12-31"
          "1-1" 
          end
          Even with this updated solution, if one entry is for, say, 1670-01-18 but only says [01-18] and the very next entry is for 1671-02-18 but the entry only says [02-18] then the second entry will have the wrong year associated with it. The point here is that the underlying assumptions I am making about your data really matter, and when those assumptions are violated then the code will make incorrect inferences. Am I correct that the entries should always be ordered in time? I notice days of the month are not always in order in the example data. Ultimately it is up to you to understand your data, the assumptions that are justified, and your willingness to introduce error.

          Comment


          • #6
            Thank you for going above and beyond my request Daniel.
            I have managed to convert the years within the variable with the code you provided:
            Code:
            gen year = ustrregexs(0) if ustrregexm(daapsdato, "[1][6-9][0-9][0-9]")
            This helped me grab the possible years, and exclude those where I found months and days did not have a dash seperator. This only gave me about 32,000 years among the 825,000. Your train of thought got me thinking that it should be possible to assign at least a big portion of the missing observations with years through chronology. As I have a different variable that lets me sort them in chronological order, but the issue is as you mention that it is not always certain that the year is provided for due to transcribing error.

            Realistically speaking even the smallest parishes had a few births each year, and the priest wrote them down as they happened, but the transcription tool did not necessarily pick up the year on each entry in the register. But it is safe to assume that a year should have been transcribed when ever it occured in the entry, so if the variable that controls for chronology is not broken, the last transcribed year should in most cases be correct within "daapsdato". I am willing to introduce the error you speak of by implementing this solution.

            I must admit that my limited Stata skills are insufficient to understand the complex code you provided in your last reply. Hoping that some of this is possible to copy for effect.
            Apologies for the late replies, the time difference takes some getting used to.

            Marko

            Comment


            • #7
              Hi Marko,

              it is safe to assume that a year should have been transcribed when ever it occured in the entry, so if the variable that controls for chronology is not broken, the last transcribed year should in most cases be correct within "daapsdato".
              In that case implementation #1 (above) should work for your purposes. Do you only care about the year portion of the date, (not the mouth and day)? If so, you can simply fill in the missing years like so:

              Code:
              replace year = year[_n-1] if year == ""
              If you do care about the month and day, things could get a little more complicated. Congratulations! I believe you have found an elegant simplification of the problem.

              Edit: note that, again, if the first few dates don't have a year associated with them (as in the example data), no year will be filled in.
              Last edited by Daniel Schaefer; 30 Sep 2022, 09:31.

              Comment

              Working...
              X