Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Binary if one variable is after another date

    Hi all,

    I'm trying to create a new binary variable that has a '1' if the date and time for the observation in 'Stage 1' is after the date and time in MessageTime (and a '0' if it isn't). I'm having difficulties reformatting the time and date in MessageTime to be the same as that for Stage1 - and I'm not too clear on how to create a binary premised on the above relationship between two times and dates. Any help would be very much appreciated.

    Thank you!

    ---

    Dataex export below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int newid double Stage1 str16 MessageTime
     614                  . "06-10-2018 11:27"
     388 1848413205000.0002 "06-10-2018 15:32"
    1639      1849260197000 "06-10-2018 20:58"
    1759                  . "06-10-2018 21:54"
     370 1849652584000.0005 "06-10-2018 23:48"
     271                  . "06-10-2018 22:42"
     261                  . "06-10-2018 22:30"
    end
    format %tcnn/dd/ccYY_hh:MM Stage1

  • #2
    It's not about reformatting MessageTime. MessageTime is simply a string variable, not a Stata clock variable, so you have to convert it. Once you've done that, creating your indicator is straightforward.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int newid double Stage1 str16 MessageTime
     614                  . "06-10-2018 11:27"
     388 1848413205000.0002 "06-10-2018 15:32"
    1639      1849260197000 "06-10-2018 20:58"
    1759                  . "06-10-2018 21:54"
     370 1849652584000.0005 "06-10-2018 23:48"
     271                  . "06-10-2018 22:42"
     261                  . "06-10-2018 22:30"
    end
    format %tcnn/dd/ccYY_hh:MM Stage1
    
    //    CREATE A STATA INTERNAL FORMAT CLOCK
    //    VARIABLE OUT OF MessageTime
    gen double message_time = clock(MessageTime, "MDYhm")
    assert missing(message_time) == missing(MessageTime)
    format %tcnn/dd/ccYY_hh:MM_AM message_time Stage1
    
    gen byte wanted = Stage1 > message_time if !missing(Stage1)
    Notes:

    1. It is a bit strange that your Stage1 variable was already a proper Stata clock variable, but your MessageTime variable was a string. It may just be that whoever originally created the data set didn't bother to convert MessageTime for who knows what reason. In that case, just converting it now is the solution. But if this data set was created using Stata's -import excel- command, or StatTransfer, MessageTime should have automatically been created as a Stata clock variable. So this suggests to me that there may be some problems in that variable: observations that simply are not valid date/times. That is the reason I included the -assert- command above. If there are invalid dates, they will trigger a break at the assert command. If that happens, running -browse MessageTime if missing(message_time) != missing(MessageTime)- will show you the offending observations and you can go about figuring out how to fix those problems.

    2. I also changed the display format on both of the time variables. As you had it, there was no indication of AM or PM, and the use of hh in the display format also caused Stata to use 0-12 rather than 24 hour time. So as you had it, you could not visually distinguish 4AM from 4PM, although, internally, Stata could and would do all calculations and sortings correctly. While that did no harm to your calculations, I anticipated that without fixing this display problem, you would next be wondering why Stata might think that 10/27/2018 1:00 is after 10/27/2018 3:00, the confusion arising if the 1:00 is PM and the 3:00 is AM but it doesn't show that way.

    Comment


    • #3
      Hi Clyde,

      Thank you for the quick reply. The reformatting seems to work, but no actual observations are created. After running the code, I get a new variable (message_time) but it's just full of missing values. Is there not something I need to do before formatting (with 'replace' or similar) in order to have values to format?

      Thanks!

      Comment


      • #4
        I have no explanation for this. When running the code I showed with your example data, the variable message_date is created and it has no missing values at all. Your real data must be substantially different from the example you showed, and I would not be able to help you without an example of data where the code actually fails. Please post back with that, and also show the exact code you ran with it, and I'll try to troubleshoot it.

        Note: I can think of one possibility you can check out first. Because in your example data you chose a display format for Stage1 that has the dates showing as month/day/year, I assumed that the MessageDate variable is also to be interpreted that way. That is, in your example data I take "6/10/2018" to refer to June 10, not 6 October. If I have misunderstood that, then any value of MessageDate where the first number in the date exceeds 12 will result in an empty value in message_date because months can't go beyond 12, but days certainly can. So if MessageDate should be understood as 6 October, then change "MDYhm" to "DMYhm".


        Comment


        • #5
          Hi Clyde,

          Thank you for your continued guidance here. In response to your points:
          • Please find attached the actual data here (dataex export below). I've also included the code that I'm running, below, too.
          • You are correct that the date is '6 October'. I changed the format to "DMYhm" (as in the code below) but that doesn't appear to fix things
          Also, when running the code below, as you flagged the assert command may highlight some issues. It did indeed:

          Code:
          . assert missing(message_time) == missing(Week1NotificationTime)
          1,979 contradictions in 1,979 observations
          assertion is false
          How can I go about fixing this? I'm unlikely to be able to get any different data unfortunately.

          Many thanks!

          ---

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str17 Week1NotificationTime double message_time
          "20/10/18 02:57:00" .
          "20/10/18 17:50:00" .
          "20/10/18 13:38:00" .
          "20/10/18 08:02:00" .
          "20/10/18 01:54:00" .
          end
          format %tcnn/dd/ccYY_hh:MM_AM message_time
          Code:
          gen double message_time = clock(Week1NotificationTime, "DMYhm")
          assert missing(message_time) == missing(Week1NotificationTime)
          format %tcnn/dd/ccYY_hh:MM_AM message_time


          Comment


          • #6
            OK. From Stata's perspective, the data shown in #5 are nothing at all like the data you showed in #1. There are two very important differences here. The data in #1 had four-digit years, but in #5 you have only two-digit years. And the data in #1 showed only hours and minutes, whereas the data in #5 also include seconds. For that reason, the "DMYhm" information in the original code, which worked perfectly with the data in #1, is a thoroughly inaccurate description of the strings in #5, and so Stata, unable to parse the strings according to it, responded with all missing values. Do note that the -assert- statement fails when you run what you show in #5: that was Stata's way of telling you that the dates are not correct, as explained in my first note in #1.

            The fix is easy: you just have to give Stata the correct description of the strings:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str17 Week1NotificationTime double message_time
            "20/10/18 02:57:00" .
            "20/10/18 17:50:00" .
            "20/10/18 13:38:00" .
            "20/10/18 08:02:00" .
            "20/10/18 01:54:00" .
            end
            format %tcnn/dd/ccYY_hh:MM_AM message_time
            
            drop message_time
            
            gen double message_time = clock(Week1NotificationTime, "DM20Yhms")
            assert missing(message_time) == missing(Week1NotificationTime)
            format %tcnn/dd/ccYY_hh:MM_AM message_time
            
            list, noobs clean
            which will give you as output
            Code:
            . list, noobs clean
            
                Week1Notificati~e         message_time  
                20/10/18 02:57:00   10/20/2018 2:57 AM  
                20/10/18 17:50:00   10/20/2018 5:50 PM  
                20/10/18 13:38:00   10/20/2018 1:38 PM  
                20/10/18 08:02:00   10/20/2018 8:02 AM  
                20/10/18 01:54:00   10/20/2018 1:54 AM
            Unsolicited advice: it Week1Notification is in the same data set as message time and Stage1, or if they are in data sets that are part of the same project and might be used together, I strongly encourage you to:
            a) Either drop the string versions (assuming you have the original data saved somewhere else) or banish them to the end of the data set (-help order, last-) where they will seldom be seen, and
            b) Adopt a uniform display format for all of the date time variables being used in the project. While Stata will not be confused by different display formats when it does calculations with these variables, any humans working with your data are likely to fall into traps and make mistakes when some dates and times have to be read differently than others. For that matter, if you have to leave this behind for a while and then come back to it later, you might even confuse yourself.

            There are two possible morals to this story, and I'm not sure which applies here.

            1. When posting a question, use example data that is actually drawn from your real data: if you make up other data, it may well differ from the real thing in ways that will break whatever solution code others provide you. To the human brain, the differences between the strings in #1 and #5 are ignorable, and barely even noticeable, but to Stata they are quite complicated. In coding there are no unimportant details.

            2. Perhaps you were simply attempting to reapply the code in #1, which worked there, to a new dataset, or another variable in the same dataset, assuming that you needed only to change the variable name and all would be well. Dates and times are among the most challenging aspects of data management in Stata, and it is very easy to make mistakes. If you are going to be working with data like this on a more than occasional basis, you absolutely need to invest (a lot of) time learning how Stata works with dates and times. There is a lengthy chapter on it in the PDF manuals that come installed with your Stata. You need to read it beginning to end. It's a lengthy read, though, fortunately, written very clearly and with lots of worked examples. But you cannot be productive with dates and times in Stata without learning it. You may find that you quickly forget some aspects of the functions and formats that you rarely use, but at least having once been familiar with them, you will be able to quickly refresh your memory with the help files.

            Comment

            Working...
            X