Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying different characteristics in a string data and time variable and changes it to numeric


    Hi, I am working with a data and time variable.







    The variable has DMY for some observations, DMY hms , for some other observations, and then it has DMYhms and some”000” characteristics for the remaining observation.

    I give the -dataex- below to show you what it looks like.


    . dataex ID date_time

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ID str29 date_time
     1 "07.12.2010"                   
     2 "04.09.2010"                   
     3 "08.10.2010"                   
     4 "16.01.2010"                   
     5 "07.02.2013 09:08:20"          
     6 "26.02.2013 16:22:20"          
     9 "06.05.2013 09:28:26"          
     8 "23.05.2018 12.24.00,443000000"
     9 "24.10.2018 11.32.17,751000000"
    10 "29.01.2019 15.10.00,453000000"
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 10 out of 10 observations





    I have a large dataset of more than 120,000. Therefore it is very difficult to identify which observation has which of the three formats., or if there is another format inside.

    1. I would appreciate knowing if there is any STATA command that helps to identify the observations are with each of the characteristics.
    1. The other issue is when I change this string variable to numeric format using the clock function, I get missing values for those observations with “DMY” only.



    I used the following command.




    gen date_time_N = clock(date_time, "DMY hms#")

    format date_time_N %tc




    Below I show you the data for the two variables “date_time” and “date_time_N”. In my numeric variable “date_time_N “, all the observations with only “DMY” are missing.


    . dataex ID date_time date_time_N

    copy starting from the next line ------- ---------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ID str29 date_time float date_time_N
    1 "07.12.2010"                                .
    2 "04.09.2010"                                .
    3 "08.10.2010"                                .
    4 "16.01.2010"                                .
    5 "07.02.2013 09:08:20"           1.6758473e+12
    6 "26.02.2013 16:22:20"            1.677515e+12
    9 "06.05.2013 09:28:26"           1.6834517e+12
    8 "23.05.2018 12.24.00,443000000" 1.8426975e+12
    9 "24.10.2018 11.32.17,751000000"     1.856e+12
    10 "29.01.2019 15.10.00,453000000"  1.864394e+12
    end
    format %tc date_time_N
    copy up to and including the previous line -- ---------------

    Listed 10 out of 10 observations



    One solution I used is to generate another numeric variable with the following command

    gen date_time_N2 = clock(date_time, "DMY")

    format date_time_N2 %tc

    The second numeric viable “ date_time_N2 “ has non-missing only for the observation with “DMY”. Below I show with -dataex- what I “date_time_N2” looks like.




    . dataex ID date_time date_time_N2

    copy starting from the next line ------- ---------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ID str29 date_time float date_time_N2
    1 "07.12.2010"                    1.6072992e+12
    2 "04.09.2010"                    1.5991776e+12
    3 "08.10.2010"                    1.6021152e+12
    4 "16.01.2010"                     1.579219e+12
    5 "07.02.2013 09:08:20"                       .
    6 "26.02.2013 16:22:20"                       .
    9 "06.05.2013 09:28:26"                       .
    8 "23.05.2018 12.24.00,443000000"             .
    9 "24.10.2018 11.32.17,751000000"             .
    10 "29.01.2019 15.10.00,453000000"             .
    end
    format %tc date_time_N2
    copy up to and including the previous line -- ---------------

    Listed 10 out of 10 observations



    Then, I combine the two by creating another viable, which is date_time_all, as follows.




    gen date_time_N_all = date_time_N




    replace date_time_N_all = date_time_N2 if date_time_N_all==. & date_time_N2!=.



    . dataex ID date_time date_time_N2 date_time_N_all

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ID str29 date_time float(date_time_N2 date_time_N_all)
     1 "07.12.2010"                    1.6072992e+12 1.6072992e+12
     2 "04.09.2010"                    1.5991776e+12 1.5991776e+12
     3 "08.10.2010"                    1.6021152e+12 1.6021152e+12
     4 "16.01.2010"                     1.579219e+12  1.579219e+12
     5 "07.02.2013 09:08:20"                       . 1.6758473e+12
     6 "26.02.2013 16:22:20"                       .  1.677515e+12
     9 "06.05.2013 09:28:26"                       . 1.6834517e+12
     8 "23.05.2018 12.24.00,443000000"             . 1.8426975e+12
     9 "24.10.2018 11.32.17,751000000"             .     1.856e+12
    10 "29.01.2019 15.10.00,453000000"             .  1.864394e+12
    end
    format %tc date_time_N2
    format %tc date_time_N_all
    ------------------ copy up to and including the previous line ------------------

    Listed 10 out of 10 observations






    I see the solution I used a bit with a longer step, and I am unsure how precise it will be. I would like to have any other recommendations. Or if there is a danger of using my solution, it would be nice to know. Thank you very much in advance.



  • #2
    The only problem with your code is that you must -generate- these clock variables as doubles. You have generated them, by default, as floats.

    Comment

    Working...
    X