Identifying different characteristics in a string data and time variable and changes it to numeric

tig som

Join Date: Sep 2022

Posts: 58
#1

Identifying different characteristics in a string data and time variable and changes it to numeric

06 Mar 2023, 06:54

Hi, I am working with a data and time variable.

The variable has DMY for some observations, DMY hms , for some other observations, and then it has DMYhms and some”000” characteristics for the remaining observation.

I give the -dataex- below to show you what it looks like.

. dataex ID date_time

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time 1 "07.12.2010" 2 "04.09.2010" 3 "08.10.2010" 4 "16.01.2010" 5 "07.02.2013 09:08:20" 6 "26.02.2013 16:22:20" 9 "06.05.2013 09:28:26" 8 "23.05.2018 12.24.00,443000000" 9 "24.10.2018 11.32.17,751000000" 10 "29.01.2019 15.10.00,453000000" end

------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations

I have a large dataset of more than 120,000. Therefore it is very difficult to identify which observation has which of the three formats., or if there is another format inside.

I would appreciate knowing if there is any STATA command that helps to identify the observations are with each of the characteristics.

The other issue is when I change this string variable to numeric format using the clock function, I get missing values for those observations with “DMY” only.

I used the following command.

gen date_time_N = clock(date_time, "DMY hms#")

format date_time_N %tc

Below I show you the data for the two variables “date_time” and “date_time_N”. In my numeric variable “date_time_N “, all the observations with only “DMY” are missing.

. dataex ID date_time date_time_N

copy starting from the next line ------- ---------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float date_time_N 1 "07.12.2010" . 2 "04.09.2010" . 3 "08.10.2010" . 4 "16.01.2010" . 5 "07.02.2013 09:08:20" 1.6758473e+12 6 "26.02.2013 16:22:20" 1.677515e+12 9 "06.05.2013 09:28:26" 1.6834517e+12 8 "23.05.2018 12.24.00,443000000" 1.8426975e+12 9 "24.10.2018 11.32.17,751000000" 1.856e+12 10 "29.01.2019 15.10.00,453000000" 1.864394e+12 end format %tc date_time_N

copy up to and including the previous line -- ---------------

Listed 10 out of 10 observations

One solution I used is to generate another numeric variable with the following command

gen date_time_N2 = clock(date_time, "DMY")

format date_time_N2 %tc

The second numeric viable “ date_time_N2 “ has non-missing only for the observation with “DMY”. Below I show with -dataex- what I “date_time_N2” looks like.

. dataex ID date_time date_time_N2

copy starting from the next line ------- ---------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float date_time_N2 1 "07.12.2010" 1.6072992e+12 2 "04.09.2010" 1.5991776e+12 3 "08.10.2010" 1.6021152e+12 4 "16.01.2010" 1.579219e+12 5 "07.02.2013 09:08:20" . 6 "26.02.2013 16:22:20" . 9 "06.05.2013 09:28:26" . 8 "23.05.2018 12.24.00,443000000" . 9 "24.10.2018 11.32.17,751000000" . 10 "29.01.2019 15.10.00,453000000" . end format %tc date_time_N2

copy up to and including the previous line -- ---------------

Listed 10 out of 10 observations

Then, I combine the two by creating another viable, which is date_time_all, as follows.

gen date_time_N_all = date_time_N

replace date_time_N_all = date_time_N2 if date_time_N_all==. & date_time_N2!=.

. dataex ID date_time date_time_N2 date_time_N_all

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float(date_time_N2 date_time_N_all) 1 "07.12.2010" 1.6072992e+12 1.6072992e+12 2 "04.09.2010" 1.5991776e+12 1.5991776e+12 3 "08.10.2010" 1.6021152e+12 1.6021152e+12 4 "16.01.2010" 1.579219e+12 1.579219e+12 5 "07.02.2013 09:08:20" . 1.6758473e+12 6 "26.02.2013 16:22:20" . 1.677515e+12 9 "06.05.2013 09:28:26" . 1.6834517e+12 8 "23.05.2018 12.24.00,443000000" . 1.8426975e+12 9 "24.10.2018 11.32.17,751000000" . 1.856e+12 10 "29.01.2019 15.10.00,453000000" . 1.864394e+12 end format %tc date_time_N2 format %tc date_time_N_all

------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations

I see the solution I used a bit with a longer step, and I am unsure how precise it will be. I would like to have any other recommendations. Or if there is a danger of using my solution, it would be nice to know. Thank you very much in advance.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#2

06 Mar 2023, 10:35

The only problem with your code is that you must -generate- these clock variables as doubles. You have generated them, by default, as floats.
Comment

Announcement

Identifying different characteristics in a string data and time variable and changes it to numeric

Comment