Split Column

Obaid Ur Rehman

Join Date: May 2019

Posts: 59
#1

14 Dec 2022, 22:14

Dear Statalist, I am facing a problem with the data below. I want to generate two new daily date variables (named "from" and "to") from the string variable "Fn04003".

Is there any way to handle the unrecognized characters �� and then split the column into two separate columns?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str46 Fn04003
"2008-03-21��2008-10-28"
""
""
""
"2009-05-26"
""
"2008-03-21��2008-10-28"
"2008-03-21��2008-10-28"
"2008-03-21��2008-10-28"
"2009-06-26��2009-06-29"
end

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

15 Dec 2022, 01:08

How about:

Code:

gen from = date(substr(Fn04003,1,10),"YMD")
gen to = date(substr(Fn04003,-10,10),"YMD") if length(Fn04003)>10

Obaid Ur Rehman

Join Date: May 2019
Posts: 59

15 Dec 2022, 01:35

Wow, magical. The code worked perfectly. Thanks, Øyvind Snilsberg. The outcome is below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str46 Fn04003 float(from to)
"2008-03-21��2008-10-28" 17612 17833
""                           .     .
""                           .     .
""                           .     .
"2009-05-26"             18043     .
""                           .     .
"2008-03-21��2008-10-28" 17612 17833
"2008-03-21��2008-10-28" 17612 17833
"2008-03-21��2008-10-28" 17612 17833
"2009-06-26��2009-06-29" 18074 18077
end
format %tdCCYY-NN-DD from
format %tdCCYY-NN-DD to

Nick Cox

Join Date: Mar 2014
Posts: 35657

15 Dec 2022, 02:26

Øyvind Snilsberg's code worked fine. On how to identify odd characters, consider this dialogue with the example data.

Code:

. charlist Fn04003
-01235689���

. 
. return li 

macros:
              r(chars) : "-01235689���"
           r(sepchars) : "- 0 1 2 3 5 6 8 9 � � � "
              r(ascii) : "45 48 49 50 51 53 54 56 57 189 191 239 "

. 
. chartab Fn04003

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------------------
        45       \u002d       -     |            22    HYPHEN-MINUS
        48       \u0030       0     |            33    DIGIT ZERO
        49       \u0031       1     |             8    DIGIT ONE
        50       \u0032       2     |            22    DIGIT TWO
        51       \u0033       3     |             4    DIGIT THREE
        53       \u0035       5     |             1    DIGIT FIVE
        54       \u0036       6     |             4    DIGIT SIX
        56       \u0038       8     |            12    DIGIT EIGHT
        57       \u0039       9     |             4    DIGIT NINE
    65,533       \ufffd       �     |            10    REPLACEMENT CHARACTER
------------------------------------+---------------------------------------

                                    freq. count   distinct
ASCII characters              =             110          9
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =              10          1
Total Unicode characters      =             120         10

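charlist from SSC is an old ado (original version 2002) that is still sometimes useful. It only understands single-byte character codes (up to 255), which is why it reports the three bytes 239, 191 and 189 separately rather than one Unicode character. It is fine for identifying character 160, which is a common nuisance character.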

chartab from SSC (Robert Picard) is the better and more versatile tool that understands Unicode.
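
If the separator needs to be handled explicitly rather than skipped by position, one possible sketch (not taken from the posts above) is to replace each U+FFFD that chartab reports with a blank and then split on blanks. The names clean, part, from_alt and to_alt are hypothetical, and ustrunescape() and stritrim() assume Stata 14 or later.

```stata
* Sketch only: assumes the separator is the Unicode replacement
* character U+FFFD, as reported by chartab for the example data.
* clean, part*, from_alt and to_alt are hypothetical names.

* Replace every U+FFFD with a blank, then collapse repeated blanks
gen clean = stritrim(subinstr(Fn04003, ustrunescape("\ufffd"), " ", .))

* Split on blanks: part1 holds the first date, part2 the second (if any)
split clean, generate(part)

* Convert the pieces to daily dates; empty strings become missing
gen from_alt = daily(part1, "YMD")
gen to_alt   = daily(part2, "YMD")
format %tdCCYY-NN-DD from_alt to_alt
```

For data like the example, this should give the same dates as the substr() approach, but it does not depend on each date being exactly 10 characters long.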
