Counting frequency of specific and missing strings

August Anderson

Join Date: Aug 2018

Posts: 5
#1

Counting frequency of specific and missing strings

17 Dec 2018, 06:29

I have a series of variables with dates in string format, including a mis of M/DD/YY, MM-DD-YY, MM/DD/YYYY, MM/YYYY, and Y).
I want to find the number of values with missing or incomplete dates for each variable, so that I can create a frequency table or bar graph.

My first thought was to create a new var whose value is equal to the number of missing values with egen

Code:

egen datebcdxmis = count(datebcdx) if datebcdx == "" | length(datebcdx) < 6 | (length(datebcdx) <= 7 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) != "/")

but count doesn't include missing values.

I tried inverting the command above to count the number of non-missing/incomplete dates:

Code:

egen vmis = count(datebcdx) if !(datebcdx == "" | length(datebcdx) < 6 | (length(datebcdx) <= 7 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) != "/"))

but I'm not sure how to subtract the result from the total number of values per variable.

What's the simplest way to approach this?
Once I figure it out, I'll need to put it in a loop.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str10(datebcdx datecnsdx) str8 datelmddx "08/2007" "1/29/08" "" " 9/28/2008" "" "" " 7/4/2009" "6/30/09" "6/30/09" "2006" "9/1/09" "9/1/09" " 10/3/2009" "10/3/09" "10/3/09" " 8/14/2006" "9/2/2006" "" " 8/30/2006" "8/31/10" "" " 3/12/2011" "3/12/11" "" " 3/30/2011" "3-30-11" "3-30-11" "07/2009" "" "" end
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35792

17 Dec 2018, 06:45

I'd back up and approach this differently. egen, count() is not where I would start: it's predisposed to count non-missings, which isn't really the main issue here. I see you want to work with the length of the strings, but if you do that then the non-informative leading spaces (and trailing spaces if there are any) are surely best removed. I did this

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10(datebcdx datecnsdx) str8 datelmddx
"08/2007"    "1/29/08"  ""       
" 9/28/2008" ""         ""       
"  7/4/2009" "6/30/09"  "6/30/09"
"2006"       "9/1/09"   "9/1/09" 
" 10/3/2009" "10/3/09"  "10/3/09"
" 8/14/2006" "9/2/2006" ""       
" 8/30/2006" "8/31/10"  ""       
" 3/12/2011" "3/12/11"  ""       
" 3/30/2011" "3-30-11"  "3-30-11"
"07/2009"    ""         ""       
end

quietly foreach v of var date*dx { 
    replace `v' = trim(`v') 
    gen `v'2 = daily(`v', "MDY", 2020) 
    noisily tab `v' if missing(`v'2) , missing 
}

with this result:

Code:

   datebcdx |      Freq.     Percent        Cum.
------------+-----------------------------------
       2006 |          1      100.00      100.00
------------+-----------------------------------
      Total |          1      100.00

  datecnsdx |      Freq.     Percent        Cum.
------------+-----------------------------------
            |          2      100.00      100.00
------------+-----------------------------------
      Total |          2      100.00

  datelmddx |      Freq.     Percent        Cum.
------------+-----------------------------------
            |          6      100.00      100.00
------------+-----------------------------------
      Total |          6      100.00

I am not convinced that new variables are needed here, but naturally your full dataset may be enormously bigger and may require something else.

Comment

August Anderson

Join Date: Aug 2018

Posts: 5
#3

17 Dec 2018, 08:18

This is helpful, thank you!

I don't want to convert the values that are partially missing (eg only contain MM/YYYY) to date format, but they have the same length and same position of "/" from the end as dates in the format M/D/YY (both have "/" at -5), so I added an if clause:

Code:

if (length(datebcdx) >= 7 & substr(datecnsdx,-3,1) == "/") | (length(datebcdx) >= 6 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) == "/")

The bolded is meant to prevent MM/YYYY from being converted to date format, but it is not doing so.
Am I counting characters and using logic correctly?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35792
#4

17 Dec 2018, 08:41

I don't know what you mean by "date format". Format in Stata does not mean variable or storage type. It does mean display format.

Nothing stops users bringing their own vocabulary to discussions, but the only terminology that makes sense here without a definition of your own is Stata terminology.

Stata has two ideas in handling dates:

1. Dates are numeric. Years as usually supplied are fine and usually need no special actions. Even in that case, all dates must be numeric to be useful in Stata. Hence you need conversion functions for string dates,.

You have some bare years in string form here; naturally they fail as daily dates without a day and month. Similarly you have some monthly dates in string form, which too will fail as daily dates. In either case, it may make sense to impute missing values.

2. Dates need specific display formats. Dates such as daily and monthly dates must be expressed as integers since the first possible date in 1960 and so should be supplied with display formats so that when seen in output they make sense.

Nothing in the code you show us has anything to with date display formats.

So I don't understand your claim that you are trying to prevent MM/YYYY from being converted to date format. As above, what do you mean by date format? Also, none of the code you show us uses any functions to convert from string to dates.

I think you're hung up on a side-issue, classifying malformed dates. Why is that interesting or useful?
Comment
August Anderson

Join Date: Aug 2018

Posts: 5
#5

17 Dec 2018, 09:26

I agree about eventually imputing missing dates. My goal was to understand how complete the data set is, by separating dates that are recorded completely vs incompletely.

With

Code:

gen `v'2 = daily(`v', "MDY", 2020)

, I'm converting the MM/YYYY strings to dates,, when I wanted to count them as missing, because they're incomplete.
Is there a better way to do categorize and count the string values that I have? Maybe by imputing first and then counting the values that lengthened?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35792
#6

17 Dec 2018, 13:11

Other way round, I think. You can only impute successfully if you know what's missing. There is a discussion of missing day of the month at https://www.stata-journal.com/sjpdf....iclenum=dm0062 Guessing at months if you don't know them is more likely to be intractable.

But the length of the string is a snare and a distraction, e.g. 9/9/09 (6 characters) is good while 10/2000 (7 characters) isn't . It's the number of slashes that is directly informative. See https://www.stata-journal.com/sjpdf....iclenum=dm0056 which is more germane to you than its title implies.

That said, you have four date variables which are likely to be related. If you know (e.g.) that they must obey equalities or inequalities that information provides constraints on what you should do.
1 like
Comment

Announcement