Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting frequency of specific and missing strings

    I have a series of variables with dates in string format, including a mis of M/DD/YY, MM-DD-YY, MM/DD/YYYY, MM/YYYY, and Y).
    I want to find the number of values with missing or incomplete dates for each variable, so that I can create a frequency table or bar graph.

    My first thought was to create a new var whose value is equal to the number of missing values with egen
    Code:
    egen datebcdxmis = count(datebcdx) if datebcdx == "" | length(datebcdx) < 6 | (length(datebcdx) <= 7 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) != "/")
    but count doesn't include missing values.

    I tried inverting the command above to count the number of non-missing/incomplete dates:
    Code:
    egen vmis = count(datebcdx) if !(datebcdx == "" | length(datebcdx) < 6 | (length(datebcdx) <= 7 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) != "/"))
    but I'm not sure how to subtract the result from the total number of values per variable.

    What's the simplest way to approach this?
    Once I figure it out, I'll need to put it in a loop.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10(datebcdx datecnsdx) str8 datelmddx
    "08/2007"    "1/29/08"  ""       
    " 9/28/2008" ""         ""       
    "  7/4/2009" "6/30/09"  "6/30/09"
    "2006"       "9/1/09"   "9/1/09" 
    " 10/3/2009" "10/3/09"  "10/3/09"
    " 8/14/2006" "9/2/2006" ""       
    " 8/30/2006" "8/31/10"  ""       
    " 3/12/2011" "3/12/11"  ""       
    " 3/30/2011" "3-30-11"  "3-30-11"
    "07/2009"    ""         ""       
    end

  • #2
    I'd back up and approach this differently. egen, count() is not where I would start: it's predisposed to count non-missings, which isn't really the main issue here. I see you want to work with the length of the strings, but if you do that then the non-informative leading spaces (and trailing spaces if there are any) are surely best removed. I did this


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10(datebcdx datecnsdx) str8 datelmddx
    "08/2007"    "1/29/08"  ""       
    " 9/28/2008" ""         ""       
    "  7/4/2009" "6/30/09"  "6/30/09"
    "2006"       "9/1/09"   "9/1/09" 
    " 10/3/2009" "10/3/09"  "10/3/09"
    " 8/14/2006" "9/2/2006" ""       
    " 8/30/2006" "8/31/10"  ""       
    " 3/12/2011" "3/12/11"  ""       
    " 3/30/2011" "3-30-11"  "3-30-11"
    "07/2009"    ""         ""       
    end
    
    quietly foreach v of var date*dx { 
        replace `v' = trim(`v') 
        gen `v'2 = daily(`v', "MDY", 2020) 
        noisily tab `v' if missing(`v'2) , missing 
    }
    with this result:

    Code:
       datebcdx |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           2006 |          1      100.00      100.00
    ------------+-----------------------------------
          Total |          1      100.00
    
      datecnsdx |      Freq.     Percent        Cum.
    ------------+-----------------------------------
                |          2      100.00      100.00
    ------------+-----------------------------------
          Total |          2      100.00
    
      datelmddx |      Freq.     Percent        Cum.
    ------------+-----------------------------------
                |          6      100.00      100.00
    ------------+-----------------------------------
          Total |          6      100.00
    I am not convinced that new variables are needed here, but naturally your full dataset may be enormously bigger and may require something else.

    Comment


    • #3
      This is helpful, thank you!

      I don't want to convert the values that are partially missing (eg only contain MM/YYYY) to date format, but they have the same length and same position of "/" from the end as dates in the format M/D/YY (both have "/" at -5), so I added an if clause:
      Code:
      if (length(datebcdx) >= 7 & substr(datecnsdx,-3,1) == "/") | (length(datebcdx) >= 6 & substr(datebcdx,-5,1) == "/" & substr(datebcdx,-3,1) == "/")
      The bolded is meant to prevent MM/YYYY from being converted to date format, but it is not doing so.
      Am I counting characters and using logic correctly?

      Comment


      • #4
        I don't know what you mean by "date format". Format in Stata does not mean variable or storage type. It does mean display format.

        Nothing stops users bringing their own vocabulary to discussions, but the only terminology that makes sense here without a definition of your own is Stata terminology.

        Stata has two ideas in handling dates:

        1. Dates are numeric. Years as usually supplied are fine and usually need no special actions. Even in that case, all dates must be numeric to be useful in Stata. Hence you need conversion functions for string dates,.

        You have some bare years in string form here; naturally they fail as daily dates without a day and month. Similarly you have some monthly dates in string form, which too will fail as daily dates. In either case, it may make sense to impute missing values.

        2. Dates need specific display formats. Dates such as daily and monthly dates must be expressed as integers since the first possible date in 1960 and so should be supplied with display formats so that when seen in output they make sense.

        Nothing in the code you show us has anything to with date display formats.

        So I don't understand your claim that you are trying to prevent MM/YYYY from being converted to date format. As above, what do you mean by date format? Also, none of the code you show us uses any functions to convert from string to dates.

        I think you're hung up on a side-issue, classifying malformed dates. Why is that interesting or useful?

        Comment


        • #5
          I agree about eventually imputing missing dates. My goal was to understand how complete the data set is, by separating dates that are recorded completely vs incompletely.

          With
          Code:
          gen `v'2 = daily(`v', "MDY", 2020)
          , I'm converting the MM/YYYY strings to dates,, when I wanted to count them as missing, because they're incomplete.
          Is there a better way to do categorize and count the string values that I have? Maybe by imputing first and then counting the values that lengthened?

          Comment


          • #6
            Other way round, I think. You can only impute successfully if you know what's missing. There is a discussion of missing day of the month at https://www.stata-journal.com/sjpdf....iclenum=dm0062 Guessing at months if you don't know them is more likely to be intractable.

            But the length of the string is a snare and a distraction, e.g. 9/9/09 (6 characters) is good while 10/2000 (7 characters) isn't . It's the number of slashes that is directly informative. See https://www.stata-journal.com/sjpdf....iclenum=dm0056 which is more germane to you than its title implies.

            That said, you have four date variables which are likely to be related. If you know (e.g.) that they must obey equalities or inequalities that information provides constraints on what you should do.

            Comment

            Working...
            X