Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • "Efficiency" and "Accuracy" when appending datasets.

    I'd like to throw out a topic for discussion that perhaps doesn't have an exact answer (or perhaps it's quite elementary).

    I am in the progress of, and also stuck in, a data appending task. However, the datasets were from different sources and many of them have apparently been processed by STATA, and therefore stored in various 'storage types'.

    For example, dataset A and B have the following two variables but with different storage types:

    Dataset A
    Variable name Storage type Display format
    servicelocation byte %33.0g
    dob double %td..
    Dataset B
    Variable name Storage type Display format
    servicelocation str18 %18s
    dob float %tc
    I must admit that dealing with date data is a real pain. Especially as some data is defined in milliseconds (e.g., stored as 715305600000 and displayed as 03jan1993 00:00:00) and some in days (e.g., stored as 15352 and displayed as 12jan2002).

    Certainly, I can use many commands/functions such as recast, format, floor() to manipulate the storage type and/or display format in order to appropriately append B to A. I found that when appending data, STATA is particularly strict about the formatting of the data, and it is easy for a type mismatch to result in a null value.

    The problem is that both datasets have more than 500 variables. I may be able to check and reformat 5-8 important variables in order to achieve the correct data append, while it would be unrealistic to check all variables.

    Questions:

    I have an idea that I don't know if it's right (it might sound a bit speculative). As both datasets are complete in terms of the content (i.e. no corrections need to be made). Can I export the datasets to .csv format, then re-import them back to STATA? So, all variables will be converted to string format, and all datasets can be appended smoothly. Certainly, I need to convert the date and other numeric variables back to the sensible formats (e.g., int, float...).

    In the original datasets, most variables were indeed numeric (e.g., 0, 1, 2...) at the backend, but coded using value labels (e.g., 0: Sydney; 1: Canberra...). In the above scenarios regardless whether using original datasets (with value labels), or re-import from csv (no value labels), if I tabulate the variable (e.g., tab servicelocation), it shows:
    Service Location Freq. Precent Cum.
    Sydney 1,321 34.49 34.49
    Canberra 377 6.61 41.10
    ... ... ... ...
    It seems that even when certain categorical variables are stored in string format (red color), they do not seem to affect the analysis.

    So, my final question is: regardless of the date and numeric variables, does it cause any problems if the background values of other variables are erased and only the categories defined by "value labels" are retained?

    Thanks.


  • #2
    Sorry, allow me to add a bit more.

    The point I'm struggling with is that I don't want to spend too much time dealing with data that can't be appended due to inconsistent data formats.

    That's why I thought of a more “efficient” way to polish the data format at once, and then fine-tune a few data formats (e.g., date). My concern was that whether this would affect the subsequent analysis or the "accuracy".

    Comment


    • #3
      I personally don't see any advantage of converting files to CSV, and then re-importing them to Stata format.

      I also don't agree that analysis is unaffected by whether a variable is stored as numeric with value labels, or as a string variable. Tabulation is one of the few places where the results may look similar, but numerical variables lend themselves to many types of analysis that string variables do not (and vice-versa).

      I would suggest a set of procedures to harmonise your data before appending.

      For instance, if you have a categorical variable that is numeric in some datasets (with value labels) and string in others and/or if it is stored numerically with inconsistent value labels across datasets, then you might want to convert such variables to strings before appending.

      Suppose you want to do this for all variables with value labels, you could do something like:

      Code:
      ds , has(vallabel)
      local labeled_vars `r(varlist)'
      
      foreach var of local labeled_vars {
          tempvar temp
          decode `var', gen(`temp')
          drop `var'
          rename `temp' `var'
      }
      If you know that the time in your datetime variables is not needed, you can convert them all to daily date variables as follows:

      Code:
      ds, has(format %tc*)
      local timevars `r(varlist)'
      
      foreach var of local timevars {
          replace `var' = dofc(`var')
          format `var' %td
      }
      Last edited by Hemanshu Kumar; 26 Jun 2023, 03:45.

      Comment


      • #4
        Thank you, @Hemanshu. I will definitely consider you advice.

        Comment

        Working...
        X