Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data editor - track changes by unique identifier

    Hello,

    This is my first post and despite my best efforts, I could not find an answer online or on this forum. If there is a link to a forum post already answered that would be very helpful.

    I have imported an Excel file of a recent study. It requires a significant amount of data cleaning. A proportion of this data cleaning will require manual corrections using the data editor. In order for my work to be transparent and reproducible. Is it possible to track each observation change by a unique identifier variable (e.g., Patient 1 or Patient 2 etc.)

    So for example, if I change an observation in variable (var) to 99 in the data editor, the output window log may read:

    replace var = 99 in 1

    The 1 reflects the stata observation line, but this may change etc if sort used.

    Is it possible to change the "1" to a unique identifier variable that may be tracked more easily? Or are people tracking these changes manually on a log file.

    Apologies if this is unclear,

    David

  • #2
    Is it possible to change the "1" to a unique identifier variable that may be tracked more easily?
    As far as I know, it is not possible when you are making the changes directly in the editor.


    Or are people tracking these changes manually on a log file.
    It is likely that many people are doing this. It is not a good idea. It is too easy to record things incorrectly or to omit changes.

    What I do is, I don't make changes in the editor; I do it with commands in a do-file. So, instead of manually replacing the value of var with 99 in observation 1 in the editor, if the value of the unique identifier variable, id, is, say, ID1234, I put this command in my cleaning do-file:
    Code:
    replace var = 99 if id == "ID1234"
    This can be tedious if you are replacing var = 99 in a large number of observations corresponding to a somewhat idiosyncratic list of values of id. When I encounter a situation like that, I create a little data set consisting of just the id's for which this change needs to be made. I might give that data set a name like var_to_99_ids.dta. Then in my do-file I would code:
    Code:
    merge 1:1 id using var_to_99_ids, assert(master match)
    replace var = 99 if _merge == 3
    drop _merge
    When all is said and done, there is no really convenient way to clean data sets that require a lot of idiosyncratic editing. To the extent that you can, in the future, avoid getting data from that source!

    By the way, I congratulate you for even caring about this issue. Too many people regard data management (including cleaning) as a nuisance and beneath their dignity, and don't take care to do it properly. Then they go ahead and analyze data sets applying fancy statistical procedures to incorrect data, sometimes unaware that this is even an issue. And when they eventually get results that are patently wrong, they are stuck as to how to fix it because they don't even know what they did. So, good on you for caring.

    Comment


    • #3
      Thank you very much, this is helpful and transparent.

      Also, thank you for your kind words. I agree with your comments about cleaning.

      I think I have a suggestion for the next STATA upgrade.

      All the best,

      David

      Comment

      Working...
      X