Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Easily identifying and rectifying different variable types for append/merge

    I am attempting to append/merge several sets of data. These contain about 30-135 variables, many of which are various ID numbers. Frequently, they are stored as float or long in one dataset, and as string in another. This makes the append process very tedious, as I open each dataset in a different window, type "codebook" and manually go through each shared variable to ensure they are the same variable type.

    The best way I can think to speed this up is to "tostring" every variable that is not intended to be a date or amount. I'm not sure if there are unforeseen consequences to doing this. (And I may need to convert dates to strings anyway, as they're saved as strings in one dataset and it seems that destringing them would be more hassle than it may be worth.)

    Is there a faster way to identify conflicts in data types before appending/merging, and are there issues with the idea to make all non-amount variables string that I'm not seeing? I am using Stata 18.5.

    To add to this, "tostring" doesn't always seem to stick. The example below is copy/pasted from my results window, aside from text in brackets.

    Code:
    . append using "[dataset2]"
    variable [var1] is byte in master but str1 in using data
        You could specify append's force option to ignore this numeric/string mismatch.
        The using variable would then be treated as if it contained numeric missing
        value.
    r(106);
    
    . tostring [var1], replace
    [var1] already string; no replace
    
    . append using "[dataset2]"
    variable [var1] is byte in master but str1 in using data
        You could specify append's force option to ignore this numeric/string mismatch.
        The using variable would then be treated as if it contained numeric missing
        value.
    r(106);
    Thanks!
    Last edited by Ky Neal; 07 Sep 2025, 10:37. Reason: Added issue with "tostring" not seeming to work.

  • #2
    Mark Chatfield's -precombine- package (available from Stata journal, -net install dm0081.pkg-) will examine the data sets and identify all incompatible variables like this. It will also tell you if you have value-labeled numeric variables whose value labels in the data sets do not completely agree. It's a great utility and speeds up this kind of work enormously. You can then fix both data sets before trying to combine (whether through -merge- or -append- or -joinby-) them.

    And I may need to convert dates to strings anyway, as they're saved as strings in one dataset and it seems that destringing them would be more hassle than it may be worth.
    No, no, no! You should never -destring- a string variable that represents a date. That will just give you garbage. You have to convert them to Stata numeric date variables using one of Stata's functions, -daily()-, -clock()-, -monthly()-, etc. according to its content. If you try to -destring- something like "7sep2025" you will just get missing values because "7sep2025" is not the string representation of any number. If you try to -destring- something like "07092025" you will get numeric value 7092025, which is not a usable numeric representation of the date 7 sep 2025. Actually, something like "07092025" is your worst case scenario because you will first have to use the -substr()- function to break that up into "07", "09", and "2025", then -destring- those, and then use -mdy()- to get a Stata numeric date. If you are confronting that kind of string representation of a date, the simplest way deal with that is with Nick Cox's -numdate- command, available from SSC.

    The best way I can think to speed this up is to "tostring" every variable that is not intended to be a date or amount.
    It depends on the specific contents of the two data sets, but I would say that only occasionally will this really be a good way to go. Numeric variables, in Stata, are represented in binary. Just as not all numbers can be represented as a finite string of decimal digits (e.g. 1/3), not all numbers can be represented as a finite string of binary digits. Consequently Stata's internal representation of (non-integer) number are often approximations. High order approximations, but approximations nonetheless. And -tostring- may convert them in ways that you don't like. Also -tostring- may recognize that it is dealing with such a situation and refuse to make the conversion at all to avoid loss of precision. You are probably better off converting string variables to numeric.

    One marked exception to the above advice is with value labeled numeric variables that have conflicting value labels in the two data sets. (-precombine- will alert you to these.) In that situation, you should use -decode- to convert them to string variables, and then, after the two data sets are combined, use -encode- to convert that string variable into a new value-labeled numeric variable with a single, reconciled, value label.

    Comment


    • #3
      To add to this, "tostring" doesn't always seem to stick. The example below is copy/pasted from my results window, aside from text in brackets.

      . append using "[dataset2]"
      variable [var1] is byte in master but str1 in using data
      You could specify append's force option to ignore this numeric/string mismatch.
      The using variable would then be treated as if it contained numeric missing
      value.
      r(106);

      . tostring [var1], replace
      [var1] already string; no replace
      This is very strange, and I have never seen this happen before. If -append- thinls [var1] is byte in the master date, and if your very next command really is -tostring [var1], replace-, it should definitely not now tell you that [var1] is already string! Are you sure there were no commands between these that might have converted [var1]? Or swapped the data sets in memory? Can you post example data that reproduces this situation? (Use the -dataex- command to do that. Run -help dataex- if you are not familiar with it.)

      Comment

      Working...
      X