Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mismathing data when summarizing a variable

    When I try to summarize data the information that the functio returns to me doesn't match with what is disclosed on the Data Editor window. The data was imported from excel, then I converted the variable (which came as a string) to a 'double' type. Below I show some screenshots so it's possible to visualize it:

    Attached Files

  • #2
    It looks like you "...converted the variable (which came as a string) to a 'double' type" using -encode-, a fatal error. Go back to the beginning and convert it using -destring- and you will get sensible results.

    The -encode- command should never be used on string variables that look like numbers.String variables looking like numbers should always be converted to numeric using -destring-; otherwise you get the kind of nonsense shown in #1. -encode- should be used to assign consecutive numbers starting with 1 to the values of string variables that look like names of categories.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Added: Perhaps you already tried -destring- and were told -cannot be converted reversibly; no replace-. So that is what led you to think that perhaps you should try -encode-. -destring- is pretty fussy (as it should be). Most non-integer decimal numbers (including most of the values shown as labels in your example) cannot be exactly represented in binary with a finite number of digits. This is just like 1/3 cannot be represented as a finite decimal number. So -destring- will not allow the conversion of numbers like 4.35 to numeric because to do so would be to lose precision in the low order places. (That is, the nearest possible binary number is not exactly 4.35, it is probably something like 4.34999999999999995.) So -destring- will not do that. There are two ways around this problem, a good way and a bad way.

    The bad, lazy, way is to immediately use the -force- option with -destring-. This will tell -destring- to ignore any problems it encounters and just do its best. That may sound fine. But what if, in addition to these precision issues that are probably too small to matter for any practical purpose anyway, somewhere in your data set one of the values is just an invalid number. Maybe instead of 4.35 somebody typed 4.3.5, or there is something like N/A, or somebody had caps lock on and typed $>#>%. Wouldn't you want to know about that and fix it first.

    So the good way to work around this problem is to run:
    Code:
    list my_string_variable if missing(real(my_string_variable)) & !missing(my_string_variable)
    This will show you all and only the values of your string variable that are problematic by virtue of not being valid numbers--but it won't bother you with precision issues. Then you can fix up the invalid numbers, thereby improving the quality of your data set, and then run -destring- with the -force- option, or, if all of the problems are arising from things like N/A, or the use of commas to separate 3-digit groups, etc., you can deal with those using -destring-'s -ignore()- option, without resorting to the use of -force-.

    (Actually, since it's pretty easy to miss something if the list of invalid numbers needing corrections is long, to be sure you have caught them all, you should run
    Code:
    assert missing(real(my_string_variable)) == missing(my_string_variable)
    as a double-check that there are no longer any invalid numerical entries before proceeding with -destring, force-.)
    Last edited by Clyde Schechter; 05 Mar 2023, 14:52.

    Comment

    Working...
    X