  • method to check if two variables are identical

    Hi everyone,

    I am looking for a method to check if two (or more) variables are identical to each other (i.e. if the content is identical), independently of the data type (float, byte, str, etc.). I could do the check visually, but on a database with more than 7 million observations, I can't.

    Does anyone have a little trick for this? It would save me a lot of time (and a few gray hairs, by the way).

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long i float t byte c str2 c_cod
    1415 724 10 "FR"
    1415 725 10 "FR"
    1415 726 10 "FR"
    1415 727 10 "FR"
    1415 728 10 "FR"
    1415 729 10 "FR"
    1415 730 10 "FR"
    1415 731 10 "FR"
    1415 732 10 "FR"
    1415 733 10 "FR"
    1415 734 10 "FR"
    1415 735 10 "FR"
    1415 736 10 "FR"
    1415 737 10 "FR"
    1477 712  2 "BE"
    1477 713  2 "BE"
    1477 714  2 "BE"
    1477 727  2 "BE"
    1477 728  2 "BE"
    1477 729  2 "BE"
    1477 730  2 "BE"
    1477 731  2 "BE"
    1477 732  2 "BE"
    1477 733  2 "BE"
    1477 734  2 "BE"
    end
    format %tm t
    label values c clab
    label def clab 2 "BE", modify
    label def clab 10 "FR", modify
    label def clab 22 "VD", modify
    For my variable c, the value labels are defined as shown in the code above: 10 corresponds to FR and 2 to BE. In theory, the content of `c` is supposed to be "the same" as that of `c_cod`.
    Thank you in advance!

    Best regards,

  • #2
    I don't have formal code to suggest, but if I had to check, I'd collapse and then check for duplicates:

    Code:
    collapse (count) i, by(c c_cod)
    duplicates report c
    duplicates report c_cod
    If the two variables form unique pairs, then after aggregation neither of them should contain duplicates.
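
A non-destructive variant of the same idea, offered only as a sketch: `collapse` replaces the data in memory, so on a large dataset it may be convenient to wrap the check in `preserve`/`restore`. The small dataset below is a subset of the example in #1. `contract` and `isid` are used here in place of `collapse` and `duplicates report`; note that `isid` stops with an error when the check fails, and also when the variable has missing values unless `missok` is specified.

```stata
* A few rows taken from the data example in #1
clear
input byte c str2 c_cod
10 "FR"
10 "FR"
2  "BE"
2  "BE"
end

preserve
    * Reduce the data to one observation per distinct (c, c_cod) pair
    contract c c_cod
    * Each command errors out if the variable repeats across pairs
    isid c
    isid c_cod
restore
```

If both `isid` commands pass silently, every value of `c` is paired with exactly one value of `c_cod` and vice versa, and the original observations are back in memory after `restore`.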


    • #3
      Ken Chui: Thank you very much! I will try the code and keep you posted.

      Thank you again.

      Michael


      • #4
        Hi Ken Chui:

        I confirm that your method works well.
        Thanks again.

        Best,

        Michael


        • #5
          See also an FAQ on this topic at https://www.stata.com/support/faqs/d...ions-in-group/

          Here is some code:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long i float t byte c str2 c_cod
          1415 724 10 "FR"
          1415 725 10 "FR"
          1415 726 10 "FR"
          1415 727 10 "FR"
          1415 728 10 "FR"
          1415 729 10 "FR"
          1415 730 10 "FR"
          1415 731 10 "FR"
          1415 732 10 "FR"
          1415 733 10 "FR"
          1415 734 10 "FR"
          1415 735 10 "FR"
          1415 736 10 "FR"
          1415 737 10 "FR"
          1477 712  2 "BE"
          1477 713  2 "BE"
          1477 714  2 "BE"
          1477 727  2 "BE"
          1477 728  2 "BE"
          1477 729  2 "BE"
          1477 730  2 "BE"
          1477 731  2 "BE"
          1477 732  2 "BE"
          1477 733  2 "BE"
          1477 734  2 "BE"
          end
          format %tm t
          label values c clab
          label def clab 2 "BE", modify
          label def clab 10 "FR", modify
          label def clab 22 "VD", modify
          
          bysort c (c_cod) : gen diff = c_cod[1] != c_cod[_N]
          
          list if diff
          and for the data example there is no output, which is fine. No news is good news.
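
The code in #5 flags values of `c` that occur with more than one value of `c_cod`. A sketch of the check in both directions follows, again on a subset of the data from #1. The variable names `diff_c` and `diff_cod` are made up for this illustration.

```stata
* A few rows taken from the data example in #1
clear
input byte c str2 c_cod
10 "FR"
10 "FR"
2  "BE"
2  "BE"
end

* Flag values of c that occur with more than one value of c_cod
bysort c (c_cod) : gen byte diff_c = c_cod[1] != c_cod[_N]

* Flag values of c_cod that occur with more than one value of c
bysort c_cod (c) : gen byte diff_cod = c[1] != c[_N]

* No output here means the two variables define the same groups
list if diff_c | diff_cod
```

Sorting on the second variable within each group puts its smallest and largest values first and last, so comparing the two ends is enough to detect any variation within the group.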


          • #6
            In #1, changing

            Code:
            label def clab 10 "FR", modify
            label def clab 22 "VD", modify
            to

            Code:
            label def clab 22 "FR", modify
            label def clab 10 "VD", modify
            will not make any difference to the code suggested in #2 (Edit: or #5). The code only checks whether both variables define the same groups; it does nothing to check the underlying values. Whether that is a problem is not clear, because the initial post does not define what is meant by "identical". By definition, string variables and numeric variables cannot be identical. Does identical mean that values of c_cod must not vary within values of c (and the other way round), or do the value labels have to match the string values, too?
            Last edited by daniel klein; 05 May 2023, 01:42.
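
For the stricter reading raised in #6, where the value labels of `c` must also match the strings in `c_cod`, one possible sketch is to decode the labels into a string variable and compare directly. The name `c_txt` is made up for this illustration; the data are a subset of the example in #1. Values of `c` without a label are decoded to an empty string, so they would show up as mismatches.

```stata
* A few rows taken from the data example in #1
clear
input byte c str2 c_cod
10 "FR"
10 "FR"
2  "BE"
2  "BE"
end
label define clab 2 "BE" 10 "FR" 22 "VD"
label values c clab

* Write the value-label text of c into a new string variable
decode c, generate(c_txt)

* List and count observations where label text and c_cod disagree
list c c_cod c_txt if c_txt != c_cod
count if c_txt != c_cod
```

With the swapped label definitions shown in #6, this check would report mismatches, whereas the group-based checks would not.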


            • #7
              Hi Nick Cox: Thanks for the nice suggestion and for the Stata FAQ resource.

              Hi daniel klein: I apologize for the confusion. By identical I mean option 1 that you suggested: that the values of c_cod should not vary within the values of c (and vice versa).

              Best,

              Michael


              • #8
                daniel klein has a great point, but I think the close of #1 is a better guide to the real question than the title of the thread. I read it as whether the values of two variables are in one-to-one correspondence. There is certainly no sense to saying that "BE" is identical to (*) 2.

                (*) or "identical with", a phrasing often commended that sounds very strange to me, although English is my first language.

                EDIT Crossed with #7, which confirms the guess here.
                Last edited by Nick Cox; 05 May 2023, 03:27.
