
  • How to know the exact number of values/letters of a strange variable?

    Hi all,

    I am having difficulty finding the exact set of values/letters in a strange variable. For example, the first observation contains 0 and T, and the second contains B, P and 0, so the distinct values and letters in the first two observations are 0, B, P and T. I would like to know all of the unique values and letters across observations. Any help is much appreciated. Thanks!

    Code:
    clear
    input str80 var
    "       000000000000000000000000000000000000000000000000000000000000000000000000T"
    "       BPPPPPPPP0000000000000000000000000000000000000000000000BPPPPPPPP000000000"
    "       0000000000000000000BPPPPPPPP000000000000000000000000BPPPPPPPP000000000000"
    "       PPPPPPP00000000BPPPPPPPP0000000000000000000000000BPPPPPPPP000000000000000"
    "       000000000000000000BPPPPPPPP000000000000000000000BPPPPPPPP000000000000000B"
    "       00000000000000000000BPPPPPPPP0000000000000000000000BPPPPPPPP0000000000000"
    "       00000000000000000000000000000000000000BPPPPPPPP00000000000000000000000000"
    "      0000000000000000000000000000000000000000000000000000000000000TP00000000000"
    "      PP0000000000000000000000000000BPPPPPPPP0000000000TPPPPPPPP0000BPPPPPPPP000"
    "      0000000000000000000000000000000NNNNNNNNNNNNNN000000BPPPPPPPP00000000000000"
    end

  • #2
    Are there other possibilities beyond those that occur in the example? I start optimistically with the ideas that

    1. You know the possibilities in advance and the list is short.

    2. You aren't interested in spaces.

    This is standard fooling with standard functions. I count characters by looking for how much the length would decrease if all occurrences were deleted. More at https://journals.sagepub.com/doi/pdf...867X1101100212

    For unique read distinct.

    Code:
    clear
    input str80 var
    "       000000000000000000000000000000000000000000000000000000000000000000000000T"
    "       BPPPPPPPP0000000000000000000000000000000000000000000000BPPPPPPPP000000000"
    "       0000000000000000000BPPPPPPPP000000000000000000000000BPPPPPPPP000000000000"
    "       PPPPPPP00000000BPPPPPPPP0000000000000000000000000BPPPPPPPP000000000000000"
    "       000000000000000000BPPPPPPPP000000000000000000000BPPPPPPPP000000000000000B"
    "       00000000000000000000BPPPPPPPP0000000000000000000000BPPPPPPPP0000000000000"
    "       00000000000000000000000000000000000000BPPPPPPPP00000000000000000000000000"
    "      0000000000000000000000000000000000000000000000000000000000000TP00000000000"
    "      PP0000000000000000000000000000BPPPPPPPP0000000000TPPPPPPPP0000BPPPPPPPP000"
    "      0000000000000000000000000000000NNNNNNNNNNNNNN000000BPPPPPPPP00000000000000"
    end
    
    gen occur = ""
    
    foreach c in 0 B N P T {
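        // count = drop in length when every occurrence is removed;
        // strlen(var) avoids hard-coding the declared width of 80
        gen count`c' = strlen(var) - strlen(subinstr(var, "`c'", "", .))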
        gen any`c' = strpos(var, "`c'") > 0
        replace occur = occur + "`c'" if any`c'
    }  
    
    list occur count* any*
    
         +---------------------------------------------------------------------------------------+
         | occur   count0   countB   countN   countP   countT   any0   anyB   anyN   anyP   anyT |
         |---------------------------------------------------------------------------------------|
      1. |    0T       72        0        0        0        1      1      0      0      0      1 |
      2. |   0BP       55        2        0       16        0      1      1      0      1      0 |
      3. |   0BP       55        2        0       16        0      1      1      0      1      0 |
      4. |   0BP       48        2        0       23        0      1      1      0      1      0 |
      5. |   0BP       54        3        0       16        0      1      1      0      1      0 |
         |---------------------------------------------------------------------------------------|
      6. |   0BP       55        2        0       16        0      1      1      0      1      0 |
      7. |   0BP       64        1        0        8        0      1      1      0      1      0 |
      8. |   0PT       72        0        0        1        1      1      0      0      1      1 |
      9. |  0BPT       45        2        0       26        1      1      1      0      1      1 |
     10. |  0BNP       51        1       14        8        0      1      1      1      1      0 |
         +---------------------------------------------------------------------------------------+
    Last edited by Nick Cox; 17 Nov 2023, 07:56.


    • #3
      The number of distinct characters present is naturally given by the length of occur or the observation (row) sum or total of the any*.
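      A minimal sketch of both routes, assuming occur and the any* indicators created in #2 are still in memory. The names ndistinct and nany are made up here for illustration.

```stata
* Assumes occur and the any* indicators from #2 exist in memory.
* ndistinct and nany are illustrative names, not from the thread.
gen ndistinct = strlen(occur)
egen nany = rowtotal(any*)

* Both routes should agree in every observation.
assert ndistinct == nany
list occur ndistinct nany
```

      Either variable alone would do; computing both is just a cheap consistency check.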


      • #4
        I guess strange here is autocorrect for string!


        • #5
          Dear Nick,

          Thank you so much for your help. My replies to #2 are below:
          1) I don't know the possibilities in advance; that is exactly what I am looking for. I want to know the distinct values of the variable. In addition to 0, B, N, P and T, var may contain several other letters in the full dataset. Is it possible to obtain them all?
          2) Yes, I am not interested in spaces.

          Thank you.


          • #6
            Also, see -chartab- from SSC.
            Code:
             ssc describe chartab
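
            If installing from SSC is not an option, the distinct characters can also be collected with official commands only. What follows is a minimal sketch under stated assumptions, not a replacement for chartab. It uses toy data in the style of #1; the variable names len and ch and the local macro found are hypothetical. substr() works byte by byte, so non-ASCII text would need usubstr() and ustrlen() (Stata 14 and later) instead, and characters such as double quotes would need extra care with levelsof.

```stata
clear
input str10 var
"   000T"
"   BPP0"
"  0NN0B"
end

* The longest string determines how many positions to scan.
gen len = strlen(var)
summarize len, meanonly
local maxlen = r(max)

* At each position, collect the characters seen, skipping spaces and
* positions beyond the end of shorter strings.
local found
forvalues j = 1/`maxlen' {
    gen str1 ch = substr(var, `j', 1)
    quietly levelsof ch if !inlist(ch, "", " "), local(chars) clean
    local found : list found | chars
    drop ch
}

* Sort the collected characters and show them.
local found : list sort found
display "`found'"
```

            Run as a single do-file, this should display 0 B N P T for the toy data. The loop makes one pass per character position, which is fine for strings of width 80 but slow for very long strings.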


            • #7
              Originally posted by Bjarte Aagnes
              also, see -chartab-
              Code:
               ssc describe chartab
              Thank you so much, Bjarte. This is what I need.
