  • String manipulation (loop over letters in a string)

    I'd appreciate advice on how to loop over the letters in a string in Stata. I understand it can be done in Excel, but I thought it would be nice to learn how to do it in Stata. The problem comes from generating a check digit for ISIN codes. For example, suppose the first 11 characters of the ISIN are US037833100 and I want to generate the check digit following the algorithm below (copied from http://stackoverflow.com/questions/3...nto-isin-codes):
    1. Convert any letters to numbers: U = 30, S = 28. US037833100 -> 3028037833100.
    2. Collect odd and even characters: 3028037833100 = (3, 2, 0, 7, 3, 1, 0), (0, 8, 3, 8, 3, 0)
    3. Multiply the group containing the rightmost character (which is the FIRST group) by 2: (6, 4, 0, 14, 6, 2, 0)
    4. Add up the individual digits: (6 + 4 + 0 + (1 + 4) + 6 + 2 + 0) + (0 + 8 + 3 + 8 + 3 + 0) = 45
    5. Take the 10s modulus of the sum: 45 mod 10 = 5
    6. Subtract from 10: 10 - 5 = 5
    7. Take the 10s modulus of the result (this final step is important in the instance where the modulus of the sum is 0, as the resulting check digit would be 10). 5 mod 10 = 5
    So the ISIN check digit is 5.
    Thanks in advance for any input!
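
For a single code, the seven steps above can be sketched as a loop over character positions using local macros. This is only a minimal sketch, not a tested implementation: the macro names (`isin11`, `alphabet`, `digits`, `sum`, `check`) are made up for illustration, and it assumes the letters are upper case and mapped as A = 10, ..., Z = 35, which is consistent with U = 30 and S = 28 in step 1.

```stata
// Hypothetical single code, held in a local macro (no dataset needed)
local isin11 "US037833100"
local alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

// Step 1: loop over the characters, expanding letters to A = 10, ..., Z = 35
local digits ""
local len = strlen("`isin11'")
forvalues i = 1/`len' {
    local ch = substr("`isin11'", `i', 1)
    local p = strpos("`alphabet'", "`ch'")
    if `p' > 0 {
        local v = `p' + 9
        local digits "`digits'`v'"
    }
    else {
        local digits "`digits'`ch'"
    }
}
display "expanded: `digits'"

// Steps 2-4: walk from the right, doubling every other digit starting
// with the rightmost, and add up the individual digits of each result
local n = strlen("`digits'")
local sum = 0
forvalues k = 1/`n' {
    local d = real(substr("`digits'", `n' - `k' + 1, 1))
    if mod(`k', 2) == 1 {
        local d = 2 * `d'
    }
    local sum = `sum' + floor(`d'/10) + mod(`d', 10)
}

// Steps 5-7: reduce modulo 10, subtract from 10, reduce modulo 10 again
local check = mod(10 - mod(`sum', 10), 10)
display "check digit: `check'"
```

For US037833100 this displays the expanded string 3028037833100 and the check digit 5, matching the worked example. Counting positions from the right means the sketch does not depend on the expanded string having exactly 13 digits.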

  • #2
    There was a somewhat similar question relating to SEDOL identifiers a while back on the forum. The details were different but the general issues were the same.

    From a quick Google search I gather that all ISIN codes are 12 characters long, with the first two being alphabetic and the last ten numeric, the last one being the check digit you want. I will presume that this is correct in what follows below. Let's call the variable containing the first 11 characters of the code isin11.
    Code:
    clear*
    input str11 isin11
    US037833100
    end
    
    local spaceless_alphabet `c(ALPHA)'
    local spaceless_alphabet: subinstr local spaceless_alphabet " " "", all
    display "`spaceless_alphabet'"
    
    // VERIFY ISIN IS 11 CHARACTERS LONG
    assert length(isin11) == 11
    
    // CREATE 13 SINGLE-DIGIT VARIABLES TO HOLD THE WORKING DIGITS
    
    // THE FIRST TWO COME FROM POSITION OF LETTERS IN THE ALPHABET
    gen pos1 = strpos("`spaceless_alphabet'", substr(isin11, 1, 1)) + 9
    gen char1 = floor(pos1/10)
    gen char2 = mod(pos1, 10)
    gen pos2 = strpos("`spaceless_alphabet'", substr(isin11, 2, 1)) + 9
    gen char3 = floor(pos2/10)
    gen char4 = mod(pos2, 10)
    assert inrange(char1, 1, 3) & inrange(char3, 1, 3) & inrange(char2, 0, 9) & inrange(char4, 0, 9)
    forvalues i = 3/11 {
        local j = `i'+2
        gen char`j' = substr(isin11, `i', 1)
    }
    destring char5-char13, replace
    
    // DOUBLE THE ODD-NUMBERED CHARACTERS
    forvalues i = 1(2)13 {
        replace char`i' = 2*char`i'
    }
    
    // ADD UP THE DIGITS OF ALL OF THESE
    gen checksum = 0
    forvalues i = 1/13 {
        replace checksum = checksum + mod(char`i', 10) + floor(char`i'/10)
    }
    
    // REDUCE MODULO 10 AND SUBTRACT FROM 10
    replace checksum = 10 - mod(checksum, 10)
    
    // AND REDUCE THAT MODULO 10
    replace checksum = mod(checksum, 10)
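
The same steps can also be written as loops over character positions of a string variable, so that several codes are processed at once. The sketch below is a variant, not the code above: the variable names (`digits`, `check`, `expected`) are invented here, upper-case letters are assumed, and the three extra codes are illustrative inputs whose expected digits were worked out by hand from the steps in the first post.

```stata
clear
input str11 isin11 byte expected
"US037833100" 5
"US594918104" 5
"US023135106" 7
"GB000263494" 6
end

local alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

// Expand each character: letters become 10..35, digits are copied unchanged
gen str22 digits = ""
forvalues i = 1/11 {
    gen byte p = strpos("`alphabet'", substr(isin11, `i', 1))
    replace digits = digits + cond(p > 0, string(p + 9), substr(isin11, `i', 1))
    drop p
}

// Walk from the right, doubling every other digit starting with the
// rightmost, and accumulate the individual digits of each result
gen int checksum = 0
forvalues k = 1/22 {
    gen byte d = real(substr(digits, strlen(digits) - `k' + 1, 1)) if `k' <= strlen(digits)
    replace d = 2*d if mod(`k', 2) == 1
    replace checksum = checksum + floor(d/10) + mod(d, 10) if !missing(d)
    drop d
}

// Reduce modulo 10, subtract from 10, and reduce modulo 10 again
gen byte check = mod(10 - mod(checksum, 10), 10)
list isin11 digits check expected
assert check == expected
```

Because the loop counts from the right end of `digits`, it does not rely on the expanded string being exactly 13 digits long, which may matter if letters can occur after the first two characters.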
    NOTES:

    1. This works correctly for your example. I have not tested it beyond that, but I believe it is right.

    2. Whereas I explicitly verify that the code contains 11 characters and that the first two characters convert to numbers properly, I didn't explicitly verify that the remaining characters are all digits. But the -destring- command will complain and abort if that fails.

    3. One caveat: I'm not sure I fully understand the ISIN process for converting letters to numbers. I would have expected A = 1, B = 2, ... Z = 26, but apparently they start with A = 10, B = 11, etc. So I just added 9 to the alphabetic location of the letter. I hope that's what they mean by that.
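
Notes 2 and 3 can both be checked directly. The sketch below is an illustration under stated assumptions: `value1` and `value2` are hypothetical names, letters are assumed to be upper case, and the pattern assumes two letters followed only by digits. The A = 10 reading agrees with the example in the first post, where U = 30 and S = 28.

```stata
clear
input str11 isin11
"US037833100"
end

local alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

// Note 3: letter values under the A = 10, ..., Z = 35 reading
gen byte value1 = strpos("`alphabet'", substr(isin11, 1, 1)) + 9
gen byte value2 = strpos("`alphabet'", substr(isin11, 2, 1)) + 9
assert value1 == 30 & value2 == 28   // U = 30, S = 28, as in the first post

// Note 2: explicit format check, two capital letters followed by digits only,
// 11 characters in total
assert regexm(isin11, "^[A-Z][A-Z][0-9]+$") & strlen(isin11) == 11
```

An explicit check up front fails at a clearly labelled line instead of depending on how -destring- handles non-numeric input.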




    • #3
      I wanted to know this too, and Clyde gives a nice solution. By the way, I also want to ask how to "deseasonalize seasonal data", or what "data is deseasonalized" means.
      Last edited by 高佳; 24 Jan 2016, 20:39.



      • #4
        I also want to ask how to "deseasonalize seasonal data" or what does "data is deseasonalized" mean?
        That question is at best tangentially related to the original topic of this thread. Please start a new post when raising a new topic.



        • #5
          Originally posted by Clyde Schechter

          That question is at best tangentially related to the original topic of this thread. Please start a new post when raising a new topic.
          Thanks for your clarification Clyde, I'll start a new one.



          • #6
            Clyde, many thanks for the solution!
