
  • How to count the words in a string variable and convert the data to one word per observation

    I have the following data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15 var1
    "a"              
    "a b"            
    "a b c d"        
    "a b c d e f"    
    "a b c d e f g h"
    "a b c"          
    "a b c d"        
    "a b"            
    "a"              
    "a b c d e f"    
    end

    I have two questions:
    Question 1. For each record, I need the number of letters in the string variable var1 (equivalently, the number of spaces plus one: the letters are separated by single spaces, with no spaces at the beginning or the end). How can I find this other than with the moss command, which I have already used successfully?

    Question 2. How can I convert the data into the following layout? Is there a good way to do this data manipulation?

    The target data is as follows:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float group2 str1 var2
     1 "a"
     2 "a"
     2 "b"
     3 "a"
     3 "b"
     3 "c"
     3 "d"
     4 "a"
     4 "b"
     4 "c"
     4 "d"
     4 "e"
     4 "f"
     5 "a"
     5 "b"
     5 "c"
     5 "d"
     5 "e"
     5 "f"
     5 "g"
     5 "h"
     6 "a"
     6 "b"
     6 "c"
     7 "a"
     7 "b"
     7 "c"
     7 "d"
     8 "a"
     8 "b"
     9 "a"
    10 "a"
    10 "b"
    10 "c"
    10 "d"
    10 "e"
    10 "f"
    end
    How can I do the above data conversion?
    Thank you very much.

  • #2
    Code:
    //  COUNT NUMBER OF SPACES (ADD 1 FOR THE NUMBER OF TOKENS)
    gen long number_of_spaces = strlen(var1) - strlen(subinstr(var1, " ", "", .))
    
    //  SPLIT AND GO LONG
    split var1, gen(c)
    drop var1
    gen long obs_no = _n
    reshape long c, i(obs_no)
    drop if missing(c)
    drop _j
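
    A minimal sketch of the "add 1" remark in the first comment above, assuming tokens are separated by exactly one space with no leading or trailing spaces, as stated in #1. The variable name number_of_tokens is introduced here only for illustration, and the comparison with wordcount() is just a sanity check.

    ```stata
    clear
    input str15 var1
    "a"
    "a b c d"
    "a b c d e f g h"
    end

    * spaces = total length minus length with all spaces removed
    gen long number_of_spaces = strlen(var1) - strlen(subinstr(var1, " ", "", .))

    * hypothetical helper variable: one more token than there are spaces
    gen long number_of_tokens = number_of_spaces + 1

    * holds only with single spaces between tokens and none at either end
    assert number_of_tokens == wordcount(var1)

    list
    ```

    If the strings could contain repeated, leading, or trailing spaces, the assert would fail, which is a useful warning before splitting.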



    • #3
      Thank you very much for your kind help.



      • #4
        A follow-up question:
        Because there are other strings that should not be split across observations, I would like to ask whether there is a way to split only specific records. My understanding was that split cannot be combined with an if condition; I tried using an if condition anyway, but could not get it to work.

        If the existing data is as follows:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte group str3 var2 str11 var1
        1 "A"   "A"          
        1 "B"   "B"          
        1 "str" "a"          
        2 "A"   "A"          
        2 "B"   "B"          
        2 "str" "a b"        
        3 "A"   "A"          
        3 "B"   "B"          
        3 "str" "a b c d"    
        4 "A"   "A"          
        4 "B"   "B"          
        4 "str" "a b c d e f"
        5 "A"   "A"          
        5 "B"   "B"          
        5 "str" "a b"        
        end
        The string in var1 on each line where var2 is "str" needs to be expanded downwards, one word per observation. The strings in var1 on the other lines, such as those where var2 is A or B, should remain unchanged. How can this be done? Can split be applied selectively? I tried:

        if var2 == "str" {
        split var1, gen(c)
        }

        But it does not seem to work: the string is not split.

        My target data is as follows:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte group str3 var2 str1 var1
        1 "A"   "A"
        1 "B"   "B"
        1 "str" "a"
        2 "A"   "A"
        2 "B"   "B"
        2 "str" "a"
        2 "str" "b"
        3 "A"   "A"
        3 "B"   "B"
        3 "str" "a"
        3 "str" "b"
        3 "str" "c"
        3 "str" "d"
        4 "A"   "A"
        4 "B"   "B"
        4 "str" "a"
        4 "str" "b"
        4 "str" "c"
        4 "str" "d"
        4 "str" "e"
        4 "str" "f"
        5 "A"   "A"
        5 "B"   "B"
        5 "str" "a"
        5 "str" "b"
        end
        How can I get from the existing data to the target data above?
        Thank you very much. I look forward to your reply.



        • #5
          #1 Note also

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str15 var1
          "a"              
          "a b"            
          "a b c d"        
          "a b c d e f"    
          "a b c d e f g h"
          "a b c"          
          "a b c d"        
          "a b"            
          "a"              
          "a b c d e f"    
          end
          
          gen wordcount = wordcount(var1)
          sort wordcount 
          list, sepby(wordcount)
          
               +----------------------------+
               |            var1   wordco~t |
               |----------------------------|
            1. |               a          1 |
            2. |               a          1 |
               |----------------------------|
            3. |             a b          2 |
            4. |             a b          2 |
               |----------------------------|
            5. |           a b c          3 |
               |----------------------------|
            6. |         a b c d          4 |
            7. |         a b c d          4 |
               |----------------------------|
            8. |     a b c d e f          6 |
            9. |     a b c d e f          6 |
               |----------------------------|
           10. | a b c d e f g h          8 |
               +----------------------------+
          #4 See https://www.stata.com/support/faqs/p...-if-qualifier/

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input byte group str3 var2 str11 var1
          1 "A"   "A"          
          1 "B"   "B"          
          1 "str" "a"          
          2 "A"   "A"          
          2 "B"   "B"          
          2 "str" "a b"        
          3 "A"   "A"          
          3 "B"   "B"          
          3 "str" "a b c d"    
          4 "A"   "A"          
          4 "B"   "B"          
          4 "str" "a b c d e f"
          5 "A"   "A"          
          5 "B"   "B"          
          5 "str" "a b"        
          end
          
          gen long obsno = _n 
          gen wc = cond(var2 == "str", wordcount(var1), 1) 
          expand wc 
          bysort obsno : gen wanted = word(var1, _n)
          
          list , sepby(obsno)
          
               +--------------------------------------------------+
               | group   var2          var1   obsno   wc   wanted |
               |--------------------------------------------------|
            1. |     1      A             A       1    1        A |
               |--------------------------------------------------|
            2. |     1      B             B       2    1        B |
               |--------------------------------------------------|
            3. |     1    str             a       3    1        a |
               |--------------------------------------------------|
            4. |     2      A             A       4    1        A |
               |--------------------------------------------------|
            5. |     2      B             B       5    1        B |
               |--------------------------------------------------|
            6. |     2    str           a b       6    2        a |
            7. |     2    str           a b       6    2        b |
               |--------------------------------------------------|
            8. |     3      A             A       7    1        A |
               |--------------------------------------------------|
            9. |     3      B             B       8    1        B |
               |--------------------------------------------------|
           10. |     3    str       a b c d       9    4        a |
           11. |     3    str       a b c d       9    4        b |
           12. |     3    str       a b c d       9    4        c |
           13. |     3    str       a b c d       9    4        d |
               |--------------------------------------------------|
           14. |     4      A             A      10    1        A |
               |--------------------------------------------------|
           15. |     4      B             B      11    1        B |
               |--------------------------------------------------|
           16. |     4    str   a b c d e f      12    6        a |
           17. |     4    str   a b c d e f      12    6        b |
           18. |     4    str   a b c d e f      12    6        c |
           19. |     4    str   a b c d e f      12    6        d |
           20. |     4    str   a b c d e f      12    6        e |
           21. |     4    str   a b c d e f      12    6        f |
               |--------------------------------------------------|
           22. |     5      A             A      13    1        A |
               |--------------------------------------------------|
           23. |     5      B             B      14    1        B |
               |--------------------------------------------------|
           24. |     5    str           a b      15    2        a |
           25. |     5    str           a b      15    2        b |
               +--------------------------------------------------+
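
          As a hedged illustration of the FAQ's point, the sketch below contrasts the if command tried in #4 with an if qualifier, using a shortened copy of the example data. The if command evaluates its expression once, and a bare variable name there refers to the first observation. As far as I know, split does accept an if qualifier, so selective splitting is possible, although the expand approach above reaches the target layout more directly.

          ```stata
          clear
          input byte group str3 var2 str11 var1
          1 "A"   "A"
          1 "str" "a"
          2 "A"   "A"
          2 "str" "a b"
          end

          * if command: evaluated once, so this tests var2[1] == "str",
          * which is false for these data
          if var2 == "str" {
              display "block runs once, for the whole dataset"
          }
          else {
              display "block skipped, because var2[1] is " var2[1]
          }

          * if qualifier: evaluated observation by observation;
          * c1, c2, ... stay empty where var2 is not "str"
          split var1 if var2 == "str", gen(c)

          list, sepby(group)
          ```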
          



          • #6
            Great idea, great program, you are amazing! Thank you very much



            • #7
              I suddenly thought: what about the reverse? Suppose the target data obtained above are now the original data and need to be converted back to the initial layout. That is, the vertical records

              3 "str" "a"
              3 "str" "b"
              3 "str" "c"
              3 "str" "d"

              should be converted to 3 "str" "a b c d", so that the four letters, separated by spaces, form a single record of variable var1.
              How can I reverse the operation and transform the data back? I have done this transformation before with a forvalues loop, but that method is cumbersome. I would like to know whether there is a program like yours that solves this problem. Thank you very much.
              Last edited by fu gang; 29 Jun 2022, 01:22.



              • #8
                To reverse the process, see https://journals.sagepub.com/doi/pdf...36867X20909698 for concatenation of observations.

                Here's a sketch.

                Code:
                clear
                input byte group str3 var2 str1 wanted
                1 "A"   "A"
                1 "B"   "B"
                1 "str" "a"
                2 "A"   "A"
                2 "B"   "B"
                2 "str" "a"
                2 "str" "b"
                3 "A"   "A"
                3 "B"   "B"
                3 "str" "a"
                3 "str" "b"
                3 "str" "c"
                3 "str" "d"
                4 "A"   "A"
                4 "B"   "B"
                4 "str" "a"
                4 "str" "b"
                4 "str" "c"
                4 "str" "d"
                4 "str" "e"
                4 "str" "f"
                5 "A"   "A"
                5 "B"   "B"
                5 "str" "a"
                5 "str" "b"
                end 
                
                gen long obsno = _n 
                gen which = sum(var2 != var2[_n-1])
                gen concat = wanted if var2 == "str" & var2[_n-1] != "str"
                replace concat = concat[_n-1] + " " + wanted if var2 == "str" & which == which[_n-1] & concat == "" 
                
                list 
                
                bysort which (obsno) : drop if _n < _N 
                replace concat = wanted if concat == "" 
                sort obsno 
                
                drop obsno which wanted 
                
                list 
                 
                     +----------------------------+
                     | group   var2        concat |
                     |----------------------------|
                  1. |     1      A             A |
                  2. |     1      B             B |
                  3. |     1    str             a |
                  4. |     2      A             A |
                  5. |     2      B             B |
                     |----------------------------|
                  6. |     2    str           a b |
                  7. |     3      A             A |
                  8. |     3      B             B |
                  9. |     3    str       a b c d |
                 10. |     4      A             A |
                     |----------------------------|
                 11. |     4      B             B |
                 12. |     4    str   a b c d e f |
                 13. |     5      A             A |
                 14. |     5      B             B |
                 15. |     5    str           a b |
                     +----------------------------+
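
                An alternative sketch of the same idea, not the code above: build the running concatenation under by: within each block of consecutive identical var2, then keep the last observation of each block. It assumes, as in the example, that a block of "str" rows never directly follows another block of "str" rows; otherwise the two blocks would be merged. The data are a shortened copy of the example.

                ```stata
                clear
                input byte group str3 var2 str1 wanted
                1 "A"   "A"
                1 "str" "a"
                2 "A"   "A"
                2 "str" "a"
                2 "str" "b"
                end

                gen long obsno = _n
                * block counter: increases whenever var2 changes
                gen long which = sum(var2 != var2[_n-1])

                * start each block with its first word, then append the rest in order
                bysort which (obsno) : gen concat = wanted if _n == 1
                by which : replace concat = concat[_n-1] + " " + wanted if _n > 1

                * the last observation of each block holds the complete string
                by which : keep if _n == _N

                sort obsno
                drop obsno which wanted
                list
                ```

                Because replace works through the observations in order, each concat[_n-1] is already filled in when it is used, and Stata promotes the string storage type as the values grow.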



                • #9
                  Great! Thank you very much.



                  • #10
                    After some thought, I came up with three ideas for solving the problem with loops, but my programs are not well written. Please lend a helping hand. Thank you.

                    The raw data are as follows:

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input byte group str3 keys str11 contents
                    1 "A"   "A"          
                    1 "B"   "B"          
                    1 "str" "a"          
                    2 "A"   "A"          
                    2 "B"   "B"          
                    2 "str" "a b"        
                    3 "A"   "A"          
                    3 "B"   "B"          
                    3 "str" "a b c d"    
                    4 "A"   "A"          
                    4 "B"   "B"          
                    4 "str" "a b c d e f"
                    5 "A"   "A"          
                    5 "B"   "B"          
                    5 "str" "a b"        
                    end
                    The target data are as follows:
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input byte group str3 keys str1 contents
                    1 "A"   "A"
                    1 "B"   "B"
                    1 "str" "a"
                    2 "A"   "A"
                    2 "B"   "B"
                    2 "str" "a"
                    2 "str" "b"
                    3 "A"   "A"
                    3 "B"   "B"
                    3 "str" "a"
                    3 "str" "b"
                    3 "str" "c"
                    3 "str" "d"
                    4 "A"   "A"
                    4 "B"   "B"
                    4 "str" "a"
                    4 "str" "b"
                    4 "str" "c"
                    4 "str" "d"
                    4 "str" "e"
                    4 "str" "f"
                    5 "A"   "A"
                    5 "B"   "B"
                    5 "str" "a"
                    5 "str" "b"
                    end

                    Idea one:
                    First use the split command to split the string at spaces. Then, for each string, insert one fewer new line than the number of words (because one line already exists). Then replace the original string with _g1, fill the next line with _g2, and so on, repeating according to the number of words until all of them are filled in.


                    count if keys== "str"
                    local tol= r(N)+_N
                    split contents, gen(_g)
                    forvalues n=1(1)`tol' {
                    if keys== "str" {
                    local wc = wordcount(contents[`n'])-1
                    if `wc'>= 1{
                    insobs `wc', after(`n')
                    }
                    replace contents = _g1 if keys== "str" // this statement does not need to be inside the loop, but I don't know how to handle it
                    forvalues b=2/`wc' {
                    replace contents[`=`n'+3-`b''] = _g`b' if keys== "str" // error: weights not allowed. The aim is to replace contents[_n+1], contents[_n+2], contents[_n+3] with _g2, _g3, _g4, ... in turn until all words are filled in
                    }
                    }
                    }
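
                    On the "weights not allowed" error in idea one: a subscript such as contents[`n'] is accepted inside an expression on the right-hand side, but square brackets directly after the variable name in replace appear to be parsed as a weight specification. A minimal sketch of changing single observations with the in qualifier instead, using a shortened copy of the data; the local names n and target are hypothetical and n is fixed at one row for illustration.

                    ```stata
                    clear
                    input byte group str3 keys str11 contents
                    1 "A"   "A"
                    1 "str" "a b c"
                    2 "A"   "A"
                    end

                    * hypothetical: the row to expand
                    local n = 2
                    local wc = wordcount(contents[`n'])

                    if `wc' > 1 {
                        * one new row per extra word, directly below row n
                        insobs `=`wc' - 1', after(`n')
                        forvalues b = 2/`wc' {
                            local target = `n' + `b' - 1
                            * -in- selects the single observation to change
                            replace group    = group[`n'] in `target'
                            replace keys     = keys[`n'] in `target'
                            replace contents = word(contents[`n'], `b') in `target'
                        }
                        * finally cut row n itself down to its first word
                        replace contents = word(contents, 1) in `n'
                    }

                    list, sepby(group)
                    ```

                    Looping this over all rows needs care, because every insobs shifts the positions of the rows below it; the expand approach in #5 avoids that bookkeeping.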




                    Idea two:
                    Use the ends() function of the egen command to split the string at the first space into two parts, stored in separate variables. Replace the original string with the part before the space (the first word), insert a line below it, and fill that line with the part after the space. Then split the remainder at its first space in the same way, and repeat until everything is filled in.

                    count if keys== "str"
                    local tol= r(N)+_N
                    forvalues n=1(1)`tol' {
                    if keys[`n']== "str" {
                    insobs 1, after(`n')
                    }
                    }

                    local wc = wordcount(contents[`n'])

                    egen contents2 = ends(contents),punct(" ")
                    egen contents3 = ends(contents),punct(" ") tail // Split the string in the contents variable into two parts according to the first space, and then loop
                    replace contents = contents2 if keys == "str"
                    replace contents[_n+1] = contents3[_n] if keys[_n+1] == "" // error weights not allowed
                    drop contents2 contents3
                    ……
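
                    A minimal sketch of just the splitting step in idea two, assuming the documented behaviour of egen's ends() function: head (the default) returns what precedes the first space, or the whole string if there is no space, and tail returns what follows the first space, or an empty string if there is none. The variable names first and rest are hypothetical.

                    ```stata
                    clear
                    input str11 contents
                    "a"
                    "a b"
                    "a b c d"
                    end

                    * part before the first space (the first word)
                    egen first = ends(contents), punct(" ") head

                    * part after the first space (the remaining words)
                    egen rest = ends(contents), punct(" ") tail

                    list
                    ```

                    For one-word strings rest should come out empty, which could serve as the stopping condition if the split is repeated.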


                    Idea three:
                    Similar to idea two: use regular expressions to match the word before the first space and the words after it, store them in local macros, and then insert them in a loop according to the number of words. This method avoids generating and dropping variables.

                    local first = ustrregexs(1) if ustrregexm(contents,("\w+") // matches the word before the first space, but I don't get the regex to match
                    local tail = ustrregexs(2) if ustrregexm(contents,(?) ) // matches the word after the first space
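
                    A hedged sketch of the matching step in idea three. A local macro holds one value and the local command takes no if qualifier, so the per-observation results are stored here in variables instead; first and rest are hypothetical names, and the patterns are one possible choice, not the only one.

                    ```stata
                    clear
                    input str11 contents
                    "a"
                    "a b"
                    "a b c d"
                    end

                    * group 1: the run of non-space characters at the start
                    gen first = ustrregexs(1) if ustrregexm(contents, "^(\S+)")

                    * group 2: everything after the first run of spaces;
                    * no match, hence an empty result, for one-word strings
                    gen rest = ustrregexs(2) if ustrregexm(contents, "^(\S+)\s+(.*)")

                    list
                    ```

                    For a single observation, something like local first = word(contents[`n'], 1) would put one word into a local without any regular expression.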



                    Thank you. Could you please check whether my ideas can work? Any kind of solution would be very helpful. How can I improve the programs above? I look forward to your help. Thank you very much.
