How to do the count and data conversion of the number of strings in the following data

fu gang

Join Date: Jan 2021

Posts: 138
#1


28 Jun 2022, 22:07

I have a dataset. The existing data are as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str15 var1
"a"
"a b"
"a b c d"
"a b c d e f"
"a b c d e f g h"
"a b c"
"a b c d"
"a b"
"a"
"a b c d e f"
end

I have two questions:
Question 1. For each record, I need the number of letters in the string held in var1 (or the number of spaces; the letters are separated by spaces, with no spaces at the beginning or the end). How can I find this by some means other than the moss command, which I have already used successfully?

Question 2. How can I convert the data into the layout shown below? Is there a good way to do this kind of data manipulation?

The target data is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float group2 str1 var2
1 "a"
2 "a"
2 "b"
3 "a"
3 "b"
3 "c"
3 "d"
4 "a"
4 "b"
4 "c"
4 "d"
4 "e"
4 "f"
5 "a"
5 "b"
5 "c"
5 "d"
5 "e"
5 "f"
5 "g"
5 "h"
6 "a"
6 "b"
6 "c"
7 "a"
7 "b"
7 "c"
7 "d"
8 "a"
8 "b"
9 "a"
10 "a"
10 "b"
10 "c"
10 "d"
10 "e"
10 "f"
end

How can I do this conversion?
Thank you very much.

Clyde Schechter

Join Date: Apr 2014
Posts: 30111

28 Jun 2022, 22:19

Code:

//  COUNT NUMBER OF SPACES (ADD 1 FOR THE NUMBER OF TOKENS)
gen long number_of_spaces = strlen(var1) - strlen(subinstr(var1, " ", "", .))

//  SPLIT AND GO LONG
split var1, gen(c)
drop var1
gen long obs_no = _n
reshape long c, i(obs_no)
drop if missing(c)
drop _j


fu gang

Join Date: Jan 2021

Posts: 138
#3

28 Jun 2022, 22:46

Thank you very much for your kind help.

fu gang

Join Date: Jan 2021

Posts: 138
#4

28 Jun 2022, 23:20

A follow-up question:
Some records contain strings that should not be expanded downwards, so I would like to know whether only specific records can be split. My understanding is that split cannot be combined with an if conditional statement; I tried using an if condition, but it did not achieve what I wanted.

If the existing data is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte group str3 var2 str11 var1
1 "A"   "A"
1 "B"   "B"
1 "str" "a"
2 "A"   "A"
2 "B"   "B"
2 "str" "a b"
3 "A"   "A"
3 "B"   "B"
3 "str" "a b c d"
4 "A"   "A"
4 "B"   "B"
4 "str" "a b c d e f"
5 "A"   "A"
5 "B"   "B"
5 "str" "a b"
end

In the rows where var2 is "str", the string in var1 needs to be expanded downwards, one word per row. In the other rows, such as those where var2 is "A" or "B", var1 should remain unchanged. Can split be applied selectively? I tried:

if var2 == "str" {
split var1, gen(c)
}

But this does not seem to work: the string is not split.

My target data is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte group str3 var2 str1 var1
1 "A"   "A"
1 "B"   "B"
1 "str" "a"
2 "A"   "A"
2 "B"   "B"
2 "str" "a"
2 "str" "b"
3 "A"   "A"
3 "B"   "B"
3 "str" "a"
3 "str" "b"
3 "str" "c"
3 "str" "d"
4 "A"   "A"
4 "B"   "B"
4 "str" "a"
4 "str" "b"
4 "str" "c"
4 "str" "d"
4 "str" "e"
4 "str" "f"
5 "A"   "A"
5 "B"   "B"
5 "str" "a"
5 "str" "b"
end

Given the existing data, how can I convert them to the target data above?
Thank you very much. I look forward to your reply.

Nick Cox

Join Date: Mar 2014
Posts: 35709

28 Jun 2022, 23:54

#1 Note also

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str15 var1
"a"              
"a b"            
"a b c d"        
"a b c d e f"    
"a b c d e f g h"
"a b c"          
"a b c d"        
"a b"            
"a"              
"a b c d e f"    
end

gen wordcount = wordcount(var1)
sort wordcount 
list, sepby(wordcount)

     +----------------------------+
     |            var1   wordco~t |
     |----------------------------|
  1. |               a          1 |
  2. |               a          1 |
     |----------------------------|
  3. |             a b          2 |
  4. |             a b          2 |
     |----------------------------|
  5. |           a b c          3 |
     |----------------------------|
  6. |         a b c d          4 |
  7. |         a b c d          4 |
     |----------------------------|
  8. |     a b c d e f          6 |
  9. |     a b c d e f          6 |
     |----------------------------|
 10. | a b c d e f g h          8 |
     +----------------------------+

#4 See https://www.stata.com/support/faqs/p...-if-qualifier/

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte group str3 var2 str11 var1
1 "A"   "A"          
1 "B"   "B"          
1 "str" "a"          
2 "A"   "A"          
2 "B"   "B"          
2 "str" "a b"        
3 "A"   "A"          
3 "B"   "B"          
3 "str" "a b c d"    
4 "A"   "A"          
4 "B"   "B"          
4 "str" "a b c d e f"
5 "A"   "A"          
5 "B"   "B"          
5 "str" "a b"        
end

gen long obsno = _n 
gen wc = cond(var2 == "str", wordcount(var1), 1) 
expand wc 
bysort obsno : gen wanted = word(var1, _n)

list , sepby(obsno)

     +--------------------------------------------------+
     | group   var2          var1   obsno   wc   wanted |
     |--------------------------------------------------|
  1. |     1      A             A       1    1        A |
     |--------------------------------------------------|
  2. |     1      B             B       2    1        B |
     |--------------------------------------------------|
  3. |     1    str             a       3    1        a |
     |--------------------------------------------------|
  4. |     2      A             A       4    1        A |
     |--------------------------------------------------|
  5. |     2      B             B       5    1        B |
     |--------------------------------------------------|
  6. |     2    str           a b       6    2        a |
  7. |     2    str           a b       6    2        b |
     |--------------------------------------------------|
  8. |     3      A             A       7    1        A |
     |--------------------------------------------------|
  9. |     3      B             B       8    1        B |
     |--------------------------------------------------|
 10. |     3    str       a b c d       9    4        a |
 11. |     3    str       a b c d       9    4        b |
 12. |     3    str       a b c d       9    4        c |
 13. |     3    str       a b c d       9    4        d |
     |--------------------------------------------------|
 14. |     4      A             A      10    1        A |
     |--------------------------------------------------|
 15. |     4      B             B      11    1        B |
     |--------------------------------------------------|
 16. |     4    str   a b c d e f      12    6        a |
 17. |     4    str   a b c d e f      12    6        b |
 18. |     4    str   a b c d e f      12    6        c |
 19. |     4    str   a b c d e f      12    6        d |
 20. |     4    str   a b c d e f      12    6        e |
 21. |     4    str   a b c d e f      12    6        f |
     |--------------------------------------------------|
 22. |     5      A             A      13    1        A |
     |--------------------------------------------------|
 23. |     5      B             B      14    1        B |
     |--------------------------------------------------|
 24. |     5    str           a b      15    2        a |
 25. |     5    str           a b      15    2        b |
     +--------------------------------------------------+

.
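The distinction drawn in the FAQ linked above can be sketched as follows. This is a minimal illustration on a cut-down version of the example data, not part of the original answer; the stub c for the new variables is an arbitrary choice. The if command is evaluated once, and a bare variable name there refers to the first observation, whereas the if qualifier is evaluated for every observation.

```stata
clear
input byte group str3 var2 str11 var1
1 "A"   "A"
1 "str" "a b"
2 "B"   "B"
2 "str" "a b c d"
end

* if command: evaluated once; var2 here means var2[1], which is "A",
* so the condition is false and the block is skipped entirely
if var2 == "str" {
    split var1, gen(c)
}

* if qualifier: evaluated for each observation, so only the "str"
* observations are split; c1-c4 stay empty in the other observations
split var1 if var2 == "str", gen(c)
list
```

The expand approach shown earlier in this post avoids creating the wide variables in the first place, which is why it may be more convenient when the goal is one word per observation.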


fu gang

Join Date: Jan 2021

Posts: 138
#6

29 Jun 2022, 00:47

Great idea, great program, you are amazing! Thank you very much.

fu gang

Join Date: Jan 2021

Posts: 138
#7

29 Jun 2022, 01:16

I suddenly thought of the reverse problem: suppose the target data just obtained are now the starting data and must be converted back to the original layout, that is, the vertical records

3 "str" "a"
3 "str" "b"
3 "str" "c"
3 "str" "d"

should be converted to 3 "str" "a b c d",
where the four letters a b c d (separated by spaces) form a single record of variable var1.
How can I reverse the operation and transform the data back? I have done this before with a forvalues loop, but that method is cumbersome. Is there a program like yours that solves this problem? Thank you very much.

Last edited by fu gang; 29 Jun 2022, 01:22.

Nick Cox

Join Date: Mar 2014
Posts: 35709

29 Jun 2022, 05:45

To reverse the process, see https://journals.sagepub.com/doi/pdf...36867X20909698 for concatenation of observations.

Here's a sketch.

Code:

clear
input byte group str3 var2 str1 wanted
1 "A"   "A"
1 "B"   "B"
1 "str" "a"
2 "A"   "A"
2 "B"   "B"
2 "str" "a"
2 "str" "b"
3 "A"   "A"
3 "B"   "B"
3 "str" "a"
3 "str" "b"
3 "str" "c"
3 "str" "d"
4 "A"   "A"
4 "B"   "B"
4 "str" "a"
4 "str" "b"
4 "str" "c"
4 "str" "d"
4 "str" "e"
4 "str" "f"
5 "A"   "A"
5 "B"   "B"
5 "str" "a"
5 "str" "b"
end 

gen long obsno = _n 
gen which = sum(var2 != var2[_n-1])
gen concat = wanted if var2 == "str" & var2[_n-1] != "str"
replace concat = concat[_n-1] + " " + wanted if var2 == "str" & which == which[_n-1] & concat == "" 

list 

bysort which (obsno) : drop if _n < _N 
replace concat = wanted if concat == "" 
sort obsno 

drop obsno which wanted 

list 
 
     +----------------------------+
     | group   var2        concat |
     |----------------------------|
  1. |     1      A             A |
  2. |     1      B             B |
  3. |     1    str             a |
  4. |     2      A             A |
  5. |     2      B             B |
     |----------------------------|
  6. |     2    str           a b |
  7. |     3      A             A |
  8. |     3      B             B |
  9. |     3    str       a b c d |
 10. |     4      A             A |
     |----------------------------|
 11. |     4      B             B |
 12. |     4    str   a b c d e f |
 13. |     5      A             A |
 14. |     5      B             B |
 15. |     5    str           a b |
     +----------------------------+


fu gang

Join Date: Jan 2021

Posts: 138
#9

29 Jun 2022, 11:31

Great! Thank you very much.

fu gang

Join Date: Jan 2021

Posts: 138
#10

01 Jul 2022, 22:12

After some thought, I have three ideas for solving the problem with loops, but my programs are not well written. Please lend a helping hand, thank you.

The raw data are as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte group str3 keys str11 contents
1 "A"   "A"
1 "B"   "B"
1 "str" "a"
2 "A"   "A"
2 "B"   "B"
2 "str" "a b"
3 "A"   "A"
3 "B"   "B"
3 "str" "a b c d"
4 "A"   "A"
4 "B"   "B"
4 "str" "a b c d e f"
5 "A"   "A"
5 "B"   "B"
5 "str" "a b"
end

The target data are as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte group str3 keys str1 contents
1 "A"   "A"
1 "B"   "B"
1 "str" "a"
2 "A"   "A"
2 "B"   "B"
2 "str" "a"
2 "str" "b"
3 "A"   "A"
3 "B"   "B"
3 "str" "a"
3 "str" "b"
3 "str" "c"
3 "str" "d"
4 "A"   "A"
4 "B"   "B"
4 "str" "a"
4 "str" "b"
4 "str" "c"
4 "str" "d"
4 "str" "e"
4 "str" "f"
5 "A"   "A"
5 "B"   "B"
5 "str" "a"
5 "str" "b"
end

Idea one:
First use the split command to split the string at spaces. Then, for each record, insert one fewer new lines than the number of words (because one line already exists). Then replace the original string with _g1, fill the next line with _g2, and so on, looping over the number of words until all of them are placed.

count if keys== "str"
local tol= r(N)+_N
split contents, gen(_g)
forvalues n=1(1)`tol' {
if keys== "str" {
local wc = wordcount(contents[`n'])-1
if `wc'>= 1{
insobs `wc', after(`n')
}
replace contents = _g1 if keys== "str" // this statement does not need to be inside the loop, but I don't know how to handle that
forvalues b=2/`wc' {
replace contents[`=`n'+3-`b''] = _g`b' if keys== "str" // error: weights not allowed. Intent: fill contents[_n+1], contents[_n+2], contents[_n+3] with _g2, _g3, _g4, ... in turn until all words are placed
}
}
}
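A minimal sketch of one way around the "weights not allowed" error, offered as an assumption about its cause rather than a confirmed diagnosis: replace expects a plain variable name on the left-hand side, and square brackets after the name are most likely being parsed as a weight. A single observation can instead be targeted with the in qualifier. The three-observation dataset here is invented purely for illustration.

```stata
clear
input str11 contents
"a b c"
""
""
end

* replace contents[2] = "b" is not valid syntax: subscripts are allowed
* on the right-hand side of the equals sign, not on the left

* target one observation at a time with the in qualifier instead;
* observation 1 is overwritten last because the others read from it
replace contents = word(contents[1], 2) in 2
replace contents = word(contents[1], 3) in 3
replace contents = word(contents[1], 1) in 1
list
```

Inside a forvalues loop the observation number after in can be a local macro, for example in `k'.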

Idea two:
Use the ends function of the egen command to split the string into two parts, before and after the first space, and store them in separate variables. Replace the original string with the part before the space (the first word), insert a line after it, and fill that new line with the part after the space. Then split the remainder at its first space in the same way, repeating until every word has its own line.

count if keys== "str"
local tol= r(N)+_N
forvalues n=1(1)`tol' {
if keys[`n']== "str" {
insobs 1, after(`n')
}
}

local wc = wordcount(contents[`n'])

egen contents2 = ends(contents),punct(" ")
egen contents3 = ends(contents),punct(" ") tail // Split the string in the contents variable into two parts according to the first space, and then loop
replace contents = contents2 if keys == "str"
replace contents[_n+1] = contents3[_n] if keys[_n+1] == "" // error weights not allowed
drop contents2 contents3
……

Idea three:
Similar to idea two, but use regular expressions to match the word before the first space and the words after it, store them in local macros, and then insert them in a loop according to the number of words. This method avoids generating and dropping variables.

local first = ustrregexs(1) if ustrregexm(contents,（"\w+") // matches the word before the first space, but I don't get the regex to match
local tail = ustrregexs(2) if ustrregexm(contents,（？) ) // matches the word after the first space
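
A hedged sketch of the matching step in idea three. It uses generate rather than local, because a local macro holds a single value rather than one value per observation, and local does not accept an if qualifier. The pattern and the variable names first and tail are illustrative assumptions, not the only possible choice: the first group captures the first run of non-space characters, and the second group captures everything after the spaces that follow it.

```stata
clear
input str11 contents
"a"
"a b c d"
end

* group 1: first word; group 2: the rest of the string (may be empty)
gen first = ustrregexs(1) if ustrregexm(contents, "^(\S+)\s*(.*)$")
gen tail  = ustrregexs(2) if ustrregexm(contents, "^(\S+)\s*(.*)$")
list
```

Within a loop over observations, the same functions could be applied to contents[`n'] and the results stored in local macros.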

Could you please tell me whether these ideas can work, and how the programs above could be improved? Any kind of solution would be very helpful. I look forward to your help, thank you very much.
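
For comparison with the loop-based ideas, here is a loop-free sketch that adapts the expand and word() approach from earlier in this thread to the variable names keys and contents. It is one possible route under those assumptions, shown on a cut-down dataset; obsno is a helper variable introduced only for this illustration.

```stata
clear
input byte group str3 keys str11 contents
1 "A"   "A"
1 "str" "a"
2 "B"   "B"
2 "str" "a b c d"
end

* one copy of each observation per word, but only where keys == "str"
gen long obsno = _n
expand cond(keys == "str", wordcount(contents), 1)

* within each original observation, keep only the _n-th word
bysort obsno: replace contents = word(contents, _n) if keys == "str"
drop obsno
list, sepby(group)
```

Because the copies made by expand are identical before the replace, no observations need to be inserted or filled one at a time.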
