Why are the results of the loop different?

Fred Lee

Join Date: Nov 2017
Posts: 473

31 May 2023, 19:10

With different numbers of observations, why are the results of the variable presenterFirst_text for firmName 2 different?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str96 firmName str78 QAText1 str1686 QAText2 strL(QAText3 QAText4)
"1" "" "AA4 04:44这个是有一个共同手吗？" "BB1 04:52对，存在是。北京南和银行有限公司大概。" "AA4 1 05:04收入可能有280多万，成本25,000。这个是成本为什么这么低？"
"2" "" "AA3 00:29卖哪去那。"                   "AA3 00:31谁用的谁在用。"                                     "BB1 00:32不，他的目标群体是面向大众的。"                                  
end
gen questionerFirst_text = ""
gen presenterFirst_text = ""

local i = 1
local j = ustrpos(QAText1,"BB")
while `i' <=4 {
    if `j' == 1 {
        replace presenterFirst_text =  usubstr(QAText`i',10,.)
        continue, break
    }
    else {
        local i = `i' + 1
        local j = ustrpos(QAText`i',"BB")
    }
}

The code below should give the same results; however, they are different:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str96 firmName str78 QAText1 str1686 QAText2 strL(QAText3 QAText4)
"2" "" "AA3 00:29卖哪去那。" "AA3 00:31谁用的谁在用。" "BB1 00:32不，他的目标群体是面向大众的。"
end
gen questionerFirst_text = ""
gen presenterFirst_text = ""

local i = 1
local j = ustrpos(QAText1,"BB")
while `i' <=4 {
    if `j' == 1 {
        replace presenterFirst_text =  usubstr(QAText`i',10,.)
        continue, break
    }
    else {
        local i = `i' + 1
        local j = ustrpos(QAText`i',"BB")
    }
}

Last edited by Fred Lee; 31 May 2023, 19:12.

Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#2

31 May 2023, 21:19

Are you aware that all of your -ustrpos(QAText1, "BB")- expressions refer only to the value of QAText1 in the first observation in the data set? That is because you are using it in a context, -local j = ...-, that calls for a scalar, not a vector. Whenever a variable (vector) appears in a scalar expression in Stata, it will either be a syntax error or be interpreted as the value of that variable in the first observation.
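The first-observation rule can be seen in isolation with a small sketch. The toy data set and the names speaker, j_first, and j_second are assumptions made up for illustration; they are not part of the original example.

```stata
* Toy data: "BB" starts the string in observation 1 only
clear
input str10 speaker
"BB1 reply"
"AA3 ask"
end

* A bare variable name in a scalar context is evaluated in observation 1
local j_first = ustrpos(speaker, "BB")
display "bare variable name: " `j_first'

* An explicit subscript is needed to look at any other observation
local j_second = ustrpos(speaker[2], "BB")
display "explicit subscript [2]: " `j_second'
```

The first -display- shows 1 and the second shows 0, even though both expressions name the same variable.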

The two data sets are different. In the first data set, the first observation's value of QAText3 begins with "BB", so when `i' becomes 3 at the end of the second iteration of the -while- loop, `j' is set to 1. On the third iteration, presenterFirst_text is set, in every observation, to the substring of QAText3 beginning at the 10th character, and you then break out of the loop.

By contrast, in the second data set, neither QAText2 nor QAText3 in the first observation contains "BB", so `j' remains 0 and the loop continues. Only when we get to i = 4 do we finally encounter a BB in the first observation. So presenterFirst_text is set, in every observation, to the substring of QAText4 beginning at the 10th character, and at that point the loop ends through -continue, break-.

So that is why the results are different: the only data that affect the running of the loop are those in the first observation, and the two examples you show have materially different data in that first observation.

Added: I'm not certain I understand what you are attempting to do with the code. But the code is very un-Stataish. While it is all perfectly legal, it doesn't seem to accomplish what you want, at least not consistently so, and it uses constructs that are familiar in other programming languages but are seldom used in Stata (-while-, -continue, break-). My guess is that you are trying to run through the various QAText* variables until you find the first one that begins with BB. Then you want to extract the substring starting from the 10th character of that and put it in a new variable called presenterFirst_text. I would also guess that you want to do this for each observation in the data set. (The code you wrote never looks at any observation but the first.) If I have guessed your aim correctly, I suggest you do this instead:

Code:

reshape long QAText, i(firmName)
by firmName (_j): egen first_BB = min(cond(ustrpos(QAText, "BB") == 1, _j, .))
* -gen- requires that presenterFirst_text not already exist; -drop- it first if it does
by firmName: gen presenterFirst_text = usubstr(QAText[first_BB], 10, .)
reshape wide // PROBABLY OMIT THIS--SEE NOTE BELOW

Note: I don't know what you will be doing with the data after this. I will just point out that most data management in Stata is more easily carried out with the data in the long layout that arises from the -reshape long- command. The -reshape wide- at the end takes you back to the original wide layout--but it is likely that using the wide layout will just cause you difficulties (as it already has, leading you to write complicated, opaque code that malfunctioned in ways you did not grasp). So unless you know for sure that you will be doing something with this data that requires the wide layout, I suggest you omit the final -reshape wide- and continue to work with the data in long layout.

Last edited by Clyde Schechter; 31 May 2023, 21:32.
Fred Lee

Join Date: Nov 2017

Posts: 473
#3

31 May 2023, 21:37

Thanks, Clyde! I had not realized that -local j = ...- calls for a scalar, not a vector. How can I correct this loop, since I want it to work on a vector?
Nick Cox

Join Date: Mar 2014

Posts: 35781
#4

01 Jun 2023, 03:39

#3 is already answered in #2, I think. Clyde's code looks at all observations, not just the first.
Fred Lee

Join Date: Nov 2017

Posts: 473
#5

01 Jun 2023, 04:53

Thanks, Nick! I didn't notice that Clyde had modified the post. That code works well for the task. However, I have a lot of variables in my dataset, so using -reshape- will cause problems for the other variables. Clyde's understanding is right. Do you have a way to implement it without changing the data structure (that is, without using -reshape-)? Thanks a ton!
Nick Cox

Join Date: Mar 2014

Posts: 35781
#6

01 Jun 2023, 05:00

I don't have different suggestions and, to be frank, have not tried to understand the entire problem. If you have a very big dataset, then apart from advising that you use a computer to match -- easy if possibly impractical advice -- it is sometimes easier just to work on a smaller dataset with fewer variables and then re-combine later. But it remains axiomatic that

Almost everything you want to do with datasets that could be long or wide in layout is easier in Stata if they are made long.

Sorry, but I don't read Chinese and I don't understand what the data are so it would not be helpful to make any further guesses.
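For readers who want to stay in the wide layout as asked in #5, one possible sketch is a loop over the variables in which each -replace- is restricted by an -if- qualifier, so that every observation is examined separately. This is an illustration under stated assumptions, not code posted in the thread: the toy data (three QAText variables, English text) and the helper variable found are made up for the example.

```stata
* Toy data in the same shape as the example (hypothetical values)
clear
input str5 firmName str30 QAText1 str30 QAText2 str30 QAText3
"1" ""               "AA4 04:44question" "BB1 04:52answer"
"2" "BB1 00:10early" "AA3 00:31other"    "BB1 00:32late"
"3" "AA3 00:29where" "AA3 00:31who"      ""
end

gen presenterFirst_text = ""
gen byte found = 0    // hypothetical flag: 1 once a "BB" text has been taken

foreach v of varlist QAText1-QAText3 {
    * Take the text only where no earlier variable began with "BB"
    replace presenterFirst_text = usubstr(`v', 10, .) ///
        if !found & ustrpos(`v', "BB") == 1
    replace found = 1 if ustrpos(`v', "BB") == 1
}
drop found

list firmName presenterFirst_text
```

Here the loop runs over variables, while the -if- qualifier does the work across observations, so no local macro ever holds a value taken from the data. Firm 2 keeps the text from QAText1 rather than QAText3 because the flag blocks later matches, and firm 3 is left empty because none of its texts begins with "BB".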
Fred Lee

Join Date: Nov 2017

Posts: 473
#7

01 Jun 2023, 05:08

In my original dataset, the same firm has multiple observations. Using -reshape- from wide to long, and then from long back to wide, will delete the multiple observations for the same firm. I am thinking that maybe I need to use the firm name and other information to identify each unique observation.
Nick Cox

Join Date: Mar 2014

Posts: 35781
#8

01 Jun 2023, 05:18

It's not a surprise to hear that you have other variables but they weren't presented as part of the problem. The premise in #7 is incorrect. If you reshape one way, you can reshape back without loss of information so long as the command is given identifiers. Usually this kind of dataset consists of firms and times but other set-ups are naturally possible.
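A minimal sketch of such a round trip, assuming no existing variable uniquely identifies the rows: the identifier obs_id, the j() variable turn, and the toy data are all hypothetical names and values chosen for illustration.

```stata
* Toy data: firm "1" appears twice, so firmName alone cannot serve as i()
clear
input str5 firmName str20 QAText1 str20 QAText2
"1" "AA1 00:01hello" "BB1 00:05reply"
"1" "BB1 00:02first" "AA1 00:09next"
"2" "AA3 00:29where" "AA3 00:31who"
end

gen long obs_id = _n    // hypothetical identifier: one value per original row

reshape long QAText, i(obs_id) j(turn)
bysort obs_id (turn): egen first_BB = min(cond(ustrpos(QAText, "BB") == 1, turn, .))
by obs_id: gen presenterFirst_text = usubstr(QAText[first_BB], 10, .)
drop first_BB

* Back to wide: still one row per obs_id, so no observations are lost
reshape wide QAText, i(obs_id) j(turn)
count
list firmName presenterFirst_text
```

Because obs_id is unique per original row, both rows for firm "1" survive the round trip, and firmName is simply carried along as a variable that is constant within obs_id.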
Fred Lee

Join Date: Nov 2017

Posts: 473
#9

01 Jun 2023, 05:33

Oh yes, you are right, the sample size didn't change. That was my mistake!