Help needed with generating specific datapoints/observations

Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#1

Help needed with generating specific datapoints/observations

10 Jan 2023, 01:19

Hello

I'm working on a big survey dataset and i'm looking for a way of generating specific datapoints based on the information in other variables. This post is sort of continuing on the following posts: https://www.statalist.org/forums/for...riable-problem . Which i got amazing help in regards to how to manipulate a time variable.

Anyways,

For example, i have a variable that is showing the time of day a respondent finished answering a survey called "hour" --> if the respondent finishes the survey between 00:00 and 00:59 = 1, 01:00 - 01:59 = 2,..., 23:00 - 23:59 = 24.

Furthermore, the dataset contains various categorical variables that show what the respondents have answered on the different questions in the survey.

For example, "How positive or negative are you in regards to what you have just seen?" 1 = very negative, 2 = negative, 3 = neither positive or negative, 4 = positive, 5 = very positive.

Now, what I would like to do is to take the average score between the different time intervals such that that "score" is shown through all the observations of that time interval

For example: (q1_average_scores is the variable i want to make)
ID q1_scores hour q1_average_scores

1 3 1 3.5

2 4 1 3.5

3 5 2 3

4 1 2 3

As mentioned, the dataset is quite large and have 10's of thousands more rows. Doing this manually for 24 time periods is a possibility (but, not really), and if I were to do it based on the above example i believe the code would be:

generate q1_average_scores = 0

sum q1_scores in 1/2 ---> Which would give me the average score of ID 1 and 2 = 3.5
replace q1_average_scores = 3.5 in 1/2

then

sum q1_scores in 3/4 --> average = 3
replace q1_average_scores = 3 in 3/4

The problem, of course, is that with 15 more variables and 24 time intervals the job is tedious and also obviously extremely prone to human error, which i would like to avoid. My question therefore is: is there a way to get the same result using another code that would cross-reference the average value of q1_scores and the respective time interval in "hour" and then put the average score in "q1_average_scores"? I dont know if i'm even explaining myself adequately, but hopefully that is understandable.

Thanks in advance for any help!

Tor
Tags: generate data points, generate observations, generate variables, replace
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17711
#2

10 Jan 2023, 01:44

Tor:
I would give the following code a shot:

Code:

bysort hour: egen wanted=mean( q1_scores)

Kind regards,
Carlo
(Stata 19.0)
1 like
Comment
Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#3

10 Jan 2023, 01:59

That worked perfectly, and exactly as intented. Thank you, Carlo, much appreciated!

Tor
Comment
Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#4

11 Jan 2023, 06:52

Hi again, Carlo (or anyone that might feel compelled to offer a solution)

I was wondering if it was possible to use the same function in order to make a percentage score instead? More specifically, by just getting the percentage score of all values = 2 OR 3 within a certain time interval

For example
ID q1_scores hour % score

1 2 4 80%

2 2 4 80%

3 3 4 80%

4 2 4 80%

5 5 4 80%

As shown above, it would potentially look something like shown in variable "% score", where 80% of the respondents that have answered in the time interval = 4 has given an answer which corresponds to the value 2 or 3.

Im guessing i am to use a similar bysort code involving IF and OR, which also includes a way of making it a percentage score, but i've been looking at both help bysort, by, sort and egen and haven't found anything that would help.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#5

11 Jan 2023, 07:30

I can't envisage what your data look like. The thread linked in #1 does not help me as it seems to focus only on the handling of dates and times.

In any case, you should (please) by now -- after 35 posts and 15 months on the list -- be familiar with our request to show data examples using dataex.

FAQ Advice #12 applies.

In the absence of a data example. I will answer by analogy. A percent(age) is just one step away from a proportion and a proportion can always be cast as the mean of a (0, 1) indicator.

This is as example you can run in Stata yourself.

Code:

. sysuse auto, clear (1978 automobile data) . egen wanted = mean(100 * cond(rep78 < ., inlist(rep78, 4,5), .)), by(foreign) . tabdisp foreign, c(wanted) ---------------------- Car | origin | wanted ----------+----------- Domestic | 22.91667 Foreign | 85.71429 ---------------------- . tab rep78 foreign, col +-------------------+ | Key | |-------------------| | frequency | | column percentage | +-------------------+ Repair | record | Car origin 1978 | Domestic Foreign | Total -----------+----------------------+---------- 1 | 2 0 | 2 | 4.17 0.00 | 2.90 -----------+----------------------+---------- 2 | 8 0 | 8 | 16.67 0.00 | 11.59 -----------+----------------------+---------- 3 | 27 3 | 30 | 56.25 14.29 | 43.48 -----------+----------------------+---------- 4 | 9 9 | 18 | 18.75 42.86 | 26.09 -----------+----------------------+---------- 5 | 2 9 | 11 | 4.17 42.86 | 15.94 -----------+----------------------+---------- Total | 48 21 | 69 | 100.00 100.00 | 100.00

There is quite a lot to unpack here, but here goes, The core is that a percentage is 100 times a proportion.

Mixing code and pseudocode, the main idea is

egen wanted = mean(100 * true-or-false-expression), by(whatever)

where I should flag that

egen wanted = 100 * mean(true-or-false-expression), by(whatever)

may seem equivalent, but that is illegal as Stata code. The trick is writing down the true-or-false expression as code.

inlist(rep78, 3, 4)

is true (returning 1) if rep78 is 3 or 4 and false (returning 0) otherwise.

It is elementary and also fundamental that the mean of a series of 0s and 1s is a proportion just as the mean of

0, 0, 0, 1, 1, 1, 1, 1, 1, 1

is 0.7 as those magnificent seven 1s are mixed with three 0s.

The twist here is that missing values also count as "not 3 or 4", so the code insists on ignoring such values. Know that Stata always ignores missing values as far as possible when calculating means.

cond(rep78 < ., inlist(rep78, 4,5), .)

says

if rep78 < . evaluate inlist(rep78, 4, 5)
otherwise return missing

.
1 like
Comment
Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#6

12 Jan 2023, 01:55

Hello and thanks for your reply, Nick!

My apologies, i thought providing a table of example data was sufficient. I will start using the dataex command from now on.

I'm not quite sure i understand what the specific code i am to use in order to get the percentage value, but I think i somewhat understand the process. Using dataex:

input long(ID q1_awareness) float hour
467 5 8
468 3 9
469 1 9
470 4 10
471 5 10
472 4 10
473 3 10
474 4 10
475 4 10
476 4 10
477 3 10
478 4 10

My first step was to find the average value within specific time intervals. The code Carlo provided worked perfectly for this:

bysort hour: egen wanted=mean(q1_awareness)

Now, what I think you're trying to tell me is to use a code that creates a variable where the values are 0 or 1. It will be 1 if a criteria is satisfied (for example, the criteria can be that within the time intervals the value in q1_awareness needs to be either 4 or 5), and it will return 0 if the values are anything else than 4 or 5. Furthermore, it takes the mean of the 0 and 1's within the different time intervals and multiplies it by 100 in order to get the actual percentage value. (btw, it does not need to be multiplied by 100 since the information of f.ex 0.31 or 31 will essentially be the same).

Again, i am not quite sure how to type the specific code needed for this operation, but based on your explanation would something like the below code be correct?

egen wanted = 100*mean(q1_awareness if q1_awareness == 4 OR 5), by(hour)

Obviously, this code wont work in STATA, and my challenge is getting the condition term correct.

Last edited by Tor Haug Anonsen; 12 Jan 2023, 01:58.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#7

12 Jan 2023, 02:55

Thanks for the (slightly incomplete) data example.

In #5 I was at pains to flag that 100 * mean(something) is illegal as an egen function call, but mean(100 * something) is legal. This is surprising mathematically but from a Stata view isn't difficult to explain: the syntax of egen as shown in the help is strict that a function call must immediately follow the equals sign.

if qualifiers aren't allowed within function calls. That is why we need true-or-false checks, say using inlist() or inrange().

The code you want is quite close to my example in #5. As you say, a factor of 100 is utterly a matter of convention and not compulsory, so in this version we will leave it out.

Being cautious about missing values does no harm.

Code:

clear input long(ID q1_awareness) float hour 467 5 8 468 3 9 469 1 9 470 4 10 471 5 10 472 4 10 473 3 10 474 4 10 475 4 10 476 4 10 477 3 10 478 4 10 end egen pc45 = mean(cond(q1_awareness < ., inlist(q1_awareness, 4,5), .)), by(hour) list, sepby(hour) +----------------------------------+ | ID q1_awa~s hour pc45 | |----------------------------------| 1. | 467 5 8 1 | |----------------------------------| 2. | 468 3 9 0 | 3. | 469 1 9 0 | |----------------------------------| 4. | 470 4 10 .7777778 | 5. | 471 5 10 .7777778 | 6. | 472 4 10 .7777778 | 7. | 473 3 10 .7777778 | 8. | 474 4 10 .7777778 | 9. | 475 4 10 .7777778 | 10. | 476 4 10 .7777778 | 11. | 477 3 10 .7777778 | 12. | 478 4 10 .7777778 | +----------------------------------+

See also https://www.stata-journal.com/articl...article=dm0055 -- especially Section 9.

To maintain my reputation as picky, I flag https://www.statalist.org/forums/help#spelling and that criterion is the singular of criteria.
1 like
Comment
Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#8

12 Jan 2023, 04:34

Ah, okey. I understand. Thank you for providing a very good example, which worked perfectly! (and also giving me a lecture in grammar, all though, in my defense, english is not my first language and i do know the plural and singular spelling of the word in Norwegian )
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#9

12 Jan 2023, 04:41

Picky on picky: the grammar here is really Greek, as English singular and plural are copied from the Greek originals. just as with phenomenon and phenomena.

But I know absolutely no Norwegian. I can tell you that many English-speaking academics and students get this wrong too.
Comment
Tor Haug Anonsen

Join Date: Oct 2021

Posts: 44
#10

12 Jan 2023, 05:26

Til
Comment

ID	q1_scores	hour	q1_average_scores
1	3	1	3.5
2	4	1	3.5
3	5	2	3
4	1	2	3

ID	q1_scores	hour	% score
1	2	4	80%
2	2	4	80%
3	3	4	80%
4	2	4	80%
5	5	4	80%

Announcement

Help needed with generating specific datapoints/observations

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment