Help with "leave one out" mode

Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#1

16 Oct 2023, 12:27

Hi, I'm trying to calculate the following leave one out mode: I have data on student postcode of residence and their school code. For each student I want to find the modal postcode of all students in their school except that particular student. I followed the solution given in this thread.

The code I used:

Code:

gen modal_pcd = .
gen value_pcd = .
forvalues i = 1/`=N' {
    replace value_pcd = cond(school == school[`i'] & _n != `i', postcode, .)
    egen tempmode = mode(value_pcd), minmode
    replace modal_pcd = tempmode in `i'
    drop tempmode
}

But I get error r(198): "N not found invalid syntax"

Would appreciate any help.

Thanks,
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#2

16 Oct 2023, 12:37

`=N' should be `=_N'.
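A minimal sketch of why the underscore matters, assuming a throwaway dataset of 5 empty observations created only for illustration. The inline form =exp inside macro quotes evaluates the expression and substitutes the result before the command runs. _N is the built-in observation count, whereas a bare N is looked up as a variable name, which is where "N not found" comes from.

```stata
clear
set obs 5

* The inline expression is evaluated first, so the loop header becomes 1/5
local count = 0
forvalues i = 1/`=_N' {
    local count = `count' + 1
}
display "loop ran `count' times over `=_N' observations"
```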
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#3

18 Oct 2023, 04:58

Ahh, now I see it! Thanks a lot!

I modified the code as below since my postcode variable is string and school is numeric.

Code:

gen modal_pcd = " "
gen value_pcd = " "
tostring school, gen(school_str)
forvalues i = 1/`=_N' {
    replace value_pcd = cond(school_str == school_str[`i'] & _n != `i', postcode, .)
    egen tempmode = mode(value_pcd), minmode
    replace modal_pcd = tempmode in `i'
    drop tempmode
}

But this gives the error "type mismatch", r(109).

I'm not sure what's going wrong here?

Last edited by Titir Bhattacharya; 18 Oct 2023, 05:00.
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#4

18 Oct 2023, 08:48

Code:

replace value_pcd = cond (school_str == school_str [`i'] & _n!= `i', postcode, .)

is a type mismatch because you initially defined the variable value_pcd to be a string (-gen value_pcd = " "-). So you cannot then propose to set its value to numeric missing (.). The string missing value is not ., it is "". It is possible that postcode is also numeric and causes a type mismatch, but as postcode is not defined within the code you show, this is unknowable.
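A sketch of the loop with "" in place of the numeric missing value, run on a tiny made-up dataset (the six observations are illustrative, not the poster's data). It assumes postcode is a string and keeps school numeric, since the comparison itself does not require -tostring-. Because every pass rebuilds value_pcd over the whole dataset, the cost grows with the square of the number of observations.

```stata
clear
input str1 postcode float school
"A" 111
"A" 111
"B" 111
"C" 112
"C" 112
"A" 112
end

* String variables start out as string missing, not numeric missing
gen modal_pcd = ""
gen value_pcd = ""

forvalues i = 1/`=_N' {
    * Same-school postcodes, blanking out observation i itself
    quietly replace value_pcd = cond(school == school[`i'] & _n != `i', postcode, "")
    * minmode picks the alphabetically first postcode when modes tie
    quietly egen tempmode = mode(value_pcd), minmode
    quietly replace modal_pcd = tempmode in `i'
    drop tempmode
}
list school postcode modal_pcd
```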

In general, when asking for help with code, it is a good idea to show example data. The helpful way to do that is with the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
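For anyone who has not used it, a minimal sketch of a -dataex- call, assuming a Stata version where -dataex- is installed. The auto dataset that ships with Stata stands in for real data here, and the choice of variables and observation range is arbitrary.

```stata
sysuse auto, clear

* Print a copy-and-paste-ready listing of three variables, first five rows
dataex make price mpg in 1/5
```

The output is an -input- block like the one later in this thread, which others can paste into Stata to recreate the example exactly.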
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#5

19 Oct 2023, 11:31

Thanks Clyde, your suggestion was helpful and the code is working. I'll also keep in mind the point about example data.

I had a follow-up question: the actual data on which I'm running the code has about 150,000,000 observations, so the code has been running forever. In particular, I can see that while there was a one-shot change to value_pcd, with 570 real changes, there are sequential changes of 1 or 2 observations at a time to modal_pcd, and it has been running for a very long time. My output window looks like this:

Code:

variable value_pcd was str1 now str8
(570 real changes made)
variable modal_pcd was str1 now str6
(1 real change made)
(2 real changes made)
(1 real change made)
(2 real changes made)
.
.
.
.

Is there any way to modify the code so that it runs relatively faster, or perhaps with fewer iterations to modal_pcd?

I look forward to your suggestions.
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#6

19 Oct 2023, 11:42

Yes, the code can probably be sped up a bit. But I wouldn't attempt it without example data to work with, and I doubt anybody else would either. Imaginary code based on imaginary data has a way of working badly in real data. Please post back using the -dataex- command to show example data. As you have over 200 posts here I assume that by now you are familiar with -dataex-. If not, do see Forum FAQ #12.

Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#7

19 Oct 2023, 11:57

Thanks Clyde.
Here is the example data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 studentid float acadyr str1 postcode float school
"123" 2010 "A" 111
"123" 2011 "A" 111
"123" 2012 "B" 111
"123" 2013 "C" 111
"123" 2014 "A" 111
"123" 2015 "D" 111
"124" 2012 "S" 111
"124" 2013 "C" 111
"124" 2015 "C" 112
"124" 2016 "C" 112
"124" 2017 "S" 111
"125" 2007 "A" 111
"125" 2008 "A" 112
"125" 2009 "A" 111
"126" 2012 "S" 111
"126" 2014 "B" 112
"126" 2015 "C" 112
"126" 2016 "D" 111
"126" 2017 "A" 112
"126" 2018 "A" 114
end

Apologies for not being able to post an excerpt of the actual data: I work with it on a secure server and cannot take the data out of the project area. Hope this helps!


Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#8

19 Oct 2023, 12:34

I believe this will run noticeably faster, though I doubt the speedup will be dramatic. We'll see.

Code:

capture program drop one_school
program define one_school
    by postcode, sort: gen freq = _N
    gsort -freq postcode
    gen modal_pcd = postcode[1]
    local alternate = freq[1] + 1
    if freq[`alternate'] > freq[1] - 1 {
        replace modal_pcd = postcode[`alternate'] in 1/`=freq[1]'
    }
    exit
end

runby one_school, by(school) status

Notes:
1. -runby- is written by Robert Picard and me; it is available from SSC.
2. When you used -egen mode, minmode-, with postcode being a string variable, you implicitly mean "alphabetically first" as minimum. This code rests on the same understanding.
3. The potential speedup comes from a few things. First, instead of looping over individual observations, this code processes each school as a batch. Instead of using the -egen mode()- function, we calculate the overall minimum valued mode directly, saving a slight amount of overhead. Next, the exclusion of the student's own postal code value from calculating the mode can only change the result if that student is in the overall min-valued modal group. And then only if the frequency of that min-valued modal postcode, when diminished by one, is exceeded by the frequency of the next best group. And, again, that adjustment can be done in a group defined by an -in- condition, rather than one observation at a time.

Last edited by Clyde Schechter; 19 Oct 2023, 12:47. Reason: Change >= to > in code.
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#9

19 Oct 2023, 13:23

Sorry, but I still have the edge case wrong in edited #8. I believe this is correct:

Code:

capture program drop one_school
program define one_school
    by postcode, sort: gen freq = _N
    gsort -freq postcode
    gen modal_pcd = postcode[1]
    local alternate = freq[1] + 1
    if (freq[`alternate'] > freq[1] - 1) ///
        | ((freq[`alternate'] == freq[1]-1) & postcode[`alternate'] < postcode[1]) {
        replace modal_pcd = postcode[`alternate'] in 1/`=freq[1]'
    }
    exit
end

runby one_school, by(school) status

The logic is this: excluding the self lowers the frequency of the previously calculated overall min-value modal postcode by 1. There are three cases:
1. The reduced value is still greater than the frequency of the "runner up" postcode. In this case nothing needs to change.
2. The reduced value is less than the frequency of the "runner up" postcode. Then the runner-up postcode becomes the new min-valued mode for the students in the original min-value modal postcode.
3. The reduced value now ties the frequency of the "runner up" postcode. This has two subcases. 3a) The original postcode is still alphabetically earlier than the "runner up" postcode--nothing needs to change. 3b) The original post code is alphabetically later than the "runner up." In this case, the original "runner up" postcode becomes the new min-valued mode for the students in the original min-value modal postcode.
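Case 3b can be traced by hand on the six school-112 records from the example data posted earlier (C appears three times, A twice, B once). The sketch below runs the body of one_school directly on those records, without -runby-, purely as an illustration of the three-case logic. After sorting, C occupies rows 1 to 3 and the runner-up A starts at row 4. Removing one C leaves a 2-2 tie, and A sorts before C, so the C rows switch to A.

```stata
clear
input str1 postcode
"C"
"C"
"A"
"B"
"C"
"A"
end

* Frequency of each postcode, then most frequent first (ties alphabetical)
by postcode, sort: gen freq = _N
gsort -freq postcode

* Overall min-valued mode, and the row where the runner-up group starts
gen modal_pcd = postcode[1]
local alternate = freq[1] + 1

* Cases 2 and 3b: the runner-up overtakes once one record is left out
if (freq[`alternate'] > freq[1] - 1) ///
    | ((freq[`alternate'] == freq[1] - 1) & postcode[`alternate'] < postcode[1]) {
    replace modal_pcd = postcode[`alternate'] in 1/`=freq[1]'
}
list postcode freq modal_pcd
```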
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#10

20 Oct 2023, 16:49

Hi Clyde, Thank you for the code, I'm in the process of running it.

I was wondering if there might be some alternative to runby? Since I work on a server, I cannot install user-written commands by myself using SSC and would have to wait until the server administrators put the ado file in my project area.

However, I completely understand if there aren't any, and I will wait till it is sorted at the backend.
Dirk Enzmann

Join Date: Apr 2014

Posts: 549
#11

20 Oct 2023, 17:08

You wrote

I cannot install user written commands by myself using SSC and would have to wait till the server administrators put the ado file in my project area.

I am not sure whether this will help in your situation, but did you have a look at sysdir (see help sysdir and the respective PDF documentation)?
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#12

20 Oct 2023, 17:21

Hi Dirk,

I looked at sysdir following your advice, and was able to see the adopath, personal ado directory etc.
I also tried ssc install and, as expected, I was not able to access it.

Could you suggest please if there is something I could do with the adopath?
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#13

20 Oct 2023, 18:16

Could you suggest please if there is something I could do with the adopath?

Possibly. Run -sysdir- and see where your PERSONAL adopath directory is. If you have write permission for that directory, you can go ahead and install -runby- from SSC on your own personal computer (assuming you have Stata on it). -runby- is contained in three files: runby.ado, runby_run.ado, and runby.sthlp. Copy those files to some portable media (maybe a flash drive), or email them to yourself and then download them on the other computer. If you then copy those three files into the PERSONAL directory, you will be able to use -runby-.
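A sketch of the commands involved, offered as one possibility rather than a guaranteed route on a locked-down server. The folder name ado under the working directory is hypothetical; any folder you can write to would serve. Besides PERSONAL, Stata normally searches the current working directory (shown as "." in the -adopath- listing), and -adopath +- appends a further folder for the current session.

```stata
* Show the system directories, including PERSONAL
sysdir

* Show the ado-file search path, in search order
adopath

* Hypothetical folder holding runby.ado, runby_run.ado and runby.sthlp
local mydir "`c(pwd)'/ado"
adopath + "`mydir'"

* Check whether Stata can find the command (nonzero return code: not found)
capture which runby
display "which runby returned code " _rc
```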

Unfortunately, there is no simple substitute for -runby- using only built-in Stata commands. (Obviously, -runby- itself uses built-in Stata and Mata commands--but it is a lengthy and somewhat complex program. The original version was smaller and simpler, but also less effective.)

I sympathize with the difficulties you are facing here. Network administrators face a massive challenge trying to keep their systems safe from malware and other forms of attack. They often incline to emphasize security over usability in the way they set up their systems. I can't say I blame them for doing that given the incentive structure they work under. But that doesn't make life any better for us system users. I generally turn down work that would require me to use locked systems like that. I consider myself fortunate that my position enables me to do that.
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#14

21 Oct 2023, 01:48

Thanks Clyde! Your suggestions have been extremely helpful, as always.

It appears I don't have write permission for the directory, so I'll wait till the administrators put the ado file there.

Thanks again!
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#15

27 Oct 2023, 17:27

Hi Clyde, I was able to get runby and run the code; however, I'm getting the following error:

Code:

r(3900)
st_data(): 3900 unable to allocate real <tmp>[132180040,1]
stata2mata(): - function returned error
runby_main(): - function returned error
<istmt>: - function returned error

Any suggestions on how to approach this? Thanks!
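The message says Mata could not allocate a column of 132,180,040 values. One thing that might be worth trying is to apply the same runner-up logic with -by- prefixes only, so that no data pass through Mata. The following is a sketch under assumptions, not a tested replacement for the -runby- code. It uses a small made-up dataset, treats each observation as one record (as the loop in this thread does), and introduces the helper variables negfreq, top_freq, alt_pcd and alt_freq for illustration. It leaves a school whose records all share one postcode unchanged and gives a single-record school an empty result. Whether the extra variables fit in memory at 150 million observations is untested.

```stata
clear
input str1 postcode float school
"A" 111
"A" 111
"B" 111
"C" 112
"C" 112
"C" 112
"A" 112
"A" 112
"B" 112
"A" 114
end

* Frequency of each postcode within its school
bysort school postcode: gen long freq = _N
gen long negfreq = -freq

* Within school: most frequent postcode first, ties broken alphabetically
bysort school (negfreq postcode): gen modal_pcd = postcode[1]
by school: gen long top_freq = freq[1]

* The runner-up group starts right after the top group (missing if none)
by school: gen alt_pcd = postcode[top_freq + 1]
by school: gen long alt_freq = freq[top_freq + 1]

* Only records in the top group can see a different leave-one-out mode
replace modal_pcd = alt_pcd if postcode == modal_pcd & !missing(alt_freq) ///
    & (alt_freq > top_freq - 1 | (alt_freq == top_freq - 1 & alt_pcd < postcode))

* A school with a single record has no other records to take a mode over
by school: replace modal_pcd = "" if _N == 1

drop negfreq top_freq alt_pcd alt_freq
list school postcode freq modal_pcd, sepby(school)
```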