find the second maximum value in row

Chen Samulsion

Join Date: Jan 2018
Posts: 932

find the second maximum value in row

11 Sep 2018, 22:24

Dear Stata Users,

Using -rowmax()- of egen function, we can find the maximum value in varlist for each observation (row). And now I want to find the second maximum value besides the first in each row. Is there any way to solve the problem? Commands and data example are as follows. Thank you.

Code:

egen firstmax=rowmax(h1a h1b h1c h1d h1e h1f h1n h1o h1p h1q)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(h1a h1b h1c h1d h1e h1f) int h1g long h1h int h1i long h1j int h1k long h1l int h1m long(h1n h1o h1p h1q)
 40000  2000 0      0 20000 0 0 2000 0  3000    0  300     0  7000 0     0     0
     0  3000 0   1000     0 0 0    0 0     0 2220  325     0     0 0     0     0
     0 50000 0      0     0 0 0    0 0     0    0  900     0     0 0     0 20000
 30000 15000 0      0     0 0 0    0 0  2500    0    0     0     0 0     0     0
     0     0 0      0     0 0 0    0 0     0    0    0     0     0 0 50000     0
     0 50000 0  36500  2000 0 0    0 0  3420    0    0   960     0 0     0     0
  1500 30000 0      0     0 0 0    0 0     0    0 1600     0  1000 0     0     0
 50000     0 0      0     0 0 0    0 0 30000 1200    0  3600     0 0     0     0
 60000 30000 0      0     0 0 0    0 0 10000    0  630 10000 10000 0     0  5500
 70000     0 0      0   500 0 0 1500 0     0    0    0     0     0 0     0     0
 20000  4000 0      0     0 0 0 5000 0  2400    0  500     0  1000 0     0     0
  6000  5500 0      0     0 0 0    0 0  2600    0  375     0   500 0     0     0
     0     0 0      0     0 0 0    0 0     0    0    0     0     0 0  5000     0
 90000     0 0      0     0 0 0 4500 0     0    0    0     0     0 0     0     0
 40000  1800 0  60000     0 0 0    0 0     0    0  375     0     0 0     0     0
100000     0 0      0     0 0 0 5000 0     0    0  500     0     0 0     0     0
 50000 22000 0  30000     0 0 0    0 0  1800    0  600     0   200 0  2000     0
     0   500 0      0     0 0 0    0 0   500    0  500     0     0 0  1000     0
 30000 30000 0 300000  3000 0 0    0 0     0    0    0     0  1500 0     0     0
 20000     0 0      0     0 0 0  500 0   200    0    0     0     0 0     0     0
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#2

11 Sep 2018, 23:08

So, there are two obstacles to doing this. One is intrinsic to the problem, the other is having the data in wide layout. The intrinsic difficulty is that you have a series of h1* variables, but some of them play no role in the calculation of firstmax and secondmax, so that has to be accounted for somehow in the code.

This works:

Code:

// GO TO LONG LAYOUT gen long obs_no = _n reshape long h1, i(obs_no) j(suffix) string // INDICATE WHICH OBSERVATIONS (ORIGINALLY VARIABLES) // ARE INCLUDED IN THE CALCULATION gen byte include = !inlist(suffix, "g", "h", "i", "j", "k", "l", "m") // FIND THE FIST AND SECOND MAX by obs_no include (h1), sort: gen firstmax = h1[_N] by obs_no include (h1): gen secondmax = h1[_N-1] // CLOBBER FIRST AND SECOND MAXES THAT ARISE // FROM EXCLUDED VARIABLES by obs_no (include): replace firstmax = firstmax[_N] by obs_no (include): replace secondmax = secondmax[_N] // CLEAN UP drop include // AND IF THERE IS SOME REAL REASON TO GO BACK TO WIDE LAYOUT // YOU CAN NOW DO: reshape wide

I don't know what your plans for this data are, but if you have additional analysis in mind, you are most likely better off skiping the final -reshape wide- command and leaving the data in long layout. There are only a few things in Stata that can be done easily (or even at all) in wide layout; Stata is mostly optimized for working with long data. So I recommend you keep things long unless you have specific analyses that work well with wide. If, in the end, you need to return to wide layout to create a layout that is easily read by human eyes, you can do that after all the analyses are done.
Comment
Chen Samulsion

Join Date: Jan 2018

Posts: 932
#3

11 Sep 2018, 23:33

Dear Clyde Schechter , thank you very much. The data I showed above is extracted from a household income survey. Variables from h1a to h1q indicate separate income sources. I want to know each household's first and second main income sources (have the largest and second largest values). Before I open this topic, I have tried to drop the first main income sources and re-generate the secondmax variable using the -rowmax()- function, see the code below. However, I don't want to drop any variables (actually I replace it with missing values) before taking the next step. And your answer demonstrate again that maybe I have to destroy something before stepping forward. Thank you again.

Code:

egen incfirst=rowmax(h1a-h1q) /*first main income sources*/ foreach v of varlist h1a-h1r { replace `v'=. if `v'==incfirst } egen incsecond=rowmax(h1a-h1q) /*second main income sources*/
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#4

12 Sep 2018, 00:23

And your answer demonstrate again that maybe I have to destroy something before stepping forward.

I don't understand this. The code that I wrote in #2 destroys nothing at all. Every bit of information in the original data is preserved intact throughout.

By the way, the code you show in #3, in addition to destroying some of the data, is also incorrect. It will not produce the right results if the first and second largest sources of income are tied, because it will replace with missing all values that are equal to the largest source and then find the next largest remaining. But if the first and second largest are tied, then that next largest remaining source will be the third, or fourth, or conceivably even lower ranked.

Last edited by Clyde Schechter; 12 Sep 2018, 00:26.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35782

12 Sep 2018, 00:37

See https://www.stata-journal.com/sjpdf....iclenum=pr0046 for a review of working rowwise. The command rowsort helps here:

Code:

. rowsort h1*, gen(H1-H17) descending 

. list H1 H2 

     +----------------+
     |     H1      H2 |
     |----------------|
  1. |  40000   20000 |
  2. |   3000    2220 |
  3. |  50000   20000 |
  4. |  30000   15000 |
  5. |  50000       0 |
     |----------------|
  6. |  50000   36500 |
  7. |  30000    1600 |
  8. |  50000   30000 |
  9. |  60000   30000 |
 10. |  70000    1500 |
     |----------------|
 11. |  20000    5000 |
 12. |   6000    5500 |
 13. |   5000       0 |
 14. |  90000    4500 |
 15. |  60000   40000 |
     |----------------|
 16. | 100000    5000 |
 17. |  50000   30000 |
 18. |   1000     500 |
 19. | 300000   30000 |
 20. |  20000     500 |
     +----------------+

As the maximum occurs in different variables (and is not guaranteed to occur just once) dropping the variable with the maximum is not going to work.

In strategic terms I agree with Clyde. Long layout is generally preferable in Stata.

Comment

Chen Samulsion

Join Date: Jan 2018

Posts: 932
#6

12 Sep 2018, 01:50

Dear Mr. Clyde Schechter, I am sorry, the word 'destory' was misused, I actually mean some 'dirty work' (maybe also misused) such as dropping certain variable or changing data format.

Nick Cox, thank you for introducing -rowsort-, it works very well. However, there's something wrong with the option -descending-, it should be wrote in brief as -descend-. (installed through SSC, and version information *! NJC 1.2.0 22 November 2005)

And just as what you've said in rowsort remarks:

rowsort loops over observations and may be relatively slow. It may be faster to reshape, sort within blocks, and reshape again.

there's a consensus on how to find the first and second maximum value between you and Schechter. And my solution in #3 ignores the tied data.

Last edited by Chen Samulsion; 12 Sep 2018, 01:56.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#7

12 Sep 2018, 02:03

My code in #5 was an exact copy of code that works. As flagged by the link, I am using rowsort from the Stata Journal which starts

Code:

*! NJC 2.0.0 30 January 2009 * NJC 1.2.0 22 November 2005 * NJC 1.1.0 21 November 2000 program rowsort version 9 syntax varlist [if] [in], Generate(str) [ Descending HIGHmissing ]

There is no reason to use the older rowsort from SSC unless you're using Stata 7 or 8.
Comment
Nazli Uludere Aragon

Join Date: Dec 2016

Posts: 9
#8

23 Jul 2025, 11:06

Hello all! Could rowsort handle 5204 vars? I tried "rowsort *, generate(s1-s5204) descending" for 5204 vars (there is a reason why I need to keep all) but get an error (r101).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#9

23 Jul 2025, 12:29

After creating a data set with 5,204 variables, I ran the comand -rowsort *, gen(s1-s5204) descending- and I cannot reproduce the error reported in #6.

My attempt did additional produce an error message, but that was about the number of variables allowed (error code 900). In my setup (version 19.5 MP4), by default the maximum number of variables allowed is 5,000. So even creating the initial data set with 5,204 triggered that error. In order to run this code, maxvar must be set large enough to allow not just for the original 5,204 variables but also for the 5,204 new variables to be created, and also for some additional variables that are created along the way. (I don't know exactly how many such variables must be allowed for. Just running it with maxvar = 10,408 triggers the error, but with maxvar = 12,000 it runs just fine.)

So I am puzzled what caused the syntax error referred to. Possibilities that come to mind are:
The command actually run was some mistyped version of what is shown in #6.

The command actually run was copied from some other source that resulted in non-printing characters being inserted into the command. In that case, the solution is to delete the command, and rewrite it by hand without re-copying from the original source.

The version of -rowsort- being used is somehow corrupted, or perhaps out of date. The most recent version is version 2.0.0, from 30 Jan 2009. Consider re-installing it.

I am also a bit surprised by a data set containing a series of 5,204 variables. Though there are exceptional circumstances where this is the best way to organize the data, in nearly all settings this kind of data would be better handled in long layout, as most Stata commands will work better (or only) with long data. Consider -reshape-ing the data to long.

Finally, it should be mentioned that -rowsort- is not an official Stata command. It is written by Nick Cox and is available from SSC. In future posts mentioning user-written commands, please identify the author(s) and indicate where the code can be found.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1480
#10

23 Jul 2025, 12:44

I think the November 2005 version of rowsort cannot handle that many variables, but the January 2009 version can. The former is what gets installed if you do

Code:

ssc install rowsort

You can install the newer version by instead doing

Code:

net install pr0046.pkg, replace

Edit: crossed with #9. With 5,204 variables, and with the old version of the command, even with set maxvar adjusted to a high enough level, I got the error:

Code:

invalid numlist has too many elements r(123);

Last edited by Hemanshu Kumar; 23 Jul 2025, 12:50.
Comment
Nazli Uludere Aragon

Join Date: Dec 2016

Posts: 9
#11

23 Jul 2025, 13:16

Thanks Clyde and Hemanshu for the helpful feedback. I am using the rowsort version Nick and Hemanshu referred (pr0046.pkg). My stata is 14.2MP and I did set maxvar to max (>32000).

The reason for the large number of vars is I am analyzing biophysical data (area by soil type by parcel). In this case, I have several states worth of data. It amounts to >5200 total soil types (blame the soil survey) for over 300,000 parcels / land units. The goal is to sort for the top 3 soil type by parcel, and see if only using top 3 soil types captures representative area (>80%) for parcels (rather than trying to account for all soil types with non-zero area). Sorry for the boring detail! I suppose I can split this up into 2-3 chunks and sort, and repeat the sort on sorted. I'll let you know when I figure it out. Thanks again!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#12

23 Jul 2025, 14:18

I think it will be much easier if you change the data set to long. I'll assume your data set has only the 5204 soil-type variables, which I'll assume are named x1 through x5204.

Code:

gen `c(obs_t)' parcel = _n reshape long x, i(parcel) j(soil_type_num) drop if missing(x) by obs_no (x), sort: gen byte top_three = (_n >= _N-2)

If you have no need at this point for the values other than the top-three for each parcel you can just -keep if top_three-. If you will still need all the other variables, you can always single out the top-three values in a command by their non-zero value of variable top_three. The -reshape- command will probably be slow, but I suspect it will be no slower than the -rowsort- command, and quite possibly substantially faster. I recommend re-saving the data under a new filename (never overwrite original data!) after the -drop if missing(x)- command so that you can use this long-layout version of the data for other analyses. Almost anything you will want to do with this data will work better (or only) with the long layout.
Comment
Nazli Uludere Aragon

Join Date: Dec 2016

Posts: 9
#13

23 Jul 2025, 18:02

Clyde thanks a lot. That is a great idea. I'll give it a shot!
Comment

Announcement

find the second maximum value in row

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment