  • Alternatives to using split and then reshape long?

    I am trying to separate out data into rows, but without using split and reshape. Consider the following toy dataset.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float person_id str10 income
    1 "10, 20, 50"
    2 "35, 20"    
    3 ""          
    4 "40"        
    5 "34, 24"    
    end
    I would like to separate out the income column using commas as delimiters, then get it into the following long form.
    person_id income
    1 10
    1 20
    1 50
    2 35
    2 20
    4 40
    5 34
    5 24

    Typically, I would run
    Code:
    split income, parse(,) generate(inc) destring
    drop income
    reshape long inc, i(person_id) j(obs_id)
    drop if inc == .
    drop obs_id
    However, I am in a situation with lots of data (instead of five people, we have 800k observations, and instead of a maximum of three values to be separated out, we have up to 200). My attempts at applying the above method (using Stata MP, Version 16) to this large dataset have so far been futile (in the sense that the reshape runs for hours without finishing). I am wondering if there is a better (more memory-efficient) way to do this.

    For example, can I split my income values directly into rows? Or is there a way to ignore missing values when doing the reshape? In my full dataset, most observations have only a few values in the income column, so a variable like inc200 is almost entirely missing. If there were some way to drop such variables preemptively during the reshape, I think that could work as well.

    Open to any other suggestions! Thanks everyone.
    Last edited by David Yu; 03 Nov 2022, 18:43.

  • #2
    I don't know of any way to get -reshape- to skip over missing observations.

    I do know that the slowness of -reshape- is a frequent concern of Stata users. There are at least three user-written programs that can -reshape- wide to long much faster than StataCorp's official command. They are Sergio Correa's -freshape-, from SSC, Mauricio Caceres' -greshape-, available as part of the -gtools- suite at github.com/mcaceresb/stata-gtools, and Rafal Raciborski's -tolong-, from SSC. I have never looked into which of these is fastest. I suspect the answer differs according to various parameters of the problem itself. Nevertheless, any one of them will be much faster than -reshape- in any really large data set.
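
    For example, -greshape- is intended as a drop-in replacement that follows the syntax of official -reshape-, so the change to the code shown in #1 would be something like this (an untested sketch; see -help greshape- after installing -gtools- for the exact options):
    Code:
    * gtools can be installed with:  ssc install gtools
    split income, parse(,) generate(inc) destring
    drop income
    greshape long inc, i(person_id) j(obs_id)
    drop if missing(inc)
    drop obs_id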

    If none of those speeds things up enough for you, I have another suggestion. No matter which -reshape- command you use, it is going to be stuck dealing with all those surplus inc* variables that are, you say, almost always just missing. So here's an approach that deals with that by "removing the alligator eggs upstream."

    Code:
    capture program drop one_person_id
    program define one_person_id
        split income, parse(,) generate(inc) destring
        drop income
        reshape long inc, i(person_id) j(obs_id) // OR USE -freshape-, -greshape-, OR -tolong- FOR EVEN FASTER RESULTS
        drop if missing(inc)
        exit
    end
    
    runby one_person_id, by(person_id) status
    The trick here is that -runby- deals with only one person_id's data at a time, so it creates 200 inc* variables only if that id actually has an income value with 200 entries. The actual number of inc* variables created will just be the number of entries in the income observation for that one person_id. I have never tried this approach myself, because I've never had to deal with this particular situation. But I'm pretty sure it will markedly speed up the processing, and give a noticeable boost even on top of switching from -reshape- to -freshape-, -greshape-, or -tolong-.

    Added: Forgot to mention that -runby- is written by Robert Picard and me, and is available from SSC. I suggest testing the approach first on a relatively small subset of the real data just to make sure that program one_person_id is working as you expect it to. While doing that, replace the -status- option in -runby- with -verbose-. That way if there are errors along the way, you will see the error messages, which makes debugging a whole lot easier. Once you are confident that program one_person_id is correct, then get rid of -verbose- and go back to -status- for your production run with the full data set. The latter causes -runby- to give you a periodic progress report, including an estimate of time remaining.
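
    For that test run, something along these lines would do (just a sketch; the 1% subset rule is arbitrary):
    Code:
    preserve
    keep if runiform() < 0.01   // arbitrary small subset for testing
    runby one_person_id, by(person_id) verbose
    restore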
    Last edited by Clyde Schechter; 03 Nov 2022, 19:33.



    • #3
      I agree with Clyde Schechter as usual.

      FWIW, I am credited as the author of -split- (from version 8 on), although upstream of that Michael Blasnik did a lot of work on a joint command, and StataCorp haven't left the code as I submitted it. But in my time the code was never tested on datasets where hundreds of variables would be the result.

      On -reshape long- I can claim no credit and take no blame. I do recall that in a problem where -reshape long- was taking an age, I did something like the following, which doesn't hinge on creating hundreds of new variables first. There were good speed-ups.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float person_id str10 income
      1 "10, 20, 50"
      2 "35, 20"    
      3 ""          
      4 "40"        
      5 "34, 24"    
      end
      
      replace income = subinstr(income, ",", " ", .)   // turn commas into spaces so each value is a word
      replace income = "." if income == ""             // so wordcount() returns 1, not 0
      gen toexpand = wordcount(income)
      expand toexpand                                  // one observation per income value
      
      bysort person_id : gen wanted = real(word(income, _n))   // within each person_id, _n runs 1, 2, ... so this picks off successive values
      
      list , sepby(person_id)
      
           +-------------------------------------------+
           | person~d       income   toexpand   wanted |
           |-------------------------------------------|
        1. |        1   10  20  50          3       10 |
        2. |        1   10  20  50          3       20 |
        3. |        1   10  20  50          3       50 |
           |-------------------------------------------|
        4. |        2       35  20          2       35 |
        5. |        2       35  20          2       20 |
           |-------------------------------------------|
        6. |        3            .          1        . |
           |-------------------------------------------|
        7. |        4           40          1       40 |
           |-------------------------------------------|
        8. |        5       34  24          2       34 |
        9. |        5       34  24          2       24 |
           +-------------------------------------------+
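
      To get from there to exactly the long layout asked for in #1, a little mopping up should do it (untested sketch):
      Code:
      drop if missing(wanted)   // person 3 contributed no values
      drop income toexpand
      rename wanted income
      sort person_id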
