  • Alternatives to using split and then reshape long?

    I am trying to separate out data into rows, but without using split and reshape. Consider the following toy dataset.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float person_id str10 income
    1 "10, 20, 50"
    2 "35, 20"    
    3 ""          
    4 "40"        
    5 "34, 24"    
    end
    I would like to separate out the income column using commas as delimiters, then get it into the following long form.
    person_id income
    1 10
    1 20
    1 50
    2 35
    2 20
    4 40
    5 34
    5 24

    Typically, I would run
    Code:
    split income, parse(,) generate(inc) destring
    drop income
    reshape long inc, i(person_id) j(obs_id)
    drop if inc == .
    drop obs_id
    However, I am in a situation with lots of data (instead of five people, we have 800k observations, and instead of a maximum of three values to be separated out, we have up to 200). My attempts at applying the above method (using Stata MP, Version 16) to this large dataset have so far been futile (in the sense that the reshape runs for hours without finishing). I am wondering if there is a better (more memory-efficient) way to do this.

    For example, can I split my income values directly into rows? Or is there a way to ignore missing values when doing the reshape? In my full dataset, most observations have only a few values in the income column, so a variable like inc200 is almost entirely missing. If there were some way to drop such variables preemptively during the reshape, I think that could work as well.

    Open to any other suggestions! Thanks everyone.
    Last edited by David Yu; 03 Nov 2022, 18:43.

  • #2
    I don't know of any way to get -reshape- to skip over missing observations.

    I do know that the slowness of -reshape- is a frequent concern of Stata users. There are at least three user-written programs that can -reshape- wide to long much faster than StataCorp's official command. They are Sergio Correa's -freshape-, from SSC, Mauricio Caceres' -greshape-, available as part of the -gtools- suite at github.com/mcaceresb/stata-gtools, and Rafal Raciborski's -tolong-, from SSC. I have never looked into which of these is fastest. I suspect the answer differs according to various parameters of the problem itself. Nevertheless, any one of them will be much faster than -reshape- in any really large data set.
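
    For example, -greshape- is intended as a drop-in replacement that follows the syntax of official -reshape-, so the change to the code shown in #1 would be something like this (an untested sketch; see -help greshape- after installing -gtools- for the exact options):
    Code:
    * gtools can be installed with:  ssc install gtools
    split income, parse(,) generate(inc) destring
    drop income
    greshape long inc, i(person_id) j(obs_id)
    drop if missing(inc)
    drop obs_id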

    If none of those speeds things up enough for you, I have another suggestion. No matter which -reshape- command you use, it is going to be stuck dealing with all those surplus inc* variables that are, you say, almost always just missing. So here's an approach that deals with that by "removing the alligator eggs upstream."

    Code:
    capture program drop one_person_id
    program define one_person_id
        split income, parse(,) generate(inc) destring
        drop income
        reshape long inc, i(person_id) j(obs_id) // OR USE -freshape-, -greshape-, OR -tolong- FOR EVEN FASTER RESULTS
        drop if missing(inc)
        exit
    end
    
    runby one_person_id, by(person_id) status
    The trick here is that -runby- deals with only one person_id's data at a time, so it creates 200 inc* variables only if that id actually has an income value with 200 entries. The actual number of inc* variables created will just be the number of entries in the income observation for that one person_id. I have never tried this approach myself, because I've never had to deal with this particular situation. But I'm pretty sure it will markedly speed up the processing, and give a noticeable boost even on top of switching from -reshape- to -freshape-, -greshape-, or -tolong-.

    Added: Forgot to mention that -runby- is written by Robert Picard and me, and is available from SSC. I suggest testing the approach first on a relatively small subset of the real data just to make sure that program one_person_id is working as you expect it to. While doing that, replace the -status- option in -runby- with -verbose-. That way if there are errors along the way, you will see the error messages, which makes debugging a whole lot easier. Once you are confident that program one_person_id is correct, then get rid of -verbose- and go back to -status- for your production run with the full data set. The latter causes -runby- to give you a periodic progress report, including an estimate of time remaining.
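
    For that test run, something along these lines would do (just a sketch; the 1% subset rule is arbitrary):
    Code:
    preserve
    keep if runiform() < 0.01   // arbitrary small subset for testing
    runby one_person_id, by(person_id) verbose
    restore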
    Last edited by Clyde Schechter; 03 Nov 2022, 19:33.



    • #3
      I agree with Clyde Schechter as usual.

      FWIW, I am credited as the author of -split- (from version 8 on), although upstream of that Michael Blasnik did a lot of work on a joint command, and StataCorp haven't left the code as I submitted it. But in my time the code was never tested on datasets where hundreds of variables would be the result.

      On -reshape long- I can claim no credit and take no blame. I do recall that in a problem where -reshape long- was taking an age, I did something like the following, which doesn't hinge on creating hundreds of new variables first. There were good speed-ups.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float person_id str10 income
      1 "10, 20, 50"
      2 "35, 20"    
      3 ""          
      4 "40"        
      5 "34, 24"    
      end
      
      replace income = subinstr(income, ",", " ", .)   // turn commas into spaces so each value is a word
      replace income = "." if income == ""             // so wordcount() returns 1, not 0
      gen toexpand = wordcount(income)
      expand toexpand                                  // one observation per income value
      
      bysort person_id : gen wanted = real(word(income, _n))   // within each person_id, _n runs 1, 2, ... so this picks off successive values
      
      list , sepby(person_id)
      
           +-------------------------------------------+
           | person~d       income   toexpand   wanted |
           |-------------------------------------------|
        1. |        1   10  20  50          3       10 |
        2. |        1   10  20  50          3       20 |
        3. |        1   10  20  50          3       50 |
           |-------------------------------------------|
        4. |        2       35  20          2       35 |
        5. |        2       35  20          2       20 |
           |-------------------------------------------|
        6. |        3            .          1        . |
           |-------------------------------------------|
        7. |        4           40          1       40 |
           |-------------------------------------------|
        8. |        5       34  24          2       34 |
        9. |        5       34  24          2       24 |
           +-------------------------------------------+
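
      To get from there to exactly the long layout asked for in #1, a little mopping up should do it (untested sketch):
      Code:
      drop if missing(wanted)   // person 3 contributed no values
      drop income toexpand
      rename wanted income
      sort person_id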
