Converting from long to wide format

William Mitchell

Join Date: Apr 2020
Posts: 57

Converting from long to wide format

12 Aug 2020, 20:33

Dear all,

I'm trying to convert from long to wide format:

Code:

reshape wide _proc_code _proc_eye _slt_eye _age istent hydrus istent_add goniotomy trabectome concur_phaco ab_int_canal other_glau_proc_code other_glau_proc_date other_glau_proc_eye, i(patient_guid) j(_proc_date)

However I'm being met with the following error code:

Code:

macro substitution results in line that is too long
    The line resulting from substituting macros would be longer than allowed.  The
    maximum allowed length is 264,408 characters, which is calculated on the basis of
    set maxvar.

    You can change that in Stata/SE and Stata/MP.  What follows is relevant only if you
    are using Stata/SE or Stata/MP.

    The maximum line length is defined as 16 more than the maximum macro length, which
    is currently 264,392 characters.  Each unit increase in set maxvar increases the
    length maximums by 129.  The maximum value of set maxvar is 32,767.  Thus, the
    maximum line length may be set up to 4,227,159 characters if you set maxvar to its
    largest value.
r(920);

The data set is large (~20 million rows) so I'm not sure whether that is part of the problem?
Or whether the number of variables I want to include under each "_proc_date" is the problem (i.e. the 14 variables: _proc_code _proc_eye _slt_eye _age istent hydrus istent_add goniotomy trabectome concur_phaco ab_int_canal other_glau_proc_code other_glau_proc_date other_glau_proc_eye)?

Any help would be so appreciated; thanks for your time.

Will

Tags: None

William Mitchell

Join Date: Apr 2020

Posts: 57
#2

12 Aug 2020, 20:44

Hi all;

I'm getting an error message that looks like it may be communicating a similar sentiment, using a different data set (similarly large):

Code:

reshape wide eye iop, i(patient_guid) j(_result_date)

Code:

characteristic contents too long The maximum value of the contents is 67,784. r(1004);

Looking forward to any advice, thanks so much again.
William
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#3

12 Aug 2020, 22:08

It's the combination of the number of variables and the number of distinct values of _result_date that is breaking the code, specifically their product.

If your command were just -reshape wide a b c, i(patient_guid) j(_result_date)- and if result_date just had 4 values, then the resulting wide data set would need to hold 3x4 = 12 variables (plus patient_guid and any other variables not mentioned in the command). Now, I don't know how many values _result_date has, but if your total data set has 20,000,000 observations ( we don't say "rows" in Stata--that's spreadsheet language), and assuming the entities denoted by patient_guid are human patients, it is likely that the number of patients is, at most, several tens of thousands. Which means you have probably 1,000 or so values of _result_date. Multiply that by 14 variables and you need to create ~14,000 new variables. Stata needs to build a list of those variable names when it does -reshape- and it initially stores that in a local macro, and subsequently, in a data set characteristic. But a list of 14,000 variable names, which on average will be, say 15 characters long, with some spaces separating them is just exceeding the amount of storage Stata can give to those macros and characteristics.

The best solution to this problem is not to do it in the first place. It is really inconceivable to me that were you to successfully reshape this data set wide that you would be able to do anything useful with it. You're actually lucky that Stata didn't waste your time cranking out this data set (which would have taken it hours if not days to do) only to leave you with a useless mess. Most Stata commands are designed to work with data in long layout. There are a few exceptions, but none of them that I am aware of would be applicable to a data set with this many variables: they are typically commands that use a pair of variables, or a small number of pairs of variables. I can pretty much guarantee you that whatever you are planning to do with this data will be much better done if you leave it long. In fact, if you did succeed in getting this data set into wide layout, probably your next post here would be some dilemma on how to do something, and the first step in the solution would be to -reshape- it back to long.

Last edited by Clyde Schechter; 12 Aug 2020, 22:12.
3 likes
Comment
Arthur Morris

Join Date: Apr 2014

Posts: 107
#4

12 Aug 2020, 22:09

The solution may be as simple as following the directions in the error message. See this documentation for how to give Stata permission to use more memory.

Beyond that we are hamstrung by lack of information about the data you are using.

It seems that you are asking Stata to create a very large number of variables. Do you intend to do this?

If _proc_date is a daily date (meaning that it takes on a new value for each day of the year) and at least one procedure occurs each day then you are asking Stata to create 365 variables for each stub in your stublist for each year in your sample. This means that the code in #2 creates up to 720 variables per year, while the code in #1 creates up to 5,475 variables per year.

It's worth reading the [reshape FAQ](https://www.stata.com/support/faqs/d...-with-reshape/) for additional insight into the command.
1 like
Comment
Arthur Morris

Join Date: Apr 2014

Posts: 107
#5

12 Aug 2020, 22:13

I agree with Clyde Schechter, especially this:

Originally posted by Clyde Schechter View Post

The best solution to this problem is not to do it in the first place. It is really inconceivable to me that were you to successfully reshape this data set wide that you would be able to do anything useful with it. You're actually lucky that Stata didn't waste your time cranking out this data set (which would have taken it hours if not days to do) only to leave you with a useless mess. Most Stata commands are designed to work with data in long layout. There are a few exceptions, but none of them that I am aware of would be applicable to a data set with this many variables: they are typically commands that use a pair of variables, or a small number of pairs of variables. I can pretty much guarantee you that whatever you are planning to do with this data will be much better done if you leave it long.
Comment
William Mitchell

Join Date: Apr 2020

Posts: 57
#6

13 Aug 2020, 15:18

Dear Clyde, and Arthur,

Thanks so much for your reply, I really appreciate the time you've taken.

I hadn't conceptualised what changing from long --> wide would do to the data, but I understand that 20 million rows, spanning 6 years, would result in >thousands of unique dates - which would mean 14*>thousands of new variables (columns).. which would have made the data virtually un-interpretable for me to continue using. So it's actually a really good result/safety mechanism, that Stata hasn't made the conversion for me. Thanks.

I am trying to do a time-till-event analysis, with some patients taking up one row, and some multiple rows (defined by a unique ID).
I had thought the data should be in wide format, but understand that this is not feasible here.
I will continue exploring ways to do this in long format, instead.

Thanks again for your time. William
1 like
Comment

Announcement

Converting from long to wide format

Comment

Comment

Comment

Comment

Comment