Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem running reshape command

    Hello everyone,

    I am using Stata 13.1. I have a large dataset and I am attempting to use reshape to get it into a format in which I can run further analysis. Currently, I have a "customer" variable with roughly 3.3 million observations, as well as "unit_sold" and "age" variables. For each customer, variables are assigned based on whether a unit was sold in a particular year with a 1, and the customer's age in that year (see code below). Ultimately I would like to sum the units sold for each single year of age, by year.

    Here is the original form of my dataset:
    Code:
        +----------------------------------+
        customer   unit_s~d   age   year
        ----------------------------------
    1.    2550          1    38   2001
    2.    2550          1    39   2002
    3.    2558          1    64   2001
    4.    2558          1    66   2003
    5.    2558          1    67   2004
        ----------------------------------
    6.    3486          1    29   2002
    7.    3486          1    30   2003
        +----------------------------------+
    I ran the command
    Code:
    reshape wide unit_sold age, i (customer) j (year)

    Code:
        +-------------------------------------------------------------------------+
        customer   u_s~2000   age2000   u_s~2001   age2001   u_s~2002   age2002
        -------------------------------------------------------------------------
    1.    4781          1        25          .         .          1        27
    2.    5221          1        33          1        34          1        35
    3.    84612          1        36          1        37          1        38
    4.    94675          1        48          1        49          1        50
        +-------------------------------------------------------------------------+
    I now have a single row for each customer (which I can ultimately drop), and columns for individual years. However, I've been struggling to run another reshape long command combined with collapse, in order to get my data into this form:

    Code:
         +--------------------------------------+
         | age   uni~2001   uni~2002   uni~2003 |
         |--------------------------------------|
      1. |  16       8465       7548       7201 |
      2. |  17       9682       9284       8621 |
      3. |  18       3784       4551       4620 |
      4. |  19       7986       7496       6633 |
      5. |  20       7845       8121       8305 |
         +--------------------------------------+
    When I run
    Code:
    reshape long age, i (customer)
    Stata generates a "_j" variable that contains all years in the dataset for each customer, and then recreates the table with a "1" to indicate a unit sold for a particular year in EACH variable column. The result being, that if I run
    Code:
    collapse (sum) unit_sold*, by (age)
    I end up with astronomically high numbers.

    Any guidance would be greatly appreciated. I am a new Stata user and have tried to do as much research as possible before posting. Thanks in advance.
    Last edited by Chris Henderson; 22 Sep 2014, 07:37.

  • #2
    I stopped at

    Ultimately I would like to sum the units sold for each single year of age, by year.
    as to me that implies collapse, not reshape.

    Code:
    collapse (sum) unit_sold, by(age year)

    Comment


    • #3
      You are correct, and thank you for the reply. However, my problem is with using reshape to first get my dataset into the form where I can run
      Code:
      collapse (sum) unit_sold, by (age year)
      Which I have already done in current form. The problem with the output after running that command is explained above.
      Last edited by Chris Henderson; 22 Sep 2014, 07:34.

      Comment


      • #4
        I don't see why you need a reshape at all.

        Comment


        • #5
          I need to reshape wide so that the columns are my year variables. However, Stata will not allow me to reshape wide without also including the age variable, which I then need to place back in long form before running collapse.

          I apologize if I'm not being clear with my question, the vocabulary is still pretty new to me. With my data in current form, here is a sample output using collapse:


          Code:
          list
          
               +----------------------------------+
               | customer   unit_s~d   year   age |
               |----------------------------------|
            1. |     2500          1   2008    15 |
            2. |     2500          1   2009    16 |
            3. |     2551          1   2007    22 |
            4. |     2551          1   2008    23 |
            5. |     2551          1   2009    24 |
               |----------------------------------|
            6. |     2552          1   2005    46 |
            7. |     2552          1   2006    47 |
               +----------------------------------+
          
          
          collapse (sum) unit_sold, by (age year)
          
          list
              +-----------------------+
               | year   age   unit_s~d |
               |-----------------------|
            1. | 2008    15       9648 |
            2. | 2009    15        634 |
            3. | 2010    15       1286 |
            4. | 2007    16       6775 |
            5. | 2008    16       9300 |
               |-----------------------|
            6. | 2009    16        497 |
               +-----------------------+
          This is what I'm shooting for:
          Code:
          list
          
               +--------------------------------------+
               | age   uni~2000   uni~2001   uni~2002 |
               |--------------------------------------|
            1. |  15       6588       4691       4332 |
            2. |  16       8455       7669       5248 |
            3. |  17       6803       6923       6589 |
            4. |  18       9644       8791       8865 |
               +--------------------------------------+

          Comment


          • #6
            The final data structure you specify would make more things more difficult than it would make easier. Most panel data problems in Stata are made more difficult by having a variable for each unit of time, here year.

            Your examples are difficult to follow because the data are different. The three successive listings do not appear compatible.

            But if you insist on it you can reshape after the collapse.

            Code:
             
            collapse (sum) unit_sold , by(age year) 
            reshape wide unit_sold, i(age) j(year)
            As said, the data structure produced by collapse alone I would regard as best for most conceivable purposes.

            Comment


            • #7
              Thank you, using reshape after collapse made things much clearer. Working with this particular demographic dataset, I do need to have it in the form of the final table I posted, and your code worked.

              Thanks again

              Comment


              • #8
                Remember the advice about preferring long data structure for panel data, which is often given in this forum.

                Comment

                Working...
                X