Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sorting imported data in panel data format

    I have a question regarding structuring panel data in Stata, and preparing the data in a panel structure to run the appropriate analysis. The tables under are shown in excel, however are imported directly into Stata for the sorting.


    Problem and description inn data:


    Tables:


    Click image for larger version

Name:	Tables.png
Views:	2
Size:	64.0 KB
ID:	1377550


    Prod_id refers to the product id being offered by the store, at a given price, at a given date. For example, the store Brio offers product 10 for 1999 on 06.01.2017 (Table 1). This means that the next date on the same product by the same shop (Brio) will be a referred to as a price change, here 1899 n 10.01.2017.

    Table 1 depicts the actual shape the inn data has in excel, such that it can be imported directly to Stata. Table 2 refers to the panel structure needed from table 1, where the minimum price of each product, from each shop is given for each date. Table 3 is simply table 2, with prices being constant between dates with price changes. Color code orange and brown indicate the same product with different variants being offered on the same shop on the same dates. Grey indicates a restart of the date cycle, i.e. another variant of the same product by the same shop, while yellow indicate a change to another product. Their content is explained in correct context in the following sentences.

    Main issues: (And what I have tried)

    The first issue is constructing in table 2, a pre-specified date period (here 02.01.2017-15.01.2017) such that all prices from all shops, per product can be filled inn, i.e. to get it in the format of table 2.

    The main issue I have however, is sorting it in Stata after importing table 1 from excel. I try using the reshape, however this I think fails due to two complexities that occur in the structure of inn data in panel 1. These are when two (or more) prices occur on the same date, for the same shop. An example of this is marked in orange and brown in table 1. The orange case is discussed first:

    When two prices occur due to product cycle restart (date starting over for same shop)
    In orange, it is the same shop, Brio, that is offering two variants of the same product 10, where the grey area indicates a restart of the date, i.e. another variant of the same product. The collision occurs on 10.01.2017 with price 1899, and for the next variant at 2100, on 10.01.2017.

    When two prices occur due to the same shop changing the price on the same day:
    This is marked by the brown area in table 1, showing the shop Leke, changing their price from 1999 to 2034 on the same date 12.01.2017

    What I need help with:

    I need to sort the data in Stata from table 1, in a form shown in table 2. Table 2, shows the minimum values for each shop at each date, given each product, such that if there are complexities with two prices colliding (as stated in the issues, marked by the orange&brown area), the minimum of those is chosen for the sorting. Table 2 also needs to construct a prespecified date period for each period (here 02.01.2017-15.01.2017) such that all price observations can be included.

    Table 3 is the last step, which is table 2, with all the prices in between two price changes, being constant. The yellow area indicates that the product changes from 10 to another one, 12, such that the filling out of prices has to be subject to the prespecified date period for each product, given each shop.

    Summary

    A panel data setup is needed from table 1, shown in table 2, where first a prespecified date period must be constructed, and then the minimum prices per date – per product – per shop be sorted in, before blank dates are filled out with the previous price. I have tried in Stata using reshape, but due to complexities, am not sure of how to do it. The original data also has much more shops, longer periods of time, such that it is preferable to do this in Stata, as excel the manual process is not feasible for such a large set.


    Attached Files

  • #2
    Cross-posted at http://stackoverflow.com/questions/4...nel-data-stata and already commented on there.

    Please note our cross-posting policy which is that you should tell us about it:

    http://www.statalist.org/forums/help#crossposting

    You should also heed the request for readable data examples. It's predictable that no one will want to re-create your data examples by typing the data in, especially there is extra uncertainty about what variable names, types, etc. the data will adopt on import to Stata.

    http://www.statalist.org/forums/help#stata esp 12.2
    Last edited by Nick Cox; 09 Mar 2017, 04:45.

    Comment


    • #3
      I apologize for not including the cross-posting, as my first post here I think a reminder popped up about the cross posting (or it was in the FAQ). Anyway, apologies, as I am a fairly new member, and I will fix this asap. I will also include the file from table 1 in Stata, another mistake I should've avoided.

      The following attachment provides table 1 inn data in a Stata format.
      Last edited by Dave Bennett; 09 Mar 2017, 06:10.

      Comment


      • #4
        Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int(date price) str4 store byte prod_id
        20825 1999 "Brio" 10
        20829 1899 "Brio" 10
        20829 1890 "Brio" 10
        20831 1934 "Brio" 10
        20829 1941 "Leke" 10
        20830 1929 "Leke" 10
        20831 2029 "Leke" 10
        20832 1995 "Leke" 10
        20825 3780 "Brio" 12
        20829 3787 "Brio" 12
        20829 3788 "Brio" 12
        20831 3699 "Brio" 12
        20829 3999 "Leke" 12
        20830 3939 "Leke" 12
        20831 3939 "Leke" 12
        20832 3890 "Leke" 12
        end
        format %tdnn/dd/CCYY date

        Comment


        • #5
          So I set this problem aside this morning and went about my day job. Returning to it now, its clear that the sample data at the link in post #3 is sorted in a different order than the presentation in Table 1 of post #1.

          No, I take that back. I realize now it simply does not match; Table 1 shows 6 observations for Product 10 in store Brio; the sample data only has 4 observations for that combination. And Table 1 has 9 observations from Brio and 7 from Leke while the sample data has 8 from each store.

          Could you correct your sample data to match Table 1 and repost it using dataex as in post #4?

          And can you tell us exactly order the data are meant to be in? Table 1 appears to be by prod_id, then by store, then by the invisible "version" that we deduce whenever a new price occurs on an earlier date than the previous row for the same combination of product and store.

          Comment


          • #6
            I must have made another error while importing the file in Stata. Attached to this post is the correct Stata file, matching exact table 1. The data are supposed to be in the exact order as presented in the attached Stata file, or as shown in table 1. To clarify, your interpretation of the order is exact.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int(date price) str4 store byte prod_id
            20825 1999 "Brio" 10
            20829 1899 "Brio" 10
            20831 1934 "Brio" 10
            20823 2200 "Brio" 10
            20829 2100 "Brio" 10
            20832 2200 "Brio" 10
            20827 2099 "Leke" 10
            20831 1999 "Leke" 10
            20831 2034 "Leke" 10
            20833 2300 "Leke" 10
            20821 2999 "Brio" 12
            20827 2899 "Brio" 12
            20831 2934 "Brio" 12
            20827 3200 "Leke" 12
            20831 3100 "Leke" 12
            20833 3200 "Leke" 12
            20834 3099 "Leke" 12
            end
            format %tdnn/dd/CCYY date

            Comment


            • #7
              Here is some code to paste into Stata's Do-file Editor and run, reviewing the output carefully, and availing yourself of the help command and Stata's PDF documentation where the meaning of the commands are unclear.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int(date price) str4 store byte prod_id
              20825 1999 "Brio" 10
              20829 1899 "Brio" 10
              20831 1934 "Brio" 10
              20823 2200 "Brio" 10
              20829 2100 "Brio" 10
              20832 2200 "Brio" 10
              20827 2099 "Leke" 10
              20831 1999 "Leke" 10
              20831 2034 "Leke" 10
              20833 2300 "Leke" 10
              20821 2999 "Brio" 12
              20827 2899 "Brio" 12
              20831 2934 "Brio" 12
              20827 3200 "Leke" 12
              20831 3100 "Leke" 12
              20833 3200 "Leke" 12
              20834 3099 "Leke" 12
              end
              format %tdnn/dd/CCYY date
              // change date format to match Excel tables
              format %tdDD.NN.CCYY date
              // data assertedly sorted by prod_id/store/version (version invisible)
              // "sort" the data so Stata knows it's sorted
              sort prod_id store, stable
              // find versions within each store
              by prod_id store: generate byte version = sum(_n==1 | date < date[_n-1])
              order prod_id store version date
              list, sepby(prod_id store version)
              // find lowest price by product/store/version/date
              sort prod_id store version date price
              by prod_id store version date: drop if _n>1
              list, sepby(prod_id store version)
              // expand to every day
              egen spv = group(prod_id store version)
              xtset spv date
              tsfill
              by spv: replace prod_id = L.prod_id   if prod_id==.
              by spv: replace store   = store[_n-1] if store==""
              by spv: replace version = L.version   if version==.
              by spv: replace price   = L.price     if price==.
              drop spv
              list, sepby(prod_id store version)
              // find least expensive version of product by store/date
              sort prod_id store date price
              by prod_id store date: drop if _n>1
              list, sepby(prod_id store)
              // reshape
              drop version
              reshape wide price, i(prod_id date) j(store) string
              rename (price*) (*)
              list, sepby(prod_id)
              The result from the final list command approximates your Table 3, except that, as you do not extend the initial observations back to the beginning of time, I do not extend the final observations out to the end of time. I argue that in the absence of data to support such extensions in either direction, this is in general a preferable approach.
              Code:
                   +------------------------------------+
                   | prod_id         date   Brio   Leke |
                   |------------------------------------|
                1. |      10   04.01.2017   2200      . |
                2. |      10   05.01.2017   2200      . |
                3. |      10   06.01.2017   1999      . |
                4. |      10   07.01.2017   1999      . |
                5. |      10   08.01.2017   1999   2099 |
                6. |      10   09.01.2017   1999   2099 |
                7. |      10   10.01.2017   1899   2099 |
                8. |      10   11.01.2017   1899   2099 |
                9. |      10   12.01.2017   1934   1999 |
               10. |      10   13.01.2017   2200   1999 |
               11. |      10   14.01.2017      .   2300 |
                   |------------------------------------|
               12. |      12   02.01.2017   2999      . |
               13. |      12   03.01.2017   2999      . |
               14. |      12   04.01.2017   2999      . |
               15. |      12   05.01.2017   2999      . |
               16. |      12   06.01.2017   2999      . |
               17. |      12   07.01.2017   2999      . |
               18. |      12   08.01.2017   2899   3200 |
               19. |      12   09.01.2017   2899   3200 |
               20. |      12   10.01.2017   2899   3200 |
               21. |      12   11.01.2017   2899   3200 |
               22. |      12   12.01.2017   2934   3100 |
               23. |      12   13.01.2017      .   3100 |
               24. |      12   14.01.2017      .   3200 |
               25. |      12   15.01.2017      .   3099 |
                   +------------------------------------+
              Last edited by William Lisowski; 10 Mar 2017, 07:46.

              Comment


              • #8
                I ran the do file and it worked perfectly. Thank you very much, for a well executing code. When it comes to extending the initial observations back to beginning of time, has to do with (forgive me if I did not mention this), the products being introduced at those dates by different shops. The extending of time final observations out to the end of time is necessary in order to compare the shops at different dates. Here it is also assumed (again, forgive me if I did not mention this), that prices continue constant, out to the end of time, from the last observation.

                In all essence, the way the ''pre specified'' period is decided, is choosing dates such that price changes that shops make (observations in table 1), are captured by the table. And to compare shops, it is assumed that from the last observation in a shop, it continues to the end of time, til an ending date such that price changes of all shops are captured. It is therefore, that table 3 is specified the way it is (here, starting dates should have been 04.01.2017, so the table is correct).

                Comment


                • #9
                  It would not be difficult to add some code to extend each product out to the date of the last entry for that product; you may even have done so already. Without testing, it would be something like
                  Code:
                  by prod_id (date): replace Brio = Brio[_n-1] if Brio==. and Brio[_n-1]!=.
                  with a similar line for Leke. With a large number of stores, a foreach loop would be helpful.

                  It would be more complicated to extend each product out to the date of the last entry for any product. Because in your sample data, the last entry for both products is 15.01.2017, I cannot tell which you intend.

                  Put another way, is there a separate "end of time" for each product, or a single "end for time" for all products?

                  I will say that, without knowing the details of your analysis, I can imagine situations in which the assumption that a product once offered continues to be offered at the last offer price until the end of time would be difficult to support. This is particularly true given the possibility of "multiple versions" of the same product. My experience is that all too often manufacturers introduce an "improved" version of a product at the same or higher cost, and stores lower the price of the former version to quickly clear out their inventory of the former version. Assuming that the former version remains available at the clearance price ad infinitum misstates the actual lowest cost of that product from that store.

                  Again, this is for you to ponder knowing your research area. But constructing multiple choice purchase models needs the confidence that the choices faced by the purchaser are real.
                  Last edited by William Lisowski; 10 Mar 2017, 13:08.

                  Comment


                  • #10
                    Ah, yes indeed. There is not a separate end time for each product, as there is a single ''end of time'' for all products. By this I mean, that in order to capture all shops that offer the product, they will all have the same ending date, with their independent last observation, being the end value (i.e. constant from the last observation til end date). So for a given product, different shops will have different last observations (price changes), but all shops per product will have the same end date, and repeating the same process for each product will give the same end date for all products.

                    The underlying analysis is suppose to be done for different time periods. First it will be done for 2 months, such that only data for january& february is chose, but all will have the end date of 28/02/2017, regardless of when in february their last price change is.
                    Last edited by Dave Bennett; 10 Mar 2017, 13:36.

                    Comment


                    • #11
                      OK, revised code (revisions marked in red) that expands each series to the same "end of time" which I arbitrarily chose to be 18.01.2017 for this example.
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input int(date price) str4 store byte prod_id
                      20825 1999 "Brio" 10
                      20829 1899 "Brio" 10
                      20831 1934 "Brio" 10
                      20823 2200 "Brio" 10
                      20829 2100 "Brio" 10
                      20832 2200 "Brio" 10
                      20827 2099 "Leke" 10
                      20831 1999 "Leke" 10
                      20831 2034 "Leke" 10
                      20833 2300 "Leke" 10
                      20821 2999 "Brio" 12
                      20827 2899 "Brio" 12
                      20831 2934 "Brio" 12
                      20827 3200 "Leke" 12
                      20831 3100 "Leke" 12
                      20833 3200 "Leke" 12
                      20834 3099 "Leke" 12
                      end
                      format %tdnn/dd/CCYY date
                      // change date format to match Excel tables
                      format %tdDD.NN.CCYY date
                      // data assertedly sorted by prod_id/store/version (version invisible)
                      // "sort" the data so Stata knows it's sorted
                      sort prod_id store, stable
                      // find versions within each store
                      by prod_id store: generate byte version = sum(_n==1 | date < date[_n-1])
                      order prod_id store version date
                      list, sepby(prod_id store version)
                      // find lowest price by product/store/version/date
                      sort prod_id store version date price
                      by prod_id store version date: drop if _n>1
                      list, sepby(prod_id store version)
                      // expand data out to end of time (arbitrarily chosen as 18.01.2017 for this example)
                      generate copies = 1
                      by prod_id store version: replace copies = date[_n+1]-date
                      by prod_id store version: replace copies = td(18.01.2017)-date+1 if _n==_N
                      expand copies, generate (copied)
                      sort prod_id store version date copied
                      by prod_id store version date: replace date = date[_n-1]+1 if copied
                      drop copies copied
                      list, sepby(prod_id store version)
                      // find lowest price by product/store/date
                      sort prod_id store date price
                      by prod_id store date: drop if _n>1
                      list, sepby(prod_id store)
                      // reshape
                      drop version
                      reshape wide price, i(prod_id date) j(store) string
                      rename (price*) (*)
                      list, sepby(prod_id)
                      Code:
                           +------------------------------------+
                           | prod_id         date   Brio   Leke |
                           |------------------------------------|
                        1. |      10   04.01.2017   2200      . |
                        2. |      10   05.01.2017   2200      . |
                        3. |      10   06.01.2017   1999      . |
                        4. |      10   07.01.2017   1999      . |
                        5. |      10   08.01.2017   1999   2099 |
                        6. |      10   09.01.2017   1999   2099 |
                        7. |      10   10.01.2017   1899   2099 |
                        8. |      10   11.01.2017   1899   2099 |
                        9. |      10   12.01.2017   1899   1999 |
                       10. |      10   13.01.2017   1899   1999 |
                       11. |      10   14.01.2017   1899   1999 |
                       12. |      10   15.01.2017   1899   1999 |
                       13. |      10   16.01.2017   1899   1999 |
                       14. |      10   17.01.2017   1899   1999 |
                       15. |      10   18.01.2017   1899   1999 |
                           |------------------------------------|
                       16. |      12   02.01.2017   2999      . |
                       17. |      12   03.01.2017   2999      . |
                       18. |      12   04.01.2017   2999      . |
                       19. |      12   05.01.2017   2999      . |
                       20. |      12   06.01.2017   2999      . |
                       21. |      12   07.01.2017   2999      . |
                       22. |      12   08.01.2017   2899   3200 |
                       23. |      12   09.01.2017   2899   3200 |
                       24. |      12   10.01.2017   2899   3200 |
                       25. |      12   11.01.2017   2899   3200 |
                       26. |      12   12.01.2017   2899   3100 |
                       27. |      12   13.01.2017   2899   3100 |
                       28. |      12   14.01.2017   2899   3100 |
                       29. |      12   15.01.2017   2899   3099 |
                       30. |      12   16.01.2017   2899   3099 |
                       31. |      12   17.01.2017   2899   3099 |
                       32. |      12   18.01.2017   2899   3099 |
                           +------------------------------------+
                      Last edited by William Lisowski; 10 Mar 2017, 18:59. Reason: Corrected the second replace command in the red text

                      Comment


                      • #12
                        Worked perfectly. Thank you William!

                        Comment

                        Working...
                        X