Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting occurrences where current value is equal or less than previous values

    Greetings All,

    I've been scratching my head at the below for some time, hoping someone can point me in the right direction.

    I have a daily transactional dataset with gaps. I want to see whether a product with a reference price "websiteprice" of $X on date T had an actual sold price "actualsoldprice" $Y >= $X for at least 10% of previous T – 90 days. In other words, for each transaction where "sale_at_or_above_refprice" == 1, we need to count how many times the actual sold price for prior transactions (of a given product) within the previous 90 days met or exceeded that transactions reference price. I have included first step results that I am looking for in the "wanted" column.

    My data is as follows,


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 orderdate str16 productcode str10 productcategory byte(websiteprice actualsoldprice sale_at_or_above_refprice var7 wanted)
    "3-Jan-20"  "MZZ32819-564-282" "Mens Jeans" 40 25 . .  .
    "8-Jan-20"  "MZZ32819-564-282" "Mens Jeans" 40 40 1 .  .
    "12-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1 .  1
    "12-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 28 . .  .
    "18-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 24 . .  .
    "20-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 30 . .  .
    "27-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 25 . .  .
    "11-Oct-20" "MZZ32819-564-282" "Mens Jeans" 40 20 . .  .
    "19-Oct-20" "MZZ32819-564-282" "Mens Jeans" 35 24 . .  .
    "2-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1 .  6
    "2-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1 .  7
    "4-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1 .  8
    "7-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1 .  9
    "7-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1 .  7
    "9-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1 .  8
    "11-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 12
    "12-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 13
    "14-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 14
    "15-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 15
    "18-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 16
    "24-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 .  9
    end
    Thank you in advance,
    Adrian

  • #2
    Sorry, I had to loop. I hope someone else may have a more elegant way. The command -rangestat- would be the focus of this code. If you need to loop that through different product codes, you may have to wrap another loop around it. Also, the date should be formatted.

    Code:
    drop var7
    gen date1 = date(orderdate, "DMY", 2020)
    format date1 %td
    
    local max = _N
    gen wanted2 = .
    foreach x of numlist 1/`max'{
        capture drop get get_sum
        gen get = actualsoldprice >= actualsoldprice[`x']
        rangestat (sum) get, interval(date -90 -1)
        replace wanted2 = get_sum if _n == `x'
    }
    replace wanted2 = . if sale_at_or_above_refprice == .
    After running this, I think your "wanted" variable has some internal inconsistencies:
    First, case 2 should be 0 and not a missing. It's a sale, and there was no matching/exceeding price in -90 days.
    Second, the double cases in Nov 2 and Nov 7 need to be reconciled. They are currently treated differently. If you start counting from -1 back to -90, then it should be 6, 6, for Nov 2, and not 6, 7. If you count from today (0) back to -90 day, then Nov 7 should be 10, 7 instead of 9, 7.

    Comment


    • #3
      Dear Ken,

      Thank you very much for your response. You are right regarding "wanted" in case 2, a zero would be appropriate as you pointed out. You are also right on the second point, greatly appreciate going beyond my initial question.

      I am trying to run this for millions of observations for various product IDs (productcode). Could you help me adjust your code for this purpose? Would the addition of by() in rangestat be sufficient?
      Last edited by Adrian Robles; 03 Jun 2021, 16:37.

      Comment


      • #4
        Originally posted by Adrian Robles View Post
        I am trying to run this for millions of observations for various product IDs (productcode). Could you help me adjust your code for this purpose? Would the addition of by() in rangestat be sufficient?
        Could you use -dataex- again and provide another data set with at least 2 products?

        Comment


        • #5
          Cross-posted at https://stackoverflow.com/questions/...previous-90-da

          Please note our policy on cross-posting, which is that you are asked to tell us about it. https://www.statalist.org/forums/help#crossposting

          Comment


          • #6
            Ken - Below an updated data set with multiple products. I was also thinking about the logic and include a "new_wanted" which takes into account dates that repeat with differing prices.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str9 orderdate str16 productcode str10 productcategory byte(websiteprice actualsoldprice sale_at_or_above_refprice wanted new_wanted)
            "3-Jan-20"  "MZZ32819-564-282" "Mens Jeans" 40 25 .  .  .
            "8-Jan-20"  "MZZ32819-564-282" "Mens Jeans" 40 40 1  0  0
            "12-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1  1  1
            "12-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 28 .  .  .
            "18-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 24 .  .  .
            "20-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 30 .  .  .
            "27-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 25 .  .  .
            "11-Oct-20" "MZZ32819-564-282" "Mens Jeans" 40 20 .  .  .
            "19-Oct-20" "MZZ32819-564-282" "Mens Jeans" 35 24 .  .  .
            "2-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1  6  6
            "2-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1  6  6
            "4-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1  8  7
            "7-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 14 14 1  9  8
            "7-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1  7  7
            "9-Nov-20"  "MZZ32819-564-282" "Mens Jeans" 20 20 1  8  9
            "11-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 12 10
            "12-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 13 11
            "14-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 14 12
            "15-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 15 13
            "18-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 16 14
            "24-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1  9  9
            "6-Jan-20"  "ADDZ4449-524-645" "Mens Bags"  60 50 .  .  .
            "11-Jan-20" "ADDZ4449-524-645" "Mens Bags"  70 60 .  .  .
            "12-Feb-20" "ADDZ4449-524-645" "Mens Bags"  60 60 1  .  1
            "12-Jul-20" "ADDZ4449-524-645" "Mens Bags"  60 50 .  .  .
            "18-Sep-20" "ADDZ4449-524-645" "Mens Bags"  50 55 1  .  1
            "20-Sep-20" "ADDZ4449-524-645" "Mens Bags"  50 45 .  .  .
            "20-Sep-20" "ADDZ4449-524-645" "Mens Bags"  66 45 .  .  .
            "12-Oct-20" "ADDZ4449-524-645" "Mens Bags"  55 60 1  .  1
            "19-Oct-20" "ADDZ4449-524-645" "Mens Bags"  60 60 1  .  1
            "2-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  70 73 1  .  0
            "2-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  60 56 .  .  .
            "4-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  60 60 1  .  3
            "7-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  50 45 .  .  .
            "7-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  66 66 1  .  1
            "9-Nov-20"  "ADDZ4449-524-645" "Mens Bags"  60 56 .  .  .
            "11-Nov-20" "ADDZ4449-524-645" "Mens Bags"  60 76 1  .  5
            "12-Nov-20" "ADDZ4449-524-645" "Mens Bags"  60 71 1  .  6
            "13-Nov-20" "ADDZ4449-524-645" "Mens Bags"  60 26 .  .  .
            "15-Nov-20" "ADDZ4449-524-645" "Mens Bags"  65 70 1  .  4
            "15-Nov-20" "ADDZ4449-524-645" "Mens Bags"  67 70 1  .  3
            "22-Nov-20" "ADDZ4449-524-645" "Mens Bags"  56 70 1  .  9
            end
            Nick - My apologies for not disclosing the cross-post. I will make sure to re-read forum guidelines and rules. Thank you

            Comment


            • #7
              Since the old code works I'd rather not revise it. This is a bit "brute", subsetting a new dataset for each product code, process, and then append them to a blank file to grow it. It works, but there would be a better way.

              Code:
              * Create a proper date variable
              gen date1 = date(orderdate, "DMY", 2020)
              format date1 %td
              * Save a full file for looping
              save tempfull, replace
              * Extract the unique product codes:
              levelsof productcode, local(pcode)
              * Create an empgy shell for loop appending
              drop if _n >= 1
              save tempshell, replace
              
              * Foreach loop
              foreach pc of local pcode{
              use tempfull, clear            // Call the full file
              keep if productcode == "`pc'"  // Extract unique product, one by one
              * Run the counting process
              gen wanted2 = .
              count
              foreach x of numlist 1/`r(N)'{
                  capture drop get get_sum
                  gen get = actualsoldprice >= actualsoldprice[`x']
                  rangestat (sum) get, interval(date1 -90 -1)
                  replace wanted2 = get_sum if _n == `x'
              }
              replace wanted2 = . if sale_at_or_above_refprice == .
              append using tempshell        // Append the shell
              save tempshell, replace       // Save the shell, it'd be appended again next loop
              }
              
              capture drop get get_sum      // Clean up
              erase tempfull.dta            // Erase old full file
              
              * Check results
              list date1 productcode productcategory websiteprice actualsoldprice new_wanted wanted2, sepby(productcode)
              Again, what this code got is a bit different than your "new_wanted", please check the discrepancies.

              Code:
                   +--------------------------------------------------------------------------------------+
                   |     date1        productcode   productc~y   websit~e   actual~e   new_wa~d   wanted2 |
                   |--------------------------------------------------------------------------------------|
                1. | 03jan2020   MZZ32819-564-282   Mens Jeans         40         25          .         . |
                2. | 08jan2020   MZZ32819-564-282   Mens Jeans         40         40          0         0 |
                3. | 12jan2020   MZZ32819-564-282   Mens Jeans         40         40          1         1 |
                4. | 12sep2020   MZZ32819-564-282   Mens Jeans         40         28          .         . |
                5. | 18sep2020   MZZ32819-564-282   Mens Jeans         40         24          .         . |
                6. | 20sep2020   MZZ32819-564-282   Mens Jeans         50         30          .         . |
                7. | 27sep2020   MZZ32819-564-282   Mens Jeans         50         25          .         . |
                8. | 11oct2020   MZZ32819-564-282   Mens Jeans         40         20          .         . |
                9. | 19oct2020   MZZ32819-564-282   Mens Jeans         35         24          .         . |
               10. | 02nov2020   MZZ32819-564-282   Mens Jeans         20         20          6         6 |
               11. | 02nov2020   MZZ32819-564-282   Mens Jeans         14         14          6         6 |
               12. | 04nov2020   MZZ32819-564-282   Mens Jeans         14         14          7         8 |
               13. | 07nov2020   MZZ32819-564-282   Mens Jeans         14         14          8         9 |
               14. | 07nov2020   MZZ32819-564-282   Mens Jeans         20         20          7         7 |
               15. | 09nov2020   MZZ32819-564-282   Mens Jeans         20         20          9         8 |
               16. | 11nov2020   MZZ32819-564-282   Mens Jeans         14         14         10        12 |
               17. | 12nov2020   MZZ32819-564-282   Mens Jeans         14         14         11        13 |
               18. | 14nov2020   MZZ32819-564-282   Mens Jeans         14         14         12        14 |
               19. | 15nov2020   MZZ32819-564-282   Mens Jeans         14         14         13        15 |
               20. | 18nov2020   MZZ32819-564-282   Mens Jeans         14         14         14        16 |
               21. | 24nov2020   MZZ32819-564-282   Mens Jeans         20         20          9         9 |
                   |--------------------------------------------------------------------------------------|
               22. | 06jan2020   ADDZ4449-524-645    Mens Bags         60         50          .         . |
               23. | 11jan2020   ADDZ4449-524-645    Mens Bags         70         60          .         . |
               24. | 12feb2020   ADDZ4449-524-645    Mens Bags         60         60          1         1 |
               25. | 12jul2020   ADDZ4449-524-645    Mens Bags         60         50          .         . |
               26. | 18sep2020   ADDZ4449-524-645    Mens Bags         50         55          1         0 |
               27. | 20sep2020   ADDZ4449-524-645    Mens Bags         50         45          .         . |
               28. | 20sep2020   ADDZ4449-524-645    Mens Bags         66         45          .         . |
               29. | 12oct2020   ADDZ4449-524-645    Mens Bags         55         60          1         0 |
               30. | 19oct2020   ADDZ4449-524-645    Mens Bags         60         60          1         1 |
               31. | 02nov2020   ADDZ4449-524-645    Mens Bags         70         73          0         0 |
               32. | 02nov2020   ADDZ4449-524-645    Mens Bags         60         56          .         . |
               33. | 04nov2020   ADDZ4449-524-645    Mens Bags         60         60          3         3 |
               34. | 07nov2020   ADDZ4449-524-645    Mens Bags         50         45          .         . |
               35. | 07nov2020   ADDZ4449-524-645    Mens Bags         66         66          1         1 |
               36. | 09nov2020   ADDZ4449-524-645    Mens Bags         60         56          .         . |
               37. | 11nov2020   ADDZ4449-524-645    Mens Bags         60         76          5         0 |
               38. | 12nov2020   ADDZ4449-524-645    Mens Bags         60         71          6         2 |
               39. | 13nov2020   ADDZ4449-524-645    Mens Bags         60         26          .         . |
               40. | 15nov2020   ADDZ4449-524-645    Mens Bags         65         70          4         3 |
               41. | 15nov2020   ADDZ4449-524-645    Mens Bags         67         70          3         3 |
               42. | 22nov2020   ADDZ4449-524-645    Mens Bags         56         70          9         5 |
                   +--------------------------------------------------------------------------------------+
              If you ever feel you found a solution, it'd be good netiquette to share that back as well so that different platform users will not be simultaneously spending time on it.
              Last edited by Ken Chui; 05 Jun 2021, 17:12.

              Comment


              • #8
                Ken - Thank you very much for your help.

                Unfortunately my only option as of now is to limit my dataset as much as possible to only include relevant observations and run the below code. While inefficient for large dataset, it works.

                Always open to hear any suggestions for how the process could be scaled better for large datasets.

                Credit to TheIceBear from SO.

                Code:
                *Start by converting date to Stata date
                gen stata_date = date(orderdate,"DM20Y")
                format stata_date %td
                
                *Sort data and product code as stop conditions in while loop expect them to be sorted
                sort productcode stata_date
                
                *Create varialbe to store result
                gen count_less = .
                
                *Loop over all rows
                count
                forvalue row = 1/`r(N)' {
                    
                    *Only applicable to
                    if sale_at_or_above_refprice[`row'] == 1 {
                        
                        *Set result variable to 0 for this row
                        replace count_less = 0 if _n == `row'
                        
                        *Initate locals used in while loop
                        local true = 1
                        local row_skip = 1
                        local count = 0
                        local last_date = stata_date[`row']
                        
                        *Loop until any stop condition sets local true to 0
                        while `true' == 1 {
                          
                            *Test if row_skip hits top of data set (i.e row 0)
                            if `row'-`row_skip' == 0                                        local true = 0
                            *Test that product is same in compare row
                            else if productcode[`row'] != productcode[`row'-`row_skip']     local true = 0
                            *Test that previous order is within 90 days
                            else if stata_date[`row'] - stata_date[`row'-`row_skip'] > 90   local true = 0
                
                            *Test if actualsoldprice is less thatn old websiteprice
                            else if websiteprice[`row'] <= actualsoldprice[`row'-`row_skip'] {
                                
                                * Each date can only be counted once, so test if date is last date counted
                                if `last_date' != stata_date[`row'-`row_skip'] {
                                    *Compare row fits condition, add 1 to counter
                                    local count = `count' + 1    
                                    
                                    *Update last counted date
                                    local last_date = stata_date[`row'-`row_skip']
                                }
                            }
                            *Skip one more prevuous row
                            local row_skip = `row_skip' + 1
                        }
                        *Add the count result to the result varaible for this row
                        replace count_less = `count' if _n == `row'
                    }
                }
                Last edited by Adrian Robles; 07 Jun 2021, 09:02. Reason: spelling

                Comment


                • #9
                  1. It seems to me that your code does not provide the correct output as your description. For example, in the observation 17 of your example in #6, you count up to 9, while the total number of observations with price equal or greater than its price (70) is just 6. Maybe I am missing something (?), but a clarification is still needed.

                  2. Further clarification for the cases of same date (as raised by Ken Chui in #2) has not been done either.

                  3. Without your further infomation, I have not tried to code, but do have the feeling that a loop would be required. However, the loop by each rows (observations) appears to be inefficient for the large data. A better loop might be feasible.

                  If you are still seeking a solution for this interesting puzzle, share out some more details, then suggestion might be given.

                  Comment

                  Working...
                  X