Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Summation by month + year

    Hi,

    I have converted a string variable called last_activity containing dates into a date formatted variable called last_activity_year using the following codes:

    Code:
    gen last_activity2 = date(last_activity, "MDY")
    
    format %td last_activity2
    
    gen last_activity_year = year(last_activity2)
    My data now looks something like this:

    Code:
    balance    last_activity2
    11.38    08jun1965
    1602.94    17aug1965
    1200 .       17aug1965
    12.45        17jan1966
    9.07          11apr1967
    48.83        25aug1967
    9144.33    25sep1967
    46.48        31oct1967
    38.04       19dec1967
    40 .          19dec1967
    I want to calculate the sum of balance by month and year. For example, for 19 Dec 1967, I would like to obtain:

    Code:
    38.04+40 = 78.04
    I used the following codes to obtain this but the newly created variable monthly_balance containing the sums is producing incorrect summation output.

    Code:
    gen last_activity_my=string(month(last_activity2),  "%02.0f")+"_"+string(year(last_activity2))
    
    preserve
    
    collapse (sum) monthly_balance = balance, by(last_activity_my)
    Could anyone please help to obtain the desired result? Thanks in advance.

  • #2

    A monthly date from your daily date is just

    Code:
    gen mdate = mofd(last_activity2) 
    but that looks equivalent to your "do-it-yourself" variable. So we're left with the claim that collapse produces incorrect results, which you don't substantiate.

    Comment


    • #3
      Nick wrote while I was testing the following, which should start you in a useful direction.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(balance la2)
        11.38 1985
      1602.94 2055
         1200 2055
        12.45 2208
         9.07 2657
        48.83 2793
      9144.33 2824
        46.48 2860
        38.04 2909
           40 2909
      end
      format %td la2
      
      generate lam = mofd(la2)
      format lam %tm
      list
      collapse (sum) monthly_balance = balance, by(lam)
      list, abbreviate(16)
      Code:
      . list
      
           +-------------------------------+
           | balance         la2       lam |
           |-------------------------------|
        1. |   11.38   08jun1965    1965m6 |
        2. | 1602.94   17aug1965    1965m8 |
        3. |    1200   17aug1965    1965m8 |
        4. |   12.45   17jan1966    1966m1 |
        5. |    9.07   11apr1967    1967m4 |
           |-------------------------------|
        6. |   48.83   25aug1967    1967m8 |
        7. | 9144.33   25sep1967    1967m9 |
        8. |   46.48   31oct1967   1967m10 |
        9. |   38.04   19dec1967   1967m12 |
       10. |      40   19dec1967   1967m12 |
           +-------------------------------+
      
      . collapse (sum) monthly_balance = balance, by(lam)
      
      . list, abbreviate(16)
      
           +---------------------------+
           |     lam   monthly_balance |
           |---------------------------|
        1. |  1965m6             11.38 |
        2. |  1965m8           2802.94 |
        3. |  1966m1             12.45 |
        4. |  1967m4              9.07 |
        5. |  1967m8             48.83 |
           |---------------------------|
        6. |  1967m9           9144.33 |
        7. | 1967m10             46.48 |
        8. | 1967m12             78.04 |
           +---------------------------+
      Or perhaps this is what you wanted.
      Code:
      generate lam = mofd(la2)
      format lam %tm
      list
      bysort lam (la2): egen monthly_balance = total(balance)
      list, abbreviate(16)
      Code:
      . list, abbreviate(16)
      
           +-------------------------------------------------+
           | balance         la2       lam   monthly_balance |
           |-------------------------------------------------|
        1. |   11.38   08jun1965    1965m6             11.38 |
        2. | 1602.94   17aug1965    1965m8           2802.94 |
        3. |    1200   17aug1965    1965m8           2802.94 |
        4. |   12.45   17jan1966    1966m1             12.45 |
        5. |    9.07   11apr1967    1967m4              9.07 |
           |-------------------------------------------------|
        6. |   48.83   25aug1967    1967m8             48.83 |
        7. | 9144.33   25sep1967    1967m9           9144.33 |
        8. |   46.48   31oct1967   1967m10             46.48 |
        9. |   38.04   19dec1967   1967m12             78.04 |
       10. |      40   19dec1967   1967m12             78.04 |
           +-------------------------------------------------+

      Comment


      • #4
        Nick Cox I think I have identified the source of the inaccuracy but I am not sure how to resolve it. In the Data Editor, my data looks like this:

        Code:
         balance    last_activity2
        2.81    01may1953
        1.97    28aug1954
        10.42    14apr1958
        34.90    26sep1959
        26.90    02jan1963
        11.38    08jun1965
        1602.94    17aug1965
        12.45    17jan1966
        9.07    11apr1967
        48.83    25aug1967
        9144.33    25sep1967
        46.48    31oct1967
        38.04    19dec1967
        But when I issue the dataex command, the data looks like this producing very different figures for the variable "balance":

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int balance float last_activity2
        1952 -2436
          19 -1952
          27  -627
        2739   -97
        2337  1097
         303  1985
        1563  2055
         651  2208
        4218  2657
        3177  2793
        4262  2824
        3135  2860
        2870  2909
        end
        format %td last_activity2
        label values balance balance1
        label def balance1 19 "1.97", modify
        label def balance1 27 "10.42", modify
        label def balance1 303 "11.38", modify
        label def balance1 651 "12.45", modify
        label def balance1 1563 "1602.94", modify
        label def balance1 1952 "2.81", modify
        label def balance1 2337 "26.90", modify
        label def balance1 2739 "34.90", modify
        label def balance1 2870 "38.04", modify
        label def balance1 3135 "46.48", modify
        label def balance1 3177 "48.83", modify
        label def balance1 4218 "9.07", modify
        label def balance1 4262 "9144.33", modify
        ------------------ copy up to and including the previous line ------------------

        Listed 13 out of 4759 observations




        Comment


        • #5
          Ok. It seems that the "balance" variable was in blue. Therefore, I dropped it and imported the data again. Now, "balance" is in a string format. I issued the following command to destring it encountering another error:

          Code:
           destring current_balance, replace
          current_balance contains nonnumeric characters; no replace

          Comment


          • #6


            My guess: A variable with numeric content (mostly) has been imported as string — and then pushed through -encode-, by you or someone else. The numeric results of that are worthless. The variable should have been pushed through -destring- — but with some focus on why the variable was misread as string in the first instance.

            For more, see the paper by Clyde Schechter and myself in Stata Journal 18(4).

            EDIT: you got there in the previous. So you need to inspect the variable and find the non-numeric characters.

            tab current_balance if missing(real(current_balance))
            Last edited by Nick Cox; 15 Feb 2019, 10:52.

            Comment


            • #7
              Nick Cox thank you for your response. Please see my post above yours.

              Comment


              • #8
                A comment on post #4, for the benefit of others who read this topic: That is precisely why the Statalist FAQ
                asks for example data produced with dataex.

                In this case, the problem was obvious from the dataex output, which includes complete information about aspects of the data that are often critical to answering the question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help to create a faithful representation of the example to try out their code, which in turn makes it more likely that their answer will actually work with the real data, and this is not possible from a screenshot of data.

                When asking for help with code, always show example data. When showing example data, always use dataex.
                Last edited by William Lisowski; 15 Feb 2019, 11:01.

                Comment


                • #9
                  So you have problems in your string variable current_balance. Try
                  Code:
                  browse if missing(real(current_balance))
                  to see the observations with bad data.

                  Comment


                  • #10
                    Thank you all. Figured it out. One of the numbers had a comma removing which solved the problem. I issued the browse command previously, but I somehow overlooked the comma.

                    Comment

                    Working...
                    X