Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating a user journey - can Stata do this?

    Hi all,

    I think I've hit a bit of a limit with my Stata knowledge, and I'm hoping that you may be able to help me (or guide in terms of resources/help guides!).

    In short, I have data from a website. This has a timestamp recorded for each user, when they land on a particular page. For some, this is quite simple - they'll go from page 1, to page 2, to page 3, and then to page 4 (as with Andrew in the code below). For others, they may go to a later page before an earlier one (as with James, below, who has gone to Page 2 before Page 1) or they'll just land on a few pages (Manisha and Lizzie).

    Code:
     
    Name Page 1 Page 2 Page 3 Page 4
    Andrew 1200 1300 1400 1500
    James 1300 1100 1600 1730
    Lizzie 1000 1900
    Manisha 830 1000
    What I'd like to do is to be able to show how many users took a particular path. This may be through creating dummy variables for each of the timestamps (which I can do with the !missing function). However, the struggle is how to then have them in sequential order - so, being able to show the total users who took a particular route through the site.

    I hope the above is somewhat clear. If not, I'm more than happy to clarify! Hopefully this is something that Stata can do, but I'm far from being able to think which function would best work here. Grateful for any help!







  • #2
    This will do it:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str7 name int(page1 page2 page3 page4)
    "Andrew"  1200 1300 1400 1500
    "James"   1300 1100 1600 1730
    "Lizzie"  1000    . 1900    .
    "Manisha"    .  830    . 1000
    end
    
    reshape long page, i(name) j(page_num)
    rename page timestamp
    by name (timestamp), sort: gen visit_order = _n
    reshape wide page_num timestamp, i(name) j(visit_order)
    egen path = concat(page_num*), punct(", ")
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done in this response. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Hi Clyde,

      My sincere apologies for the delay in replying to you, I've been a bit unwell. This is incredibly, incredibly helpful. Do you have a PayPal/Amazon account I can send you $10 or something as a small token of thanks? As an aside, this works fantastically on the test data that I posted but the format of my timestamps is: "2018-07-18T00:19:42.849108+02:00" (everything after the '.' following 42 can be dropped) so this isn't working. I've spent quite a bit of time with the datetime help file, but am lost with regard to turning this into a readable date and time. Any advice at all?

      Again, huge thanks.

      Comment


      • #4
        Dates and times are complicated. It takes a lot of working with them to fully get the hang of it. And even those of us who use them often still have to refer back to the documentation frequently. There's just so much detail to keep track of. Here's how to convert this to a numeric variable that will work with the code: evidently you will need to do this separately for each of your timestamp variable. The code here just illustrates the technique for a single variable starting from the format in your example in #5:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str33 string_date_time
        "2018-07-18T00:19:42.849108+02:00 "
        end
        
        gen cleaned = substr(string_date_time, 1, 19)
        replace cleaned = subinstr(cleaned, "T", " ", 1)
        gen double timestamp = clock(cleaned, "YMDhms")
        format timestamp %tc
        Important: Don't forget the -double- in that -gen double timestamp = ...- command. If you leave that out, you will get a float, which does not have enough bits to contain a date-time variable without loss of precision!

        Moral of the story: when posting example data, it is always best to make it an excerpt from your real data if that is possible. When you post "simplified" data you run the risk that you will get code that capitalizes on the simplification and then does not work with your real data. If there are restrictions or other circumstances that preclude posting excerpts of real data, then it is best to just make a few minor numerical or spelling modifications to an excerpt of real data but preserve all of the formatting and layout exactly as it is.

        I appreciate the thought, but no compensation is necessary. Everybody who responds here does it for whatever personal satisfaction they derive from it. I do it because I enjoy both the puzzle-solving and the teaching, and because I feel good about "paying forward" in return for all that I have learned from other Statalist responders over the (just shy of) 24 years I have been participating.

        If you really feel compelled to expend some money in return for the advice, then give a donation to your favorite charity.
        Last edited by Clyde Schechter; 25 Jul 2018, 18:29.

        Comment


        • #5
          Hi Clyde,

          That's very generous of you. I've got the code working with my dataset (attached) but it doesn't seem to be functioning correctly. For example, timestamp1 is first for id=5 - however, it's showing as 'page 3' in the sequence. I'm not sure I understand this correctly.

          Also, is it possible to remove missing data (pages) from the path variable? For example, if a user only visited pages 1 and 2, the path variable would read '1,2' and not '1,2,3,4'?

          Thank you again for all of your help!

          ---
          Do-File code:

          Code:
          reshape long pg, i(id) j(page_num)
          rename pg timestamp
          by id (timestamp), sort: gen visit_order = _n
          reshape wide page_num timestamp, i(id) j(visit_order)
          egen path = concat(page_num*), punct(", ")
          Attached Files
          Last edited by Calum Andrew; 26 Jul 2018, 07:46.

          Comment


          • #6
            Some small changes:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str7 name int(page1 page2 page3 page4)
            "Andrew"  1200 1300 1400 1500
            "James"   1300 1100 1600 1730
            "Lizzie"  1000    . 1900    .
            "Manisha"    .  830    . 1000
            end
            
            reshape long page, i(name) j(page_num)
            rename page timestamp
            by name (timestamp), sort: gen visit_order = _n
            drop if timestamp==.
            reshape wide page_num timestamp, i(name) j(visit_order)
            egen path = concat(page_num*), punct(", ")
            replace path = substr(path,1,strpos(path,", .")-1) if strpos(path, ", .")
            But I dont understand the issue with the incorrect ordering. I dont see how that would happen with currnet code.
            It is not possible to troubleshoot this issue with your attachment. We'd have to see the original data (before you run your code) to understand whats going on.
            And note, please attach data examples using dataex, not in any other fomrat. See the FAQ for why and how: https://www.statalist.org/forums/help#stata

            Comment


            • #7
              And here's another way to get the path without including the extraneous non-visited pages:

              Code:
              reshape long page, i(name) j(page_num)
              rename page timestamp
              drop if missing(timestamp)
              by name (timestamp), sort: gen visit_order = _n
              by name (timestamp): gen path = string(page_num) if _n == 1
              by name (timestamp): replace path = path[_n-1] + ", " + string(page_num) if _n > 1
              by name (timestamp): replace path = path[_N]
              reshape wide page_num timestamp, i(name) j(visit_order)
              I strongly agree with Jorrit that I see no reason why things should be in the wrong order, and that there is no way to troubleshoot it without seeing the actual exact code you are using (inside code delimiters) and an example of some actual data that reproduces this problem (using -dataex-).

              Comment


              • #8
                Hi all,

                Thank you for flagging the dataex function, it's very neat. Please find the output below, followed by the code from the Do File. Any guidance would be greatly appreciated.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float id str32(OnboardingDate Th1QuizBaselineDate Th3ms1Firstvisit Th1ms2Firstvisit)
                 29 "2018-06-25T22:48:35.730243+02:00" ""                                 ""                                 ""                                
                126 "2018-06-25T23:18:48.269011+02:00" "2018-06-25T23:24:26.119907+02:00" "2018-06-25T23:30:05.611560+02:00" ""                                
                 98 "2018-06-25T23:13:22.981299+02:00" ""                                 ""                                 ""                                
                 65 "2018-06-25T14:03:33.289884+02:00" ""                                 "2018-06-30T19:43:04.678026+02:00" ""                                
                 66 "2018-06-25T14:03:45.497520+02:00" ""                                 ""                                 ""                                
                 33 "2018-06-25T22:07:56.771278+02:00" "2018-06-25T22:35:32.232071+02:00" "2018-06-25T22:18:58.338577+02:00" "2018-06-25T22:58:13.599613+02:00"
                125 "2018-06-25T17:30:32.238432+02:00" ""                                 ""                                 ""                                
                 82 "2018-06-25T22:41:31.952479+02:00" "2018-06-25T22:46:58.305440+02:00" ""                                 ""                                
                 24 "2018-06-25T15:40:25.432287+02:00" "2018-06-25T15:49:20.314320+02:00" ""                                 ""                                
                142 "2018-06-26T00:17:25.997888+02:00" ""                                 ""                                 ""                                
                 90 "2018-06-26T14:41:49.765142+02:00" ""                                 ""                                 ""                                
                124 "2018-06-26T07:05:27.846883+02:00" ""                                 ""                                 ""                                
                 71 "2018-06-26T12:17:47.611387+02:00" "2018-06-26T12:22:36.150699+02:00" ""                                 ""                                
                  9 "2018-06-26T10:34:02.455977+02:00" ""                                 ""                                 ""                                
                 77 "2018-06-26T08:31:16.544415+02:00" ""                                 ""                                 ""                                
                 78 "2018-06-26T12:54:06.669053+02:00" ""                                 "2018-06-26T14:31:53.635498+02:00" ""                                
                 47 "2018-06-26T11:57:42.167899+02:00" "2018-06-26T12:04:26.058325+02:00" "2018-06-26T12:13:46.742616+02:00" ""                                
                145 "2018-06-26T16:37:28.578593+02:00" ""                                 ""                                 ""                                
                 10 "2018-06-26T12:30:09.098120+02:00" ""                                 ""                                 ""                                
                 72 "2018-06-26T03:47:28.730880+02:00" "2018-06-26T03:44:19.432865+02:00" ""                                 ""                                
                end
                Do-File:

                Code:
                //Now, need to create new timestamps for each one
                //OnboardingDate
                gen cleanedOnboard = substr(OnboardingDate, 1, 19)
                replace cleanedOnboard = subinstr(cleanedOnboard, "T", " ", 1)
                gen double OnboardTime = clock(cleanedOnboard, "YMDhms")
                format OnboardTime %tc
                drop cleanedOnboard
                //Th1QuizBaselineDate
                gen cleanedTh1QuizBaseline = substr(Th1QuizBaselineDate, 1, 19)
                replace cleanedTh1QuizBaseline = subinstr(cleanedTh1QuizBaseline, "T", " ", 1)
                gen double Th1QuizBaselinetime = clock(cleanedTh1QuizBaseline, "YMDhms")
                format Th1QuizBaselinetime %tc
                drop cleanedTh1QuizBaseline
                //Th1ms2Firstvisit
                gen cleanedTh1ms2Firstvisit = substr(Th1ms2Firstvisit, 1, 19)
                replace cleanedTh1ms2Firstvisit = subinstr(cleanedTh1ms2Firstvisit, "T", " ", 1)
                gen double Th1ms2Firstvisittime = clock(cleanedTh1ms2Firstvisit, "YMDhms")
                format Th1ms2Firstvisittime %tc
                drop cleanedTh1ms2Firstvisit
                //Th3ms1Firstvisit
                gen cleanedTh3ms1Firstvisit = substr(Th3ms1Firstvisit, 1, 19)
                replace cleanedTh3ms1Firstvisit = subinstr(cleanedTh3ms1Firstvisit, "T", " ", 1)
                gen double Th3ms1Firstvisittime = clock(cleanedTh3ms1Firstvisit, "YMDhms")
                format Th3ms1Firstvisittime %tc
                drop cleanedTh3ms1Firstvisit
                //Drop original vars and rename
                drop OnboardingDate Th1QuizBaselineDate Th1ms2Firstvisit Th3ms1Firstvisit
                rename OnboardTime pg1
                rename Th1ms2Firstvisittime pg2
                rename Th1QuizBaselinetime pg3
                rename Th3ms1Firstvisittime pg4
                //Code to map user journey
                reshape long pg, i(id) j(page_num)
                rename pg timestamp
                by id (timestamp), sort: gen visit_order = _n
                reshape wide page_num timestamp, i(id) j(visit_order)
                egen path = concat(page_num*), punct(", ")

                Comment


                • #9
                  I tried the original code for generating the path with your new example data, and there were no examples of anything coming out in the wrong order. The code below is the same as in #7, but with the variable names to reflect the names in your new data example. I believe this code works. If you find cases where it produces incorrect results, please post back with examples of that.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float id str32(OnboardingDate Th1QuizBaselineDate Th3ms1Firstvisit Th1ms2Firstvisit)
                   29 "2018-06-25T22:48:35.730243+02:00" ""                                 ""                                 ""                                
                  126 "2018-06-25T23:18:48.269011+02:00" "2018-06-25T23:24:26.119907+02:00" "2018-06-25T23:30:05.611560+02:00" ""                                
                   98 "2018-06-25T23:13:22.981299+02:00" ""                                 ""                                 ""                                
                   65 "2018-06-25T14:03:33.289884+02:00" ""                                 "2018-06-30T19:43:04.678026+02:00" ""                                
                   66 "2018-06-25T14:03:45.497520+02:00" ""                                 ""                                 ""                                
                   33 "2018-06-25T22:07:56.771278+02:00" "2018-06-25T22:35:32.232071+02:00" "2018-06-25T22:18:58.338577+02:00" "2018-06-25T22:58:13.599613+02:00"
                  125 "2018-06-25T17:30:32.238432+02:00" ""                                 ""                                 ""                                
                   82 "2018-06-25T22:41:31.952479+02:00" "2018-06-25T22:46:58.305440+02:00" ""                                 ""                                
                   24 "2018-06-25T15:40:25.432287+02:00" "2018-06-25T15:49:20.314320+02:00" ""                                 ""                                
                  142 "2018-06-26T00:17:25.997888+02:00" ""                                 ""                                 ""                                
                   90 "2018-06-26T14:41:49.765142+02:00" ""                                 ""                                 ""                                
                  124 "2018-06-26T07:05:27.846883+02:00" ""                                 ""                                 ""                                
                   71 "2018-06-26T12:17:47.611387+02:00" "2018-06-26T12:22:36.150699+02:00" ""                                 ""                                
                    9 "2018-06-26T10:34:02.455977+02:00" ""                                 ""                                 ""                                
                   77 "2018-06-26T08:31:16.544415+02:00" ""                                 ""                                 ""                                
                   78 "2018-06-26T12:54:06.669053+02:00" ""                                 "2018-06-26T14:31:53.635498+02:00" ""                                
                   47 "2018-06-26T11:57:42.167899+02:00" "2018-06-26T12:04:26.058325+02:00" "2018-06-26T12:13:46.742616+02:00" ""                                
                  145 "2018-06-26T16:37:28.578593+02:00" ""                                 ""                                 ""                                
                   10 "2018-06-26T12:30:09.098120+02:00" ""                                 ""                                 ""                                
                   72 "2018-06-26T03:47:28.730880+02:00" "2018-06-26T03:44:19.432865+02:00" ""                                 ""                                
                  end
                  
                  //OnboardingDate
                  gen cleanedOnboard = substr(OnboardingDate, 1, 19)
                  replace cleanedOnboard = subinstr(cleanedOnboard, "T", " ", 1)
                  gen double OnboardTime = clock(cleanedOnboard, "YMDhms")
                  format OnboardTime %tc
                  drop cleanedOnboard
                  //Th1QuizBaselineDate
                  gen cleanedTh1QuizBaseline = substr(Th1QuizBaselineDate, 1, 19)
                  replace cleanedTh1QuizBaseline = subinstr(cleanedTh1QuizBaseline, "T", " ", 1)
                  gen double Th1QuizBaselinetime = clock(cleanedTh1QuizBaseline, "YMDhms")
                  format Th1QuizBaselinetime %tc
                  drop cleanedTh1QuizBaseline
                  //Th1ms2Firstvisit
                  gen cleanedTh1ms2Firstvisit = substr(Th1ms2Firstvisit, 1, 19)
                  replace cleanedTh1ms2Firstvisit = subinstr(cleanedTh1ms2Firstvisit, "T", " ", 1)
                  gen double Th1ms2Firstvisittime = clock(cleanedTh1ms2Firstvisit, "YMDhms")
                  format Th1ms2Firstvisittime %tc
                  drop cleanedTh1ms2Firstvisit
                  //Th3ms1Firstvisit
                  gen cleanedTh3ms1Firstvisit = substr(Th3ms1Firstvisit, 1, 19)
                  replace cleanedTh3ms1Firstvisit = subinstr(cleanedTh3ms1Firstvisit, "T", " ", 1)
                  gen double Th3ms1Firstvisittime = clock(cleanedTh3ms1Firstvisit, "YMDhms")
                  format Th3ms1Firstvisittime %tc
                  drop cleanedTh3ms1Firstvisit
                  //Drop original vars and rename
                  drop OnboardingDate Th1QuizBaselineDate Th1ms2Firstvisit Th3ms1Firstvisit
                  rename OnboardTime pg1
                  rename Th1ms2Firstvisittime pg2
                  rename Th1QuizBaselinetime pg3
                  rename Th3ms1Firstvisittime pg4
                  //Code to map user journey
                  reshape long pg, i(id) j(page_num)
                  rename pg timestamp
                  drop if missing(timestamp)
                  by id (timestamp), sort: gen visit_order = _n
                  by id (timestamp): gen path = string(page_num) if _n == 1
                  by id (timestamp): replace path = path[_n-1] + ", " + string(page_num) if _n > 1
                  by id (timestamp): replace path = path[_N]
                  reshape wide page_num timestamp, i(id) j(visit_order)

                  Comment


                  • #10
                    Hi Clyde and Jorrit,

                    Apologies, I was mis-reading some of the output. You're both right, the code is working exactly as expected - which is fantastic!

                    Just to clarify, is the page_num variable linked to the initial variable before the reshape? So, for example, the page_num2 that's generated is pg2 (originally Th1ms2... in this example)?

                    Thank you!

                    Comment


                    • #11
                      I'm not sure I understand your question.

                      In the results, the value of page_num2 is the number of the page that was second on that user's path. So if page_num2 = 3, that means that the time stamp originally shown for pg3 turned out to be the second one when things were put in chronological order.

                      Does that answer your question?

                      Comment


                      • #12
                        I think so! Just to confirm, with a path of '1,4,3,2' (as in the case of id33 using the above data and code), this means the user went from page 1 (variable pg1, originally OnboardTime) to page 4 (pg4; Th3ms1Firstvisittime), then to page 3 (pg3; Th1QuizBaselinetime) and finally to page 2 (pg2; Th1ms2Firstvisittime)?

                        Comment


                        • #13
                          That's right.

                          Comment

                          Working...
                          X