
  • Error message during a loop "file XXX.txt not found" causing the loop to stop - STATA 14

    First of all, sorry for the double post. I assume my first question was not defined clearly enough for anyone to answer, so here is a second attempt!

    What I am trying to do: I am trying to web-scrape the number of views of a given list of videos on the YouTube platform throughout the day (every 30 minutes).

    How am I proceeding:
    1. A loop goes through a list of URLs, copying and saving the HTML code of each page into a text file. The text file is then imported into Stata, which keeps the string in which the number of views appears.
    2. Windows' Task Scheduler is used to launch this loop every 30 minutes.
    What is the problem that I am encountering: From time to time, the loop stops because Stata doesn't find the txt file in which the HTML code is saved, as you can see below:
    [Screenshot of the error: pbStata.PNG]

    Here is the code that I am using:

    Code:
    ******** Webscrapping ********
    
    clear
    set more off
       local D = c(current_date)
       local T = subinstr(c(current_time),":","_",.)
       cd U:\Myfile\
       mkdir "`D' `T'"
       cd "`D' `T'"
      
    set obs 1
    g video = "YouTube"
    g vues = "10"
    g t = 0
    order video vues
    save "VA_`c(current_date)'", replace
      
        forvalues repeat=1(1)1 {
    
    #delimit ;    
    foreach video in 2PXEUsz6wHs Am9pavV7q2g 9Z4s-bktMrY RDqrr9GapCk b5BqxjAmJ1M AE3QyMf900I Sm5Ai0WRLXw 5nCIZCdkOaY ovDcJ_MNNkE tOGNNS9s6kA Jbc_gCzzitE UDaWVM1jEXc qhPe8imp1XM Ga8Wfy-dTCQ
                     mz6qQALSzSA oXmo946xIYA vUO1RDgkVgE 6xqPKUx1WOI xAeusyp9wj0 2BWX2lWY584 dqve5hStevY QEeta0MRv4s us4byQZ3wtE eSRYWDyybqg OLGskr4Gzak SNJ48sQWpMw 8gGQPbS036U OofH9leYhxY
                     ZO_M5bBQedI LsCcQ-9-jIM PheVrDBDTL4 unIB1_mdzc4 ci8HQFF6d5A vBYUpkCAF-M iOrffWPE3g8 nmWMdr_Vwj8 NBJx3MK-9zY BGbqxCI5kIA YtxOzQEH6WU TdVkHOMvZDc tdjifiWQcu8 sHojJ3strP0
                     "-RC_f4oEzHc" dgA3PoNiwbY j6_Y77uWtGw 970T1Sd1thc NbC3VOOo1mo IfMuhob6EAU 9uNpkeMlQE0 2Ut97j6-lsE lJ4TMfyeyhk hkPOkWbiqMg hxiQ6M77qN0 DlEdeyd3Pic uqmSV2Wma9U T8goE0yjw2c
                      {;    
                              
    #delimit cr                          
                                di "https://www.youtube.com/watch?v=`video'"
                                cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
    
                                
                                import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
    
                                keep if strpos(v1, "watch-view-count")
                                gen vues = substr(v1, strpos(v1,"watch-view-count"),.)
                                                            
                                drop v1
                                keep vues
                                gen video = "`video'"
                                gen time = c(current_time)
                                gen date = c(current_date)
                            
                                
                                append using "VA_`c(current_date)'"
                                order time video vues
                                save "VA_`c(current_date)'", replace
                                
                    }
                                
                                replace t = _n
                            
                                save "VA_`c(current_date)'", replace
                                save "U:\Myfile\VA_`c(current_date)'", replace
                                }
    exit, STATA
    Thanks for your help!

    PS: Link of my previous post : https://www.statalist.org/forums/for...scrapping-data

  • #2
    I don't see the video name from the screenshotted error anywhere in your list of videos. Try adding some debug messages to pinpoint the issue (e.g. di "1" and di "2" before and after the command you expect to fail). I assume it is the import delimited that fails?



    • #3
      Presumably
      Code:
      file `video'.txt not found
      occurs in response to
      Code:
      di "https://www.youtube.com/watch?v=`video'"
      cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
      import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
      because the copy command failed and the file thus does not exist.

      But you are getting no indication that copy failed, and no error message to explain why, because you have blinded yourself, suppressing all output by preceding the copy command with capture, and then you have done nothing to check the return code from the copy command to determine if it worked, or to at least display the error code.

      I suspect you do not understand the implications of using capture and might benefit from rereading the output of help capture, or simply ceasing to use capture until you understand what problem is causing copy to fail.

      Once you understand why copy is failing, then you should add code to check the return code supplied by capture and execute the rest of the commands in your loop (import delimited and onward) only if the return code is zero.
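
A minimal sketch of that return-code check, under stated assumptions: the file names are hypothetical, and a missing local file stands in for a failed download so the example runs without an internet connection. The counters n_ok and n_failed are added for illustration only.

```stata
* Sketch: act on the return code that capture stores in _rc.
* The missing source file below stands in for a download that failed.
tempfile good dest
tempname fh
file open `fh' using "`good'", write text replace
file write `fh' "watch-view-count" _n
file close `fh'

local n_ok = 0
local n_failed = 0
foreach src in "`good'" "no_such_file_for_demo.txt" {
    capture copy "`src'" "`dest'", replace
    * Store _rc at once, before any later capture overwrites it.
    local rc = _rc
    if `rc' != 0 {
        display as error "copy failed for `src' (return code `rc'); skipping"
        local ++n_failed
        continue
    }
    * Only reached when copy succeeded, so the destination file exists.
    confirm file "`dest'"
    local ++n_ok
}
display "succeeded: `n_ok'   failed: `n_failed'"
```

In the loop from post #1, the import delimited command and everything after it would take the place of the confirm file line.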



      • #4
        Hi Jesse,

        The reason you do not see the video name is simply that I shortened the full list of videos (around 280) to make the post easier to read.

        Sorry, I do not understand precisely what you would like me to do with the display command. Assuming that it is the import delimited that fails, would it lead to this:

        Code:
        di "https://www.youtube.com/watch?v=`video'"
                                    cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
        
                                    
                                    di import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
        
                                    di keep if strpos(v1, "watch-view-count") 
                                    gen vues = substr(v1, strpos(v1,"watch-view-count"),.)
        Is that what you're expecting ?

        What I still do not understand is that the loop works perfectly most of the time, and only from time to time produces the error mentioned above, on different videos and at different times of the day.



        • #5
          Hi William,

          I also suspected the copy command. However, since this loop ran successfully earlier in the day, shouldn't it find the previous version of the text file and import that instead?

          Anyway, as you advised, I will remove the capture command and investigate further.

          As you might guess, I have been tinkering with parts of an existing piece of code, which is why the capture command is used.



          • #6
            Code:
            di "https://www.youtube.com/watch?v=`video'"
            cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
            if _rc != 0 di _rc
            di "1"
            import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
            di "2"
            More along these lines



            • #7
              It seems likely that the failing copy command has first deleted the existing text file in preparation for replacing it.
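
If that is indeed what happens (it is a conjecture in this thread, not confirmed behavior), one possible workaround is to download into a temporary file and overwrite the real text file only after the download succeeded. A hedged sketch, using one video ID from post #1; the macro name fresh is hypothetical:

```stata
* Sketch: download to a temporary file first, so that a failed copy
* cannot remove the last good version of "`video'.txt".
local video "2PXEUsz6wHs"
tempfile fresh
capture copy "https://www.youtube.com/watch?v=`video'" "`fresh'", replace
local rc = _rc
if `rc' == 0 {
    * The download worked: only now replace the file used by the import.
    copy "`fresh'" "`video'.txt", replace
}
else {
    display as error "download failed for `video' (return code `rc'); any previous file is left untouched"
}
```

Whether importing an older version of the page is desirable is a separate question, since it would record a stale view count under a new timestamp.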



              • #8
                Okay, I have included Jesse's lines and William's advice on capture.

                I'll be running two versions of the loop (I don't know yet what _rc is):

                Code:
                di "https://www.youtube.com/watch?v=`video'"
                cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
                                            
                if _rc != 0 di _rc
                di "1"
                import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
                di "2"
                and


                Code:
                di "https://www.youtube.com/watch?v=`video'"
                copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
                
                import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear
                I'll keep you posted on further results. Thanks !
                Last edited by Adrien Haidar; 10 Oct 2017, 10:18.



                • #9
                  Hi again,

                  Running the code without the cap led me to the following error this morning executing the loop for the 2nd time :

                  [Screenshot of the error: PbStata3.PNG]

                  I found some info:

                  "- Check the permissions of the Stata installation directory to which you are downloading into and make sure you have write permissions. If you do not know which directory Stata is installed into, open Stata and type sysdir in your Stata Command Window. The first path, STATA, is the path to the Stata installation directory.
                  - A filesystem error occurred during input or output. This typically indicates a hardware or operating system failure, although it is possible that the disk was merely full and this state was misinterpreted as an I/O error."

                  Since I have write permissions and the disk is far from full, I suspect a hardware or operating system failure... Would you conclude the same, or am I missing something?

                  Best,
                  Adrien



                  • #10
                    Can you run this with

                    set trace on
                    set tracedepth 1

                    (might need tracedepth 2)

                    This will give you more info on what's going on. At the moment we can't tell if it's the copy or the import command that is failing. If it's the import command, you might be able to fix the issue by adding a sleep 1000 between the copy and the import command.
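
A short sketch of where these settings could go. The two display lines are placeholders for the real copy and import delimited commands, and the video IDs are taken from post #1:

```stata
* Sketch: switch tracing on around the loop and pause between the two steps.
set trace on
set tracedepth 1
foreach video in 2PXEUsz6wHs Am9pavV7q2g {
    display "copy step for `video' would run here"
    * Pause for 1000 milliseconds (one second) before the import.
    sleep 1000
    display "import step for `video' would run here"
}
* Switch tracing off again so later output stays readable.
set trace off
```

With tracing on, each command is echoed twice: once as written (lines starting with -) and once with macros expanded (lines starting with =), so the last echoed command before an error message is the one that failed.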



                    • #11
                      Hi again,

                      Here is the error I get by adding your code (I used tracedepth 2).



                      From what I understand, it seems that it is the copy command that is failing, right?

                      Best,
                      Adrien



                      • #12
                        I cannot see your image. Can you just copy paste the output into a code block (like you did with the dofile)?



                        • #13
                          Here it is !

                          Code:
                          - save "VA_`c(current_date)'", replace
                          = save "VA_12 Oct 2017", replace
                          file VA_12 Oct 2017.dta saved
                          - }
                          - di "https://www.youtube.com/watch?v=`video'"
                          = di "https://www.youtube.com/watch?v=mrKEEtiPwVM"
                          https://www.youtube.com/watch?v=mrKEEtiPwVM
                          - copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
                          = copy "https://www.youtube.com/watch?v=mrKEEtiPwVM" "mrKEEtiPwVM.txt", replace
                          
                          
                          I/O error
                          
                          
                          
                          import delimited using "`video'.txt", stringcols(_all) delimiter("þ") varn(noname) clear
                          keep if strpos(v1, "watch-view-count")
                          gen vues = substr(v1, strpos(v1,"watch-view-count"),.)
                          drop v1
                          keep vues
                          gen video = "`video'"
                          gen time = c(current_time)
                          gen date = c(current_date)       
                          append using "VA_`c(current_date)'"
                          order time video vues
                          save "VA_`c(current_date)'", replace
                          }



                          • #14
                            Is it possible your internet connection is not up 100% of the time? Alternatively, have you tried copying to a different location? Also, try adding a "sleep 1000" before the copy command. It will slow things down, but it might fix the issue.
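
If the connection does drop now and then, one option is to retry the download a few times before giving up on a video. A rough sketch under assumptions: the limit of three attempts and the one-second pause are arbitrary choices, and the macro names attempt and max_tries are hypothetical.

```stata
* Sketch: retry a failed download a limited number of times.
local video "mrKEEtiPwVM"
local max_tries = 3
local attempt = 0
local rc = 1
while `rc' != 0 & `attempt' < `max_tries' {
    local ++attempt
    capture copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
    local rc = _rc
    if `rc' != 0 {
        display as error "attempt `attempt' of `max_tries' failed (return code `rc')"
        * Wait one second before the next attempt.
        sleep 1000
    }
}
if `rc' == 0 {
    display "downloaded `video' on attempt `attempt'"
}
else {
    display as error "giving up on `video' after `attempt' attempts"
}
```

The import delimited step would then run only when the final return code is zero.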
                            Last edited by Jesse Wursten; 12 Oct 2017, 08:29.



                            • #15
                              It is indeed possible that the internet connection lags a few times during the day. I do not believe that saving to another drive would solve the issue, since it is the only local drive (the others are on a network). I have tried saving to other folders, which did not solve the problem.

                              As you suggested, I'll add a sleep command, but I'll combine it with a cap command. I do not really care if a few videos are not scraped, as long as the timing of the loop is right and the problem does not affect too large a share of the sample.

                              Code:
                              di "https://www.youtube.com/watch?v=`video'"
                              sleep 1000
                              cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
                              di _rc
                              if _rc == 0 {
                                  import delimited using "`video'.txt",  stringcols(_all) delimiter("þ") varn(noname) clear

                                  keep if strpos(v1, "watch-view-count")
                                  gen vues = substr(v1, strpos(v1,"watch-view-count"),.)

                                  drop v1
                                  keep vues
                                  gen video = "`video'"
                                  gen time = c(current_time)
                                  gen date = c(current_date)

                                  append using "VA_`c(current_date)'"
                                  order time video vues
                                  save "VA_`c(current_date)'", replace
                              }
                              Does it seem correct to you?

                              Best,
                              Adrien

