  • Multiple problems - Loop webscrapping data

    Hello Stata users,

    I am currently trying to scrape data from YouTube by looping over a list of URLs. However, I am getting an unexpected error during the loop, which prevents it from going through all the URLs.

    I am using Stata 14.

    Please find my code below (don't be too harsh, I know I have been tinkering with parts of it). There are obviously many more URLs, but I removed most of them from the code below to make it easier to read.

    Code:
    clear
    set more off

    * Create a working directory named after the current date and time
    local D = c(current_date)
    local T = subinstr(c(current_time), ":", "_", .)
    cd U:\Audiencejeunes\Webscrapping\
    mkdir "`D' `T'"
    cd "`D' `T'"

    * Seed the results file with one placeholder row
    set obs 1
    g video = "YouTube"
    g vues = "10"
    g t = 0
    order video vues
    save "VA_`c(current_date)'", replace

    forvalues repeat = 1(1)1 {

    #delimit ;
        foreach video in 2PXEUsz6wHs Am9pavV7q2g 9Z4s-bktMrY RDqrr9GapCk b5BqxjAmJ1M AE3QyMf900I Sm5Ai0WRLXw 5nCIZCdkOaY ovDcJ_MNNkE tOGNNS9s6kA Jbc_gCzzitE UDaWVM1jEXc qhPe8imp1XM Ga8Wfy-dTCQ {;
    #delimit cr

            * Download the page of the current video to a local text file
            di "https://www.youtube.com/watch?v=`video'"
            cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace

            * Read the page as one string column and keep the view-count line
            import delimited using "`video'.txt", stringcols(_all) delimiter("þ") varn(noname) clear

            keep if strpos(v1, "watch-view-count")
            gen vues = substr(v1, strpos(v1, "watch-view-count"), .)

            drop v1
            keep vues
            gen video = "`video'"
            gen time = c(current_time)
            gen date = c(current_date)

            * Add the new row to the results collected so far
            append using "VA_`c(current_date)'"
            order time video vues
            save "VA_`c(current_date)'", replace

        }

        replace t = _n

        save "VA_`c(current_date)'", replace
        save "U:\Audiencejeunes\Webscrapping\VA_`c(current_date)'", replace
    }
    exit, STATA
    I do know that ideally:
    - the "forvalues repeat" command is unnecessary, but I did not succeed in making it work using only "foreach in". I was using this loop with a sleep command at the start to repeat the loop during the day. As I observed a variable time shift because of the time required to execute the code, I decided to use the Windows Task Scheduler instead.
    - there should be a way of keeping all the URLs in another file that Stata could go through, instead of typing all the URLs in manually.

    If you could help me solve these problems too, that would be perfect, but the one that stops me is the following: "file XXXX.txt (the URL) not found".

    [Screenshot: pbStata.PNG]

    This does not concern one URL specifically. It can appear for a video that was already scraped successfully before. So far I have not been able to identify the reason for this error.

    I tried to force Stata to continue through the other values of the loop by adding the "capture" command as follows:

    Code:
    di "https://www.youtube.com/watch?v=`video'"
    cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace

    cap import delimited using "`video'.txt", stringcols(_all) delimiter("þ") varn(noname) clear

    cap keep if strpos(v1, "watch-view-count")
    cap gen vues = substr(v1, strpos(v1, "watch-view-count"), .)

    cap drop v1
    cap keep vues
    gen video = "`video'"
    gen time = c(current_time)
    gen date = c(current_date)

    append using "VA_`c(current_date)'"
    order time video vues
    save "VA_`c(current_date)'", replace
    But it led me to this result:

    [Screenshot: pbStata2.PNG]


    Same video, same time, different numbers of views, which is impossible. I just don't get it.

    Thanks for your time!

    Best,
    Adrien.

  • #2
    Hi guys, in case someone would like to answer, I have written another post and tried to be more precise about my problem.

    https://www.statalist.org/forums/for...-stop-stata-14

    Best!
    Adrien
