Hello Stata users,
I am currently trying to scrape data from YouTube by looping over video URLs. However, I am getting an unexpected error during the loop, which prevents it from going through all the URLs.
I am using Stata 14.
Please find my code below (don't be too harsh, I know I have been tinkering with parts of it). There are obviously many more URLs, but I removed most of them from the code below to make it easier to read.
Code:
clear
set more off
local D = c(current_date)
local T = subinstr(c(current_time), ":", "_", .)
cd U:\Audiencejeunes\Webscrapping\
mkdir "`D' `T'"
cd "`D' `T'"
* Seed dataset that every scraped video is appended to
set obs 1
g video = "YouTube"
g vues = "10"
g t = 0
order video vues
save "VA_`c(current_date)'", replace
forvalues repeat=1(1)1 {
    #delimit ;
    foreach video in 2PXEUsz6wHs Am9pavV7q2g 9Z4s-bktMrY RDqrr9GapCk b5BqxjAmJ1M
        AE3QyMf900I Sm5Ai0WRLXw 5nCIZCdkOaY ovDcJ_MNNkE tOGNNS9s6kA Jbc_gCzzitE
        UDaWVM1jEXc qhPe8imp1XM Ga8Wfy-dTCQ {;
    #delimit cr
        di "https://www.youtube.com/watch?v=`video'"
        * Download the watch page; capture hides any error raised by copy
        cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
        * Read the downloaded page as text into v1
        import delimited using "`video'.txt", stringcols(_all) delimiter("þ") varn(noname) clear
        * Keep only the text starting at the view counter
        keep if strpos(v1, "watch-view-count")
        gen vues = substr(v1, strpos(v1,"watch-view-count"),.)
        drop v1
        keep vues
        gen video = "`video'"
        gen time = c(current_time)
        gen date = c(current_date)
        * Add the new rows to the running results file
        append using "VA_`c(current_date)'"
        order time video vues
        save "VA_`c(current_date)'", replace
    }
    replace t = _n
    save "VA_`c(current_date)'", replace
    save "U:\Audiencejeunes\Webscrapping\VA_`c(current_date)'", replace
}
exit, STATA
I do know that ideally:
- the "forvalues repeat" loop is unnecessary, but I did not manage to make the code work using only "foreach ... in". I originally used this loop with a sleep command at the start to repeat the scraping during the day. Because the time needed to execute the code made the schedule drift by a variable amount, I decided to use the Windows Task Scheduler instead.
- there should be a way to keep all the URLs in a separate file that Stata could loop over, instead of typing all the URLs in by hand.
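As a rough illustration of the second point, a minimal sketch of reading the video IDs from a separate text file might look like the following. The file name video_ids.txt (one ID per line) and the macro name id are assumptions made for this example, not part of the original code.

```stata
* Hypothetical input file: video_ids.txt, one YouTube video ID per line
tempname fh
file open `fh' using "video_ids.txt", read text
file read `fh' id
while r(eof) == 0 {
    * Remove stray leading/trailing blanks and skip empty lines
    local id = trim("`id'")
    if "`id'" != "" {
        di "https://www.youtube.com/watch?v=`id'"
        * ... same commands as in the body of the foreach loop,
        *     with `id' used in place of `video'
    }
    * Read the next line last, so that r(eof) is current when while re-checks it
    file read `fh' id
}
file close `fh'
```

This form does not use an outer forvalues loop; the while loop simply runs until the end of the file is reached.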
If you could help me solve these problems too, that would be perfect, but the one that stops me is the following error: "file XXXX.txt not found" (where XXXX is the video ID taken from the URL).
This does not concern one URL in particular. It can appear for a video that was scraped successfully before. So far I have not been able to identify the cause of this error.
I tried to force Stata to continue through the remaining values of the loop by adding the "capture" prefix, as follows:
Code:
di "https://www.youtube.com/watch?v=`video'"
cap copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
cap import delimited using "`video'.txt", stringcols(_all) delimiter("þ") varn(noname) clear
cap keep if strpos(v1, "watch-view-count")
cap gen vues = substr(v1, strpos(v1,"watch-view-count"),.)
cap drop v1
cap keep vues
gen video = "`video'"
gen time = c(current_time)
gen date = c(current_date)
append using "VA_`c(current_date)'"
order time video vues
save "VA_`c(current_date)'", replace
But it led to a result showing the same video at the same time with different numbers of views, which is impossible. I just don't get it.
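One alternative to prefixing every command with capture, sketched here under assumptions, is to capture only the download, inspect the return code _rc, and skip the video when the download fails. The retry count, the 2-second pause and the macro names maxtries, ok, try and rc are hypothetical choices for illustration, and whether this removes the inconsistent rows is untested.

```stata
* Hypothetical retry setting, not taken from the original code
local maxtries 3

foreach video in 2PXEUsz6wHs Am9pavV7q2g {
    local ok 0
    forvalues try = 1/`maxtries' {
        capture copy "https://www.youtube.com/watch?v=`video'" "`video'.txt", replace
        * Store the return code immediately; 0 means copy succeeded
        local rc = _rc
        if `rc' == 0 {
            local ok 1
            continue, break
        }
        di as error "download of `video' failed with return code `rc' (attempt `try')"
        * Wait 2 seconds before trying again
        sleep 2000
    }
    if `ok' == 0 {
        * No file was downloaded: move on to the next video instead of importing
        continue
    }
    * ... import, keep, append and save steps as in the original loop
}
```

With this layout, the import and the later steps only run when a file was actually downloaded, so any remaining error in them stays visible instead of being hidden by capture.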
Thanks for your time!
Best,
Adrien.