Mike Lacy requested that I post an example of how I used his suggestion in https://www.statalist.org/forums/for.../general/12850 to extract data from different webpages. We are interested in the effect of sociopolitical events on attendance in the Kontinental Hockey League (KHL). The data are available from https://en.khl.ru/calendar/202/00/ and cover the 08/09 to 22/23 seasons, over 11,000 observations in total.
The issue is that the attendance figures are not shown on the calendar page itself. They sit behind links (click where the scores are displayed), and each link leads to the page for one game.
Therefore, manual extraction would entail opening over 11,000 links and copying the information we want. Instead, we can extract these data automatically using Stata as below (for the 08/09 season). First, I read each page into Stata and export it as a text file. I then parse the text file to retrieve the information that I want. It helps that the pages are largely standardized, so the same code works for all of them.
The results follow in #2.
Code:
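* Set up a frame that collects one row per game
cap frame drop myresults
frame create myresults
frame myresults{
* Placeholder row that fixes the variable types; it is dropped after the loop
set obs 1
gen game=.
gen time=""
gen venue=""
gen attendance=.
}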
tempfile b
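* Loop over the game pages; the loop index is the game ID in the URL
forval i= 21650/22321{
clear
set obs 1
* Read the whole page into a single string
gen s = fileread("https://en.khl.ru/game/160/`i'/protocol/")
* Round-trip through a text file so that the page is split into rows; keep the first column
export delimited using myfile2.txt, replace
import delimited "myfile2.txt", clear
keep v1
* Attendance: the digits on the row that mentions "Spectators"
gen attendance = real(ustrregexra(v1, "[^\d]", "")) if regexm(v1, "Spectators")
* Venue and time sit a fixed number of rows above the attendance row
gen venue1= v1[_n-4] if !missing(attendance)
gen venue2= v1[_n-2] if !missing(attendance)
gen venue= venue1 + venue2
gen time= v1[_n-15] if !missing(attendance)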
gen Game= subinstr(v1,"№", "", 1) if ustrregexm(v1, "Game №")
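* The page title is the source of the two team names
gen teams= subinstr(v1, "<title>Game summary:","", 1) if ustrregexm(v1, "<title>Game summary:")
* Reduce the page to one row holding the first nonmissing value of each variable
collapse (firstnm) attendance venue time Game teams
* Remove the closing </p> tags from venue and keep only hh:mm from time
replace venue= trim(itrim(ustrregexra(venue, "(.*)</p>(.*)</p>$", "$1 $2")))
replace time= trim(ustrregexra(time, ".*(\d{2}:\d{2}).*$", "$1"))
* Game number: the digits that follow "Game" in the heading
gen game= real(ustrregexra(ustrregexra(trim(itrim(Game)),".*Game (.*) </h2>", "$1"), "[^\d]", ""))
* Split the title into home and away teams at the hyphen
gen home= trim(ustrregexra(teams, "(^.*)[-].*", "$1"))
gen away= trim(ustrregexra(teams, "(^.*)[-](.*)[:].*", "$2"))
drop Game teams
* Add this game to the results frame
save `b', replace
frame myresults: append using `b'
}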
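* Switch to the results frame and drop the placeholder row
frame change myresults
drop in 1
rename (time venue attendance) (Time Venue Attendance)
* Label the rows with the season and the stage of the competition
gen season="08/09"
gen which="Regular season"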
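
To see what the two main parsing expressions do without contacting the website, here is a minimal sketch that applies them to two made-up rows. The sample strings are my assumption of what a row might look like and are not copied from the KHL pages; only the regular expressions come from the code above.

```stata
* Hypothetical sample rows, NOT taken from the KHL site
clear
input str60 v1
"<p>17:00</p>"
"<li>Spectators: 5 500</li>"
end

* Attendance: delete every non-digit on the row that mentions "Spectators"
gen attendance = real(ustrregexra(v1, "[^\d]", "")) if regexm(v1, "Spectators")

* Time: keep only the hh:mm part of a row that contains one
gen time = trim(ustrregexra(v1, ".*(\d{2}:\d{2}).*$", "$1")) if ustrregexm(v1, "\d{2}:\d{2}")

list

* Checks on the made-up rows
assert attendance == 5500 in 2
assert time == "17:00" in 1
```

Note that the attendance expression keeps every digit on the row, so it only gives the right number when the row holds no other digits (in a tag attribute, for example). That appears to be the case for the pages parsed above, but it is worth checking if the page layout changes.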
