Hello!
For a university project I am working on a text mining task. I have a .csv file with data on AirBnB listings in Berlin, with a total of 96 variables and around 3000 observations, including, for example, the exact location, size of rooms, price, and the title/description of the offer as displayed on the AirBnB website.
I want to analyse the short title/description texts: What language are they written in (English, German or bilingual)? How many words/characters were used? What is the frequency of certain keywords? I want to break this down by the different districts of Berlin. For example, maybe in district 1 more hosts tend to write English titles to attract foreign tourists? Maybe in district 2 there is a high frequency of location-focused keywords in the title, whereas in district 3 there is a high frequency of emotional keywords?
I am not a very experienced user of Stata, but after some initial research into this topic, I found that there are ways to do some basic text mining in Stata.
(See this paper: https://poseidon01.ssrn.com/delivery...085069&EXT=pdf or also http://www.stata.com/manuals14/fnstringfunctions.pdf)
I do not really understand the approach, though. In particular, the word counting command does not seem to work for me.
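To make the goal concrete, here is a minimal sketch of the kind of counting I am after, put together from functions listed in the string-functions manual linked above. The variable names title and district and the keyword "central" are placeholders made up for illustration (the real names in my file differ), and I am not sure this is the right approach:

```stata
* Assumption: the .csv file is already loaded (e.g. with import delimited).
* Assumption: "title" is a string variable and "district" identifies the
* Berlin district; both names are hypothetical placeholders.

* Number of characters in each title (Unicode-aware, needs Stata 14 or newer)
generate n_chars = ustrlen(title)

* Number of whitespace-separated words in each title
generate n_words = wordcount(title)

* 1 if the placeholder keyword occurs anywhere in the title, 0 otherwise
* (the title is lower-cased first, so the match ignores case)
generate byte kw_central = ustrpos(ustrlower(title), "central") > 0

* Average length and keyword share per district
tabstat n_chars n_words kw_central, by(district) statistics(mean n)
```

Since kw_central is a 0/1 indicator, its mean per district would be the share of titles containing the keyword. This does not cover the language question (English, German or bilingual) at all.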
As an alternative to using base Stata, I found out that I could try to do it with WordStat and QDA Miner. I would prefer a simple Stata solution though, without having to learn these two other programs first.
Has anybody here done a similar simple text mining task with Stata before? Is it even possible? I would highly appreciate any kind of help! :-)
Thank you in advance!
Best
Flor