Hi everyone,
I am a second-year PhD student and I am having trouble with a panel dataset from a social media platform. I want to analyze how certain sides of the platform influence the growth and intensity of its other sides. The data span 2016-2023, and the monthly variables cover the platform's total installed base, broken down by language (so it is the full population, not a sample). They include platform usage, number of providers, providers’ intensity and platform variety (number of categories), among others. The problem is that when I correlate all my variables in Stata, every pairwise correlation exceeds 0.9, and several reach 0.97, 0.98 or 0.99. Is this normal and, if so, how can I correct for it so that I can run a proper regression without multicollinearity problems?
In addition, I checked other data sources to make sure that my information is correct, and it is. I also used an alternative dataset with information on individual platform providers (rather than on the whole platform, as in the dataset described above). In this provider dataset, the variables show low to medium correlations, and certainly not high ones. However, if I aggregate the provider information (which gives a platform-level dataset similar to the one above, in this case covering around 70% of the market), all variables again show correlations above 0.9. For example, Platform usage, Providers intensity and Providers variety have correlations above 0.9 in Stata when aggregated; at the provider level, all their correlations are below 0.24.
I wonder whether this phenomenon is plausible and how I should approach it in order to measure the effects I am interested in. Thanks in advance.
Example with Platform usage (Hours Watched), Providers intensity (Hours Streamed) and Providers variety (Games)
At the provider level:
HoursWatched is the total number of hours that a provider has been watched in a month (performance variable)
HoursStreamed is the total number of hours that a provider has been active on the platform in a month (intensity)
Games is the number of games/categories that a provider has used in a month (variety)
When aggregated (summed by language):
HWsum: total number of hours that all providers speaking a specific language have been watched in a month (performance variable for a specific language)
HSsum: total number of hours that all providers speaking a specific language have been active in a month (intensity of a specific language)
Gsum: total number of games/categories that all providers speaking a specific language have used in a month (variety of a specific language)
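To make the aggregation step concrete, here is a minimal sketch in Python (not Stata, and not the code actually run on the data). The rows are made-up numbers, the `language` and `month` keys are assumed identifiers, and Gsum is assumed to be a plain sum of each provider's Games count, so a category used by several providers is counted more than once.

```python
from collections import defaultdict
from math import sqrt

# Made-up provider-month rows (illustration only, not real platform data):
# (language, month, HoursWatched, HoursStreamed, Games)
rows = [
    ("en", "2016-01", 120.0, 40.0, 3),
    ("en", "2016-01", 15.0, 60.0, 1),
    ("en", "2016-02", 200.0, 35.0, 2),
    ("en", "2016-02", 30.0, 80.0, 4),
    ("es", "2016-01", 50.0, 20.0, 2),
    ("es", "2016-02", 10.0, 45.0, 1),
    ("es", "2016-02", 70.0, 25.0, 3),
]


def aggregate(rows):
    """Sum the provider-level variables within each (language, month) cell."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for language, month, hours_watched, hours_streamed, games in rows:
        cell = totals[(language, month)]
        cell[0] += hours_watched   # HWsum
        cell[1] += hours_streamed  # HSsum
        cell[2] += games           # Gsum (assumed plain sum, may double count)
    return {key: tuple(values) for key, values in totals.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


agg = aggregate(rows)
hw_sum = [v[0] for v in agg.values()]
hs_sum = [v[1] for v in agg.values()]
g_sum = [v[2] for v in agg.values()]

# Correlations at the provider level versus after summing by language-month
r_provider = pearson([r[2] for r in rows], [r[3] for r in rows])
r_aggregated = pearson(hw_sum, hs_sum)
print("provider level  HoursWatched vs HoursStreamed:", round(r_provider, 3))
print("aggregated      HWsum vs HSsum:", round(r_aggregated, 3))
print("aggregated      HWsum vs Gsum:", round(pearson(hw_sum, g_sum), 3))
```

Each aggregated observation is one language-month cell, so every summed variable also grows with the number of providers in that cell.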