How to get data "Stata-ready"?

Vanter Birad

Join Date: Apr 2014

Posts: 23
#1

How to get data "Stata-ready"?

14 Mar 2018, 08:44

Hi all,

I have not used Stata for long and have now to perform a statistical analysis based on financial and merger data I have gathered. I want to make this data "Stata-ready", because it currently seems to not work when loaded. (summarize does not yield any values etc)

The issue with the data is that, for each firm, I have circa 40 variables per year from 2012 to 2017 (i.e. revenue, EBITDA, market value etc.). Also, there are a lot of datapoints that contain only missing values as well as a mix of string and numerical values and therefore something Stata cannot apparently work with.

What do I need to do (in Stata or Excel?) so that the data becomes usable? For example, that, based on the datastream Code (DSCD) for each firm I have outputs per year that I can analyse?

Please let me know if that sounds vague - I have attached a snapshot of the dataset I havem maybe that makes things clearer.

Thanks in advance
Attached Files

Master Thesis data snapshot.xlsx (11.5 KB, 1 view)

Last edited by Vanter Birad; 14 Mar 2018, 08:46.
Tags: None
Patrick Abi Nader

Join Date: Jun 2014

Posts: 174
#2

14 Mar 2018, 09:13

Hello Vanter,

To share a sample of your data please follow the advice in FAQ 12.2
From your description it sounds as if Stata is reading a numeric variable as a string variable. This could happen for many reasons and usual solutions are very easy to handle in Stata, check out

destring

or

encode

If you share data following the advice in the FAQ I reference above a few of us might be able to provide you with more specific guidance.

Best wishes,
Patrick
Comment

Vanter Birad

Join Date: Apr 2014
Posts: 23

14 Mar 2018, 09:28

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 FinancialVariable str3 Country str6 DSCD str45 CompanyName str48 y2012 str18(y2013 y2014 y2015 y2016) byte DummyTarget
"MV"      "GER" "882284" "A I S" ".5600000000000001"                    "1.17" "2.18" "1.49" ".24" 0
"PE"      "GER" "882284" "A I S" `"$$"ER", 0904, NO DATA AVAILABLE"'    ""     ""     ""     ""    0
"DPS"     "GER" "882284" "A I S" "0"                                    "0"    "0"    "0"    "0"   0
"EPS"     "GER" "882284" "A I S" "0"                                    "0"    "0"    "0"    "0"   0
"WC03040" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"WC08131" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"WC05476" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"WC04601" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"WC02003" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"WC02005" "GER" "882284" "A I S" `"$$"ER", 4540, NO DATA VALUES FOUND"' ""     ""     ""     ""    0
"MV"      "GER" "679821" "A S CREATION TAPETEN" "73.83"                                "114.63"            "98.11"  "81.15000000000001" "79.48"  0
"PE"      "GER" "679821" "A S CREATION TAPETEN" "10.9"                                 "15.3"              "16.7"   "NA"                "NA"     0
"DPS"     "GER" "679821" "A S CREATION TAPETEN" "1.35"                                 ".75"               "1.2"    ".25"               ".6"     0
"EPS"     "GER" "679821" "A S CREATION TAPETEN" "2.26"                                 "2.5"               "1.96"   "0"                 "0"      0
"WC03040" "GER" "679821" "A S CREATION TAPETEN" "7352"                                 "9026"              "6058"   "6070"              "7119"   0
"WC08131" "GER" "679821" "A S CREATION TAPETEN" "64"                                   "64"                "68"     "72"                "67"     0
"WC05476" "GER" "679821" "A S CREATION TAPETEN" "34.408"                               "33.912"            "31.524" "33.808"            "35.011" 0
"WC04601" "GER" "679821" "A S CREATION TAPETEN" "5122"                                 "5679"              "4472"   "4545"              "5014"   0
"WC02003" "GER" "679821" "A S CREATION TAPETEN" "6462"                                 "5612"              "5269"   "6401"              "20548"  0
"WC02005" "GER" "679821" "A S CREATION TAPETEN" "7153"                                 "6644"              "8825"   "16468"
end

Please note that for most of the data, of course, there are numbers with some "NA"s here and there, for a year, where no data is available. Financial Variable is a code from "Datastream", that defines a certain variable such as Market Value or P/E Ratio. DSCD is a unique identifier for each company from "Datastream" as well.

My problem is pretty similar to the one here: https://www.statalist.org/forums/for...atastream-data

I have tried reshaping the data with some unique variables, however have not been successful so far. My overall aim is to analyse the specifities of a firm that was targeted and also targeted by certain investor types.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30114
#4

14 Mar 2018, 12:56

So the fundamental problems with this data are:

1. It is long in variable names and wide in dates, whereas you need precisely the reverse.

2. The numeric data in the y* variables has been imported as string because it is contaminated with comments such as "NA" or some very long strings that say, in essence, the same thing.

The steps are to first clean up the y variables to get rid of those superfluous strings. Then it is a double -reshape-.

Code:

foreach v of varlist y2012-y2016 { replace `v' = "" if strpos(`v', "NO DATA") /// | strpos(`v', "NA") } destring y2012-y2016, replace reshape long y, i(DSCD FinancialVariable) j(year) reshape wide y, i(DSCD year) j(FinancialVariable) string rename y* * rename ear year

The code above will work for your example. But it is possible that the y* variables in your full data set contain additional "garbage" that is not covered by my clean-up code. So you may have to first do something like this:

Code:

foreach v of varlist y2012-2016 { tab `v' if missing(real(`v')) }

to get a complete list of the stuff that has to be purged from those variables. Then modify the -replace `v' = ...- commands accordingly.

Also, if you are going to use country or DSCD as grouping variables for analysis, then you probably need to make them numeric:

Code:

encode country, gen(ncountry) egen long nDSCD = group(DSCD)

Then if you need to -xtset- your data for country or DSCD you can use the numeric versions instead.

The reason I chose to use -egen, group()- for DSCD is because, unlike country, I am concerned that there may be a very large number of distinct DSCD values that might exhaust the capabilities of -encode-. If that's not the case, you can just use -encode- for DSCD as well.

Responding to your "Stata or Excel" in #1, you should never do data management in Excel. For serious work, including a master's thesis, you need to have a complete audit trail of everything you did with your data, from the moment you received it down to the final results. Excel does not provide that. Stata, through the use of do-files and logs does. Excel can be useful for passing data among people who do not share a common statistical package, and it can be helpful in creating displays of data for human eyes. But it is completely unsuitable for data management and analysis in any context except just playing around. (Even that is inadvisable because you can inadvertently lose or corrupt your original data while "playing around" and end up with a changed data file and not even know it.) So the rule with data in Excel should be "look, but don't touch."
1 like
Comment
Vanter Birad

Join Date: Apr 2014

Posts: 23
#5

15 Mar 2018, 06:53

Clyde,

thank you very much. This was most helpful, and after I got rid of other, non-unique variables successful to my problem. Very much appreciated
Comment

Announcement